Phase Transitions in the Detection of Correlated Databases
We study the problem of detecting the correlation between two Gaussian databases 𝖷∈ℝ^n× d and 𝖸^n× d, each composed of n users with d features. This problem is relevant in the analysis of social media, computational biology, etc. We formulate this as a hypothesis testing problem: under the null hypothesis, these two databases are statistically independent. Under the alternative, however, there exists an unknown permutation σ over the set of n users (or, row permutation), such that 𝖷 is ρ-correlated with 𝖸^σ, a permuted version of 𝖸. We determine sharp thresholds at which optimal testing exhibits a phase transition, depending on the asymptotic regime of n and d. Specifically, we prove that if ρ^2d→0, as d→∞, then weak detection (performing slightly better than random guessing) is statistically impossible, irrespectively of the value of n. This compliments the performance of a simple test that thresholds the sum all entries of 𝖷^T𝖸. Furthermore, when d is fixed, we prove that strong detection (vanishing error probability) is impossible for any ρ<ρ^⋆, where ρ^⋆ is an explicit function of d, while weak detection is again impossible as long as ρ^2d→0. These results close significant gaps in current recent related studies.
READ FULL TEXT