Chi-squared Amplification: Identifying Hidden Hubs

08/12/2016
by Ravi Kannan, et al.

We consider the following general hidden hubs model: an n × n random matrix A with a subset S of k special rows (hubs). Entries in rows outside S are generated from the probability distribution p_0 ∼ N(0, σ_0^2); for each row in S, some k of its entries are generated from p_1 ∼ N(0, σ_1^2), σ_1 > σ_0, and the rest of the entries from p_0. The problem is to identify the high-degree hubs efficiently. This model includes and significantly generalizes the planted Gaussian Submatrix Model, where the special entries are all in a k × k submatrix. There are two well-known barriers: if k ≥ c√(n log n), then row sums alone suffice to find S in the general model. For the submatrix problem, this can be improved by a √(log n) factor to k > c√(n) by spectral or combinatorial methods. In the variant with p_0 = ±1 (with probability 1/2 each) and p_1 ≡ 1, neither barrier has been broken. We give a polynomial-time algorithm to identify all the hidden hubs with high probability for k > n^(0.5−δ) for some δ > 0, when σ_1^2 > 2σ_0^2. The algorithm extends to the setting where planted entries might have different variances, each at least as large as σ_1^2. We also show a nearly matching lower bound: for σ_1^2 < 2σ_0^2, there is no polynomial-time Statistical Query algorithm for distinguishing between a matrix whose entries are all from N(0, σ_0^2) and a matrix with k = n^(0.5−δ) hidden hubs, for any δ > 0. Both the lower bound and the algorithm are related to whether the chi-squared distance of the two distributions diverges. At the critical value σ_1^2 = 2σ_0^2, we show that the general hidden hubs problem can be solved for k ≥ c√(n)(log n)^(1/4), improving on the naive row-sum-based method.
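As a concrete illustration of the model and of the naive baseline the abstract compares against, the sketch below plants k hub rows and recovers them from per-row statistics. Since the planted entries are mean-zero, the sum of squared entries (a chi-squared statistic) is the natural row statistic: background rows concentrate around nσ_0², hub rows around nσ_0² + k(σ_1² − σ_0²). This is a minimal simulation with hypothetical parameters, not the paper's chi-squared amplification algorithm; the threshold choice is an assumption for the demo, and k is taken well above √(n log n) so the baseline succeeds.

```python
# Minimal simulation of the hidden hubs model and the naive baseline:
# detect hubs by the per-row sum of squared entries. All parameter
# values (n, k, sigma0, sigma1, the threshold) are illustrative choices,
# not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

n, k = 2000, 300            # k well above sqrt(n log n) ~ 123, so row stats suffice
sigma0, sigma1 = 1.0, 2.0   # sigma1^2 = 4 > 2 * sigma0^2

# Background matrix: every entry drawn from p_0 = N(0, sigma0^2).
A = rng.normal(0.0, sigma0, size=(n, n))

# Plant hubs: in each of the first k rows, k random columns are
# redrawn from p_1 = N(0, sigma1^2).
hubs = set(range(k))
for i in hubs:
    cols = rng.choice(n, size=k, replace=False)
    A[i, cols] = rng.normal(0.0, sigma1, size=k)

# Per-row chi-squared statistic: expectation is n*sigma0^2 for
# background rows and n*sigma0^2 + k*(sigma1^2 - sigma0^2) for hub rows.
stats = (A ** 2).sum(axis=1)

# Threshold placed between the two expectations (heuristic choice).
threshold = n * sigma0**2 + 0.3 * k * (sigma1**2 - sigma0**2)
detected = set(np.flatnonzero(stats > threshold).tolist())
print(detected == hubs)
```

With these parameters the gap between the two expectations (k(σ_1² − σ_0²) = 900) dwarfs the fluctuation of either statistic (on the order of √n), so the statistic separates cleanly; the point of the paper is that for k = n^(0.5−δ) this gap vanishes relative to the noise, and a fundamentally different algorithm is needed.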
