Testing Identity of Multidimensional Histograms
We investigate the problem of identity testing for multidimensional histogram distributions. A distribution p: D → R_+, where D ⊆ R^d, is called a k-histogram if there exists a partition of the domain into k axis-aligned rectangles such that p is constant within each such rectangle. Histograms are one of the most fundamental non-parametric families of distributions and have been extensively studied in computer science and statistics. We give the first identity tester for this problem with sub-learning sample complexity in any fixed dimension, along with a nearly-matching sample complexity lower bound. More specifically, let q be an unknown d-dimensional k-histogram distribution and let p be an explicitly given d-dimensional k-histogram. We want to correctly distinguish, with probability at least 2/3, between the case that p = q versus ‖p − q‖_1 ≥ ϵ. We design a computationally efficient algorithm for this hypothesis testing problem with sample complexity O((√k/ϵ^2) log^{O(d)}(k/ϵ)). Our algorithm is robust to model misspecification, i.e., it succeeds even if q is only promised to be close to a k-histogram. Moreover, for k = 2^{Ω(d)}, we show a nearly-matching sample complexity lower bound of Ω((√k/ϵ^2) (log(k/ϵ)/d)^{Ω(d)}) when d ≥ 2. Prior to our work, the sample complexity of the d = 1 case was well understood, but no algorithm with sub-learning sample complexity was known, even for d = 2. Our new upper and lower bounds have interesting conceptual implications regarding the relation between learning and testing in this setting.
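To make the objects in the abstract concrete, here is a minimal Python sketch (not the paper's tester, and all names are hypothetical) of a 2-dimensional 4-histogram on the unit square and the ℓ1 distance between two histograms that happen to share the same partition; in general the two partitions may differ, which is part of what makes the testing problem hard.

```python
# Hypothetical illustration only; this is NOT the algorithm from the paper.
# A d=2 k-histogram on [0,1)^2, stored as (axis-aligned rectangle, density)
# pairs, where a rectangle is ((x0, x1), (y0, y1)).

# A 4-histogram p: constant density on each quadrant of the unit square.
p = [
    (((0.0, 0.5), (0.0, 0.5)), 1.6),  # mass 1.6 * 0.25 = 0.4
    (((0.5, 1.0), (0.0, 0.5)), 0.4),  # mass 0.1
    (((0.0, 0.5), (0.5, 1.0)), 0.4),  # mass 0.1
    (((0.5, 1.0), (0.5, 1.0)), 1.6),  # mass 0.4
]

def l1_distance(p, q):
    """L1 distance between two histograms defined on the SAME partition:
    sum over rectangles of |density difference| * rectangle area."""
    return sum(
        abs(dp - dq) * (x1 - x0) * (y1 - y0)
        for (((x0, x1), (y0, y1)), dp), (_, dq) in zip(p, q)
    )

# q differs from p by shifting mass between two quadrants.
q = list(p)
q[0] = (q[0][0], 1.2)  # mass 0.3
q[1] = (q[1][0], 0.8)  # mass 0.2

print(l1_distance(p, p))  # 0.0
print(l1_distance(p, q))  # 0.2
```

The tester in the paper distinguishes p = q from ‖p − q‖_1 ≥ ϵ using samples from q alone; the plug-in route of first learning q and then computing a distance like the one above is exactly what the sub-learning sample complexity result improves upon.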