A flexible and robust non-parametric test of exchangeability
Many statistical analyses assume that the data points within a sample are exchangeable and that their features have some known dependency structure. Given a feature dependency structure, one can ask whether the observations are exchangeable, in which case we say that they are homogeneous. Homogeneity may be the end goal of a clustering algorithm or a justification for not clustering. Apart from random matrix theory approaches, few general methods provide statistical guarantees of exchangeability or homogeneity without labeled examples from distinct clusters. We propose a fast and flexible non-parametric hypothesis testing approach that takes as input a multivariate individual-by-feature dataset and user-specified feature dependency constraints, without labeled examples, and reports whether the individuals are exchangeable at a user-specified significance level. Our approach controls Type I error across realistic scenarios and handles data of arbitrary dimension. We perform an extensive simulation study to evaluate the efficacy of domain-agnostic tests of stratification, and find that our approach compares favorably in various scenarios of interest. Finally, we apply our approach to post-clustering single-cell chromatin accessibility data and World Values Survey data, and show how it helps to identify drivers of heterogeneity and generate clusters of exchangeable individuals.
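To make the input/output contract described above concrete, the following is a minimal sketch of one possible permutation test of this kind. It is not the authors' procedure: the test statistic (variance of pairwise Euclidean distances between individuals), the resampling scheme (independently permuting rows within each feature block), the one-sided rejection rule, and all function and parameter names (`exchangeability_test`, `blocks`, `n_perm`) are assumptions made for illustration. The sketch further assumes that the user-specified dependency structure takes the form of feature blocks that are mutually independent under the null, so that shuffling individuals separately within each block leaves the null distribution unchanged.

```python
import numpy as np


def pairwise_distance_variance(X):
    """Variance of Euclidean distances over all unordered pairs of rows."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(X.shape[0], k=1)  # each pair counted once
    return dists[upper].var()


def permute_within_blocks(X, blocks, rng):
    """Shuffle rows independently for each block of feature columns.

    Columns in the same block move together, so within-block dependence
    is preserved while any row-level structure shared across blocks
    (for example, cluster membership) is broken.
    """
    Xp = X.copy()
    for cols in blocks:
        perm = rng.permutation(X.shape[0])
        Xp[:, cols] = X[perm][:, cols]
    return Xp


def exchangeability_test(X, blocks, n_perm=199, alpha=0.05, seed=0):
    """Return (p_value, rejected) for the illustrative permutation test.

    Assumption: heterogeneity inflates the statistic, so the test is
    one-sided (large observed values count as evidence against the null).
    """
    rng = np.random.default_rng(seed)
    observed = pairwise_distance_variance(X)
    null = np.array([
        pairwise_distance_variance(permute_within_blocks(X, blocks, rng))
        for _ in range(n_perm)
    ])
    # Add-one correction keeps the p-value strictly positive.
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return float(p_value), bool(p_value < alpha)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, p = 60, 10
    blocks = [[j] for j in range(p)]  # hypothetical: every feature independent

    homogeneous = rng.normal(size=(n, p))
    clustered = rng.normal(size=(n, p))
    clustered[: n // 2] += 3.0  # two well-separated groups of individuals

    print("homogeneous:", exchangeability_test(homogeneous, blocks))
    print("clustered:  ", exchangeability_test(clustered, blocks))
```

In this sketch the significance level `alpha` plays the role of the user-specified level in the abstract, and `blocks` stands in for the feature dependency constraints. A different statistic or a richer dependency specification would change the details but not the overall shape of the test.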