A Novel Bayesian Cluster Enumeration Criterion for Unsupervised Learning
The Bayesian Information Criterion (BIC) has been widely used for decades to estimate the number of clusters in an observed data set. The original derivation, referred to as the classic BIC, does not take the specific model selection problem at hand into account, which renders it generic. However, very little effort has been made to check its appropriateness for cluster analysis. In this paper, we derive BIC from first principles by formulating the problem of estimating the number of clusters in a data set as the maximization of the posterior probability of candidate models given observations. We provide a general BIC expression that is independent of the data distribution, provided that some mild assumptions are satisfied. This serves as an important milestone for deriving BIC for specific data distributions. Along this line, we provide a closed-form BIC expression for multivariate Gaussian distributed observations. We show that incorporating the clustering problem into the derivation of BIC yields an expression whose penalty term differs from that of the classic BIC. We propose a two-step cluster enumeration algorithm that uses a model-based unsupervised learning algorithm to partition the observed data according to each candidate model, and the proposed BIC to select the model with the optimal number of clusters. The performance of the proposed criterion is evaluated on synthetic and real-data examples. Simulation results show that the proposed criterion outperforms existing BIC-based cluster enumeration methods and is particularly powerful in estimating the number of data clusters when the observations contain unbalanced and overlapping clusters.
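The two-step scheme described above can be sketched in code. The following is a minimal illustration only, using the *classic* BIC as the selection rule (the paper's contribution is a different penalty term, not reproduced here) together with simplifying assumptions not taken from the paper: spherical Gaussian clusters, hard k-means partitions in place of a full model-based clustering algorithm, and NumPy. Step one partitions the data for each candidate number of clusters; step two scores each partition with BIC and keeps the minimizer.

```python
import numpy as np

def kmeans(X, K, iters=100, restarts=5):
    """Basic k-means with a few random restarts; returns (labels, centers)."""
    best = None
    for seed in range(restarts):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(iters):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = d.argmin(axis=1)
            new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centers[k] for k in range(K)])
            if np.allclose(new, centers):
                break
            centers = new
        inertia = ((X - centers[labels]) ** 2).sum()
        if best is None or inertia < best[0]:
            best = (inertia, labels, centers)
    return best[1], best[2]

def classic_bic(X, labels, centers):
    """Classic BIC for a spherical Gaussian mixture under hard assignments."""
    N, p = X.shape
    K = len(centers)
    log_lik = 0.0
    for k in range(K):
        Xk = X[labels == k]
        if len(Xk) == 0:
            continue
        var = max(((Xk - centers[k]) ** 2).sum() / (len(Xk) * p), 1e-9)
        log_lik += (-0.5 * len(Xk) * p * np.log(2 * np.pi * var)
                    - 0.5 * ((Xk - centers[k]) ** 2).sum() / var
                    + len(Xk) * np.log(len(Xk) / N))  # mixing proportions
    # free parameters: K*p means + K spherical variances + (K-1) mixing weights
    q = K * p + K + (K - 1)
    return -2.0 * log_lik + q * np.log(N)

def enumerate_clusters(X, K_max=6):
    """Two-step enumeration: partition per candidate K, then score by BIC."""
    scores = {K: classic_bic(X, *kmeans(X, K)) for K in range(1, K_max + 1)}
    return min(scores, key=scores.get), scores

# Demo on two well-separated synthetic clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.5, size=(100, 2)),
               rng.normal([8, 8], 0.5, size=(100, 2))])
k_hat, scores = enumerate_clusters(X, K_max=5)
print(k_hat)
```

The paper's criterion replaces the `q * np.log(N)` penalty with one derived for the clustering problem itself; the search loop over candidate models is unchanged.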