CDF Transform-Shift: An effective way to deal with inhomogeneous density datasets
Many distance-based algorithms exhibit bias towards dense clusters in inhomogeneous datasets (i.e., those that contain clusters in both dense and sparse regions of the space). For example, density-based clustering algorithms tend to merge neighbouring dense clusters into a single group in the presence of a sparse cluster, while distance-based anomaly detectors have difficulty detecting local anomalies that are close to a dense cluster in datasets that also contain sparse clusters. In this paper, we propose the CDF Transform-Shift (CDF-TS) algorithm, which is based on a multi-dimensional Cumulative Distribution Function (CDF) transformation. It effectively converts a dataset with clusters of inhomogeneous density into one with clusters of homogeneous density, i.e., the data distribution is converted to one in which all locally low/high-density locations become globally low/high-density locations. Thus, after performing the proposed Transform-Shift, a single global density threshold can be used to separate the data into clusters and their surrounding noise points. Our empirical evaluations show that CDF-TS overcomes the shortcomings of existing density-based clustering and distance-based anomaly detection algorithms and significantly improves their performance.
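As a rough illustration of why a CDF transformation equalises density, the sketch below applies a plain one-dimensional empirical CDF (rank) transform to a synthetic dataset with one dense and one sparse cluster. This is not the CDF-TS algorithm itself: the function names, the synthetic data, and the choice of a global one-dimensional transform are all assumptions made for illustration. A global transform of this kind also flattens the low-density gap between the clusters, so it shows only the density-equalising effect, not the preservation of locally low/high-density locations described in the abstract.

```python
import random
from bisect import bisect_right


def empirical_cdf_transform(values):
    """Map each value to its empirical CDF: the fraction of points <= value.

    Hypothetical helper for illustration; a global 1-D rank transform,
    not the multi-dimensional transform proposed in the paper.
    """
    sorted_vals = sorted(values)
    n = len(sorted_vals)
    return [bisect_right(sorted_vals, v) / n for v in values]


def mean_gap(values):
    """Average spacing between consecutive sorted values (inverse of density)."""
    s = sorted(values)
    return (s[-1] - s[0]) / (len(s) - 1)


# Synthetic inhomogeneous data (assumed for this sketch): a dense cluster
# near 0 and a sparse cluster near 5, with 200 points each.
random.seed(0)
dense = [random.gauss(0.0, 0.1) for _ in range(200)]
sparse = [random.gauss(5.0, 1.0) for _ in range(200)]
data = dense + sparse

# Transform all points together, then split them back by cluster.
transformed = empirical_cdf_transform(data)
dense_t, sparse_t = transformed[:200], transformed[200:]

# Ratio of sparse-cluster spacing to dense-cluster spacing, before and after.
ratio_before = mean_gap(sparse) / mean_gap(dense)
ratio_after = mean_gap(sparse_t) / mean_gap(dense_t)

print(f"spacing ratio (sparse/dense) before transform: {ratio_before:.2f}")
print(f"spacing ratio (sparse/dense) after transform:  {ratio_after:.2f}")
```

Before the transform the sparse cluster has a much larger point spacing than the dense one; afterwards the two spacings are approximately equal, which is the sense in which a single global density threshold becomes meaningful.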