Distributed k-Means with Outliers in General Metrics
Center-based clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the k-means problem, which, given a set P of points from a metric space and a parameter k < |P|, requires determining a subset S of k centers minimizing the sum of all squared distances of points in P from their closest center. A more general formulation, known as k-means with z outliers, introduced to deal with noisy datasets, features a further parameter z and allows up to z points of P (outliers) to be disregarded when computing the aforementioned sum. We present a distributed coreset-based 3-round approximation algorithm for k-means with z outliers in general metric spaces, using MapReduce as the computational model. Our distributed algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term O(γ) away from the one achievable by the best known sequential (possibly bicriteria) algorithm, where γ can be made arbitrarily small. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for general metrics.
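To make the objective concrete, the following minimal Python sketch (the function name, the distance function, and the Euclidean toy example are our own illustrative assumptions, not taken from the paper) evaluates the k-means with z outliers cost of a candidate center set S: the z points farthest from their closest center are discarded before the squared distances are summed.

    # Illustrative sketch only: cost of a candidate center set S under the
    # k-means with z outliers objective. All names here are assumptions.
    import math

    def kmeans_with_outliers_cost(P, S, z, dist):
        """Sum of squared distances from points in P to their closest center
        in S, after discarding the z points with the largest such distances."""
        sq_dists = [min(dist(p, c) for c in S) ** 2 for p in P]
        sq_dists.sort()
        # Drop the z largest squared distances (the outliers) before summing.
        return sum(sq_dists[:len(sq_dists) - z])

    # Hypothetical usage with Euclidean distance on 2-d points:
    dist = lambda p, q: math.dist(p, q)
    P = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (100.0, 100.0)]  # last point is noise
    S = [(0.0, 0.0), (5.0, 5.0)]                              # k = 2 candidate centers
    print(kmeans_with_outliers_cost(P, S, z=1, dist=dist))    # noise point ignored -> 0.01

In this toy run the far-away point is treated as the single allowed outlier, so the reported cost reflects only the remaining points, which is exactly the robustness that the outlier-aware formulation provides over plain k-means.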