Fully Scalable MPC Algorithms for Clustering in High Dimension

07/15/2023
by Artur Czumaj, et al.

We design new algorithms for k-clustering in high-dimensional Euclidean spaces. These algorithms run in the Massively Parallel Computation (MPC) model, and are fully scalable, meaning that the local memory in each machine is n^σ for arbitrarily small fixed σ>0. Importantly, the local memory may be substantially smaller than k. Our algorithms take O(1) rounds and achieve O(1)-bicriteria approximation for k-Median and for k-Means, namely, they compute (1+ε)k clusters of cost within O(1/ε^2)-factor of the optimum. Previous work achieves only poly(log n)-bicriteria approximation [Bhaskara et al., ICML'18], or handles a special case [Cohen-Addad et al., ICML'22]. Our results rely on an MPC algorithm for O(1)-approximation of facility location in O(1) rounds. A primary technical tool that we develop, which may be of independent interest, is a new MPC primitive for geometric aggregation, namely, computing certain statistics on an approximate neighborhood of every data point, which includes range counting and nearest-neighbor search. Our implementation of this primitive works in high dimension, and is based on consistent hashing (aka sparse partition), a technique that was recently used for streaming algorithms [Czumaj et al., FOCS'22].
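The abstract describes the consistent-hashing (sparse-partition) primitive only at a high level. Below is a minimal illustrative sketch in Python of the kind of interface such a geometric-aggregation step exposes: hash every point to a cell of bounded diameter, group points by cell, and aggregate a per-cell statistic (here a simple count, in the spirit of range counting). The ShiftedGridHash class and all parameters are hypothetical, using a randomly shifted uniform grid; the paper's actual consistent hashing construction is more refined and gives dimension-aware guarantees on how many cells a small ball can intersect.

```python
import numpy as np

# Illustrative sketch of a grid-based space partition: every point maps to a
# cell of l2-diameter at most `diameter`, and nearby points tend to share a
# cell. This is NOT the paper's construction; it only shows the interface a
# geometric-aggregation primitive would use after hashing points to machines
# by cell id.

class ShiftedGridHash:
    def __init__(self, diameter: float, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Side length chosen so each cell has l2-diameter <= diameter.
        self.cell_side = diameter / np.sqrt(dim)
        # Random shift makes the partition oblivious to the input's alignment.
        self.shift = rng.uniform(0.0, self.cell_side, size=dim)

    def bucket(self, point: np.ndarray) -> tuple:
        # Cell id = integer coordinates of the shifted grid cell containing the point.
        return tuple(np.floor((point + self.shift) / self.cell_side).astype(int))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    points = rng.normal(size=(1000, 20))   # 1000 points in dimension 20
    h = ShiftedGridHash(diameter=2.0, dim=20)
    counts = {}
    for p in points:
        b = h.bucket(p)
        counts[b] = counts.get(b, 0) + 1   # range-counting-style aggregation per cell
    print(f"{len(counts)} non-empty cells")
```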
