k-Means Clustering of Lines for Big Data

03/16/2019
by   Yair Marom, et al.
0

The k-means for lines is a set of k centers (points) that minimizes the sum of squared distances to a given set of n lines in R^d. This is a straightforward generalization of the k-means problem where the input is a set of n points. Related problems minimize sum of (non-squared) distances, other norms, m-estimators or ignore the t farthest points (outliers) from the k centers. We suggest the first provable PTAS algorithms for these problems that compute (1+epsilon)-approximation in time O(n (n)/epsilon^2) for any given epsilon ∈ (0, 1), and constant integers k, d, t ≥ 1, including support for streaming and distributed input. Experimental results on Amazon EC2 cloud and open source are also provided.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset