Efficient Distributed Algorithms for the K-Nearest Neighbors Problem
K-nearest neighbors is a basic problem in machine learning with numerous applications. In this problem, given a (training) set of n labeled data points and a query point p, we want to assign a label to p based on the labels of the K points nearest to the query. We study this problem in the k-machine model, a model for distributed large-scale data. In this model, we assume that the n points are distributed (in a balanced fashion) among the k machines, and the goal is to quickly compute the answer when a query point is given to a machine. Our main result is a simple randomized algorithm in the k-machine model that runs in O(log K) communication rounds with high probability of success (regardless of the number of machines k and the number of points n). The message complexity of the algorithm is also small, taking only O(k log K) messages. Our bounds are essentially the best possible for comparison-based algorithms, due to the existence of a lower bound of Ω(log n) communication rounds for finding the median of 2n elements distributed evenly among two processors, shown by Rodeh <cit.>. We also implemented our algorithm and show that it performs well compared to an algorithm (used in practice) that sends the K nearest points from each machine to a single machine, which then computes the answer.
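The practical baseline mentioned above (each machine ships its K local nearest points to one coordinator, which merges them) can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the function names, the list-of-machines representation, and the choice of squared Euclidean distance with majority-vote labeling are all assumptions.

```python
import heapq
from collections import Counter

def _dist2(point, query):
    # Squared Euclidean distance (an illustrative metric choice).
    return sum((a - b) ** 2 for a, b in zip(point, query))

def local_k_nearest(points, query, K):
    # Run on each machine: return its K points closest to the query.
    # `points` is a list of (coordinates, label) pairs.
    return heapq.nsmallest(K, points, key=lambda pl: _dist2(pl[0], query))

def baseline_knn_label(machines, query, K):
    # Coordinator: collect one message of K candidates per machine
    # (k*K points in total), keep the global K nearest, and label the
    # query by majority vote among their labels.
    candidates = []
    for points in machines:
        candidates.extend(local_k_nearest(points, query, K))
    nearest = heapq.nsmallest(K, candidates, key=lambda pl: _dist2(pl[0], query))
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Note that this baseline costs Θ(kK) points of communication to a single machine, whereas the paper's algorithm uses only O(k log K) messages, which is the motivation for the comparison in the experiments.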