A Sharp Bound on the Computation-Accuracy Tradeoff for Majority Voting Ensembles

03/04/2013
by   Miles E. Lopes, et al.

When random forests are used for binary classification, an ensemble of t = 1, 2, ... randomized classifiers is generated, and the predictions of the classifiers are aggregated by majority vote. Due to the randomness in the algorithm, there is a natural tradeoff between statistical performance and computational cost. On one hand, as t increases, the (random) prediction error of the ensemble tends to decrease and stabilize. On the other hand, larger ensembles require greater computational cost for training and making new predictions. The present work offers a new approach for quantifying this tradeoff: given a fixed training set D, let the random variables Err_{t,0} and Err_{t,1} denote the class-wise prediction error rates of a randomly generated ensemble of size t. As t → ∞, we provide a general bound on the "algorithmic variance", var(Err_{t,l} | D) ≤ f_l(1/2)^2/(4t) + o(1/t), where l ∈ {0,1} and f_l is a density function that arises from the ensemble method. Conceptually, this result is somewhat surprising: var(Err_{t,l} | D) describes how Err_{t,l} varies over repeated runs of the algorithm, and yet the formula leads to a method for bounding var(Err_{t,l} | D) using a single ensemble. The bound is also sharp, in the sense that it is attained by an explicit family of randomized classifiers. With regard to the task of estimating f_l(1/2), the presence of the ensemble leads to a unique twist on the classical setup of non-parametric density estimation, wherein the effects of sample size and computational cost are intertwined. In particular, we propose an estimator for f_l(1/2) and derive an upper bound on its MSE that matches "standard optimal non-parametric rates" when t is sufficiently large.
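To make the plug-in idea concrete, the Python sketch below (using scikit-learn) is a minimal, assumed setup rather than the paper's estimator: the dataset, ensemble size, and kernel bandwidth are all illustrative choices. It trains a single random forest, treats the fraction of trees voting for class 1 at each test point as a proxy for the limiting vote probability, estimates the class-wise density f_l at the majority-vote threshold 1/2 with a kernel density estimate, and plugs the result into the leading term f_l(1/2)^2/(4t) of the variance bound.

    # A minimal sketch (assumed setup, not the paper's estimator): estimate the
    # vote-proportion density f_l at the majority-vote threshold 1/2 from one
    # ensemble, and plug it into the leading term f_l(1/2)^2/(4t) of the bound.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KernelDensity

    # Fixed training set D and a held-out test set (synthetic, for illustration)
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    t = 500  # ensemble size
    rf = RandomForestClassifier(n_estimators=t, random_state=1).fit(X_tr, y_tr)

    # Fraction of the t trees voting for class 1 at each test point: a proxy
    # for the limiting vote probability whose class-wise density is f_l.
    votes = np.mean([tree.predict(X_te) for tree in rf.estimators_], axis=0)

    for l in (0, 1):
        v_l = votes[y_te == l]  # vote proportions among test points with label l
        # Kernel density estimate of f_l, evaluated at 1/2 (bandwidth is ad hoc)
        kde = KernelDensity(bandwidth=0.05).fit(v_l[:, None])
        f_half = np.exp(kde.score_samples([[0.5]]))[0]
        bound = f_half ** 2 / (4 * t)  # leading term of the algorithmic-variance bound
        print(f"class {l}: estimated f_l(1/2) = {f_half:.3f}, "
              f"var(Err_t,{l} | D) <= {bound:.2e} + o(1/t)")

Re-running this sketch with different random_state values for the forest gives a direct Monte Carlo check on var(Err_{t,l} | D); the point of the bound is that the single-ensemble plug-in above should already be of the right order.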
