On the Self-Penalization Phenomenon in Feature Selection
We describe an implicit sparsity-inducing mechanism based on minimization over a family of kernels: min_{β,f} 𝔼[L(Y, f(β^{1/q} ⊙ X))] + λ_n ‖f‖²_{ℋ_q} subject to β ≥ 0, where L is the loss, ⊙ denotes coordinatewise multiplication, and ℋ_q is the reproducing kernel Hilbert space based on the kernel k_q(x, x') = h(‖x − x'‖_q^q), where ‖·‖_q is the ℓ_q norm. Using gradient descent to optimize this objective with respect to β leads to exactly sparse stationary points with high probability. The sparsity is achieved without using any of the well-known explicit sparsification techniques such as penalization (e.g., ℓ_1), early stopping, or post-processing (e.g., clipping). As an application, we use this sparsity-inducing mechanism to build algorithms that are consistent for feature selection.
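To make the mechanism concrete, the sketch below instantiates the objective under some illustrative assumptions not fixed by the abstract: squared loss, q = 1, and h(t) = e^{-t} (a Laplace-type kernel), so that the kernel on the scaled data β^{1/q} ⊙ X becomes K_ij = exp(−Σ_ℓ β_ℓ |x_iℓ − x_jℓ|). For fixed β, the inner minimization over f ∈ ℋ_q has the usual kernel-ridge closed form, and the β ≥ 0 constraint is maintained by a projection step during the gradient iteration. Function names, the step size, and the particular h are assumptions for illustration, not the paper's prescribed setup.

```python
import jax
import jax.numpy as jnp


def laplace_kernel(X1, X2, beta, q=1.0):
    # k_q(x, x') = h(||beta^{1/q} ⊙ (x - x')||_q^q) with h(t) = exp(-t):
    # the exponent is sum_l beta_l * |x_l - x'_l|^q.
    diffs = jnp.abs(X1[:, None, :] - X2[None, :, :]) ** q  # (n1, n2, d)
    return jnp.exp(-(diffs @ beta))                        # (n1, n2)


def reduced_objective(beta, X, Y, lam):
    # For squared loss, min over f in H_q has the kernel-ridge closed form;
    # plug it back in to get an objective in beta alone.
    n = X.shape[0]
    K = laplace_kernel(X, X, beta)
    alpha = jnp.linalg.solve(K + n * lam * jnp.eye(n), Y)
    fit = K @ alpha
    return jnp.mean((Y - fit) ** 2) + lam * alpha @ K @ alpha


grad_fn = jax.jit(jax.grad(reduced_objective))


def projected_gradient_descent(X, Y, lam=0.1, step=0.1, iters=500):
    beta = jnp.ones(X.shape[1])  # start with every feature "on"
    for _ in range(iters):
        g = grad_fn(beta, X, Y, lam)
        beta = jnp.maximum(beta - step * g, 0.0)  # enforce beta >= 0
    return beta


# Toy check: Y depends only on the first two of ten features.
key_x, key_eps = jax.random.split(jax.random.PRNGKey(0))
X = jax.random.normal(key_x, (100, 10))
Y = X[:, 0] - X[:, 1] + 0.1 * jax.random.normal(key_eps, (100,))
beta_hat = projected_gradient_descent(X, Y)
print(beta_hat)
```

If the self-penalization phenomenon holds in this setting, the coordinates of β corresponding to the irrelevant features should sit at exactly zero at a stationary point, with no ℓ_1 penalty, early stopping, or post-hoc thresholding applied.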