Boosting with Structural Sparsity

John Duchi and Yoram Singer

International Conference on Machine Learning (ICML 2009)

Contrary to popular belief, boosting algorithms and related coordinate descent methods are prone to overfitting. We derive modifications to AdaBoost and related gradient-based coordinate descent methods that incorporate, along their entire run, sparsity-promoting penalties on the norm of the predictor being learned. The end result is a family of coordinate descent algorithms that integrate forward feature induction and back-pruning through sparsity-promoting regularization, along with an automatic stopping criterion for feature induction. We study penalties based on the L1, L2, and L-infinity norms of the learned predictor and also introduce mixed-norm penalties that build upon the initial norm-based penalties. The mixed-norm penalties facilitate structural sparsity in the parameters of the predictor, a useful property in multiclass prediction and other related learning tasks. We report empirical results that demonstrate the power of our approach in building accurate and structurally sparse models from high-dimensional data.
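
The sketch below is only meant to illustrate the core mechanism the abstract describes, not the paper's exact derivation: coordinate descent on a margin loss with an L1 penalty, where each coordinate update is soft-thresholded so that weights can be back-pruned to exactly zero and a zero weight is only "inducted" when the magnitude of its loss gradient exceeds the regularization strength, which also yields a natural stopping criterion. The choice of logistic loss (instead of AdaBoost's exponential loss), the proximal update with a curvature bound, and all names (soft_threshold, l1_coordinate_descent, lam) are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, tau):
    """Soft-thresholding operator: the proximal map of tau * |.|."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def l1_coordinate_descent(X, y, lam=0.1, n_sweeps=100, tol=1e-6):
    """L1-regularized coordinate descent on the average logistic loss.

    X: (n, d) feature matrix; y: labels in {-1, +1}.
    A weight stays (or returns to) exactly zero unless the magnitude of
    its loss gradient exceeds lam, so feature induction stops by itself,
    and nonzero weights can be back-pruned to zero by the soft threshold.
    """
    n, d = X.shape
    w = np.zeros(d)
    # Per-coordinate curvature bound: the logistic loss has second derivative <= 1/4.
    lip = np.maximum((X ** 2).sum(axis=0) / (4.0 * n), 1e-12)
    for _ in range(n_sweeps):
        max_change = 0.0
        margins = y * (X @ w)
        for j in range(d):
            # Gradient of the average logistic loss with respect to coordinate j.
            g = -(y * X[:, j] / (1.0 + np.exp(margins))).mean()
            # Proximal (soft-thresholded) coordinate step.
            w_new = soft_threshold(w[j] - g / lip[j], lam / lip[j])
            if w_new != w[j]:
                margins += y * X[:, j] * (w_new - w[j])
                max_change = max(max_change, abs(w_new - w[j]))
                w[j] = w_new
        if max_change < tol:  # no coordinate makes progress: stop inducing features
            break
    return w

# Toy usage on synthetic data: most coordinates of w should come out exactly zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = np.sign(X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200))
w = l1_coordinate_descent(X, y, lam=0.05)
print("nonzero features:", np.flatnonzero(w))
```

Replacing the plain L1 penalty in this sketch with a group-wise (mixed-norm) penalty, applied across the rows of a multiclass weight matrix, would zero out entire feature groups at once, which is the structural sparsity effect the abstract refers to.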