Sparse hierarchical regression with polynomials

摘要

We present a novel method for sparse polynomial regression. We are interested in that degree r polynomial which depends on at most k inputs, counting at most \(\ell\) monomial terms, and minimizes the sum of the squares of its prediction errors. Such highly structured sparse regression was denoted by Bach (Advances in neural information processing systems, pp 105–112, 2009) as sparse hierarchical regression in the context of kernel learning. Hierarchical sparse specification aligns well with modern big data settings where many inputs are not relevant for prediction purposes and the functional complexity of the regressor needs to be controlled as to avoid overfitting. We propose an efficient two-step approach to this hierarchical sparse regression problem. First, we discard irrelevant inputs using an extremely fast input ranking heuristic. Secondly, we take advantage of modern cutting plane methods for integer optimization to solve the remaining reduced hierarchical \((k, \ell )\)-sparse problem exactly. The ability of our method to identify all k relevant inputs and all \(\ell\) monomial terms is shown empirically to experience a phase transition. Crucially, the same transition also presents itself in our ability to reject all irrelevant features and monomials as well. In the regime where our method is statistically powerful, its computational complexity is interestingly on par with Lasso based heuristics. Hierarchical sparsity can retain the flexibility of general nonparametric methods such as nearest neighbors or regression trees (CART), without sacrificing much statistical power. The presented work hence fills a void in terms of a lack of powerful disciplined nonlinear sparse regression methods in high-dimensional settings. Our method is shown empirically to scale to regression problems with \(n\approx 10{,}000\) observations for input dimension \(p\approx 1000\).