Learning sparse classifiers with difference of convex functions algorithms
Sparsity of a classifier is a desirable property for high-dimensional data and large sample sizes. This paper investigates the two complementary notions of sparsity for binary classification: sparsity in the number of features and sparsity in the number of examples. Several different losses and regularizers are considered: the hinge loss and ramp loss, and l2, l1, approximate l0, and capped l1 regularization. We propose two new objective functions that further promote sparsity, corresponding to the ramp loss versions of approximate l0 and capped l1 regularization. We derive difference of convex functions algorithms (DCA) for solving these novel non-convex objective functions. We also propose an efficient DCA for optimizing the recently studied capped l1 regularizer under the hinge loss. The proposed algorithms are shown to converge in a finite number of iterations to a local minimum. Using simulated data and several datasets from the UCI machine learning repository, we empirically investigate the fraction of features and examples required by the different classifiers.
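To make the construction concrete, the following is a minimal Python sketch (assuming NumPy; not part of the released code) of a DCA outer loop for a linear ramp-loss SVM: the ramp loss is written as a difference of two hinge losses, and each outer iteration linearises the concave part and minimises the resulting convex hinge-loss subproblem, here with plain subgradient descent. The function name dca_ramp_svm and the parameters lam, s, n_outer, n_inner, lr are assumptions of this sketch, not names taken from the paper or from OptWok.

import numpy as np

def dca_ramp_svm(X, y, lam=1.0, s=-1.0, n_outer=20, n_inner=200, lr=0.01):
    # Illustrative sketch (not the released code).
    # Minimise  lam/2 ||w||^2 + sum_i ramp_s(y_i x_i.w)  using the DC split
    #   ramp_s(z) = max(0, 1 - z) - max(0, s - z)   (difference of two hinges),
    # i.e. objective = g(w) - h(w) with
    #   g(w) = lam/2 ||w||^2 + sum_i max(0, 1 - y_i x_i.w)   (convex),
    #   h(w) = sum_i max(0, s - y_i x_i.w)                   (convex).
    # Each outer step linearises h at the current w and minimises the
    # resulting convex upper bound by plain subgradient descent.
    n, d = X.shape
    Xy = X * y[:, None]                      # rows are y_i * x_i
    w = np.zeros(d)
    for _ in range(n_outer):
        margins = Xy @ w
        # Subgradient of h: points with margin < s are the badly
        # misclassified examples that the ramp loss "gives up on".
        beta = -Xy[margins < s].sum(axis=0)
        for _ in range(n_inner):
            m = Xy @ w
            # Subgradient of g(w) - beta.w
            grad = lam * w - Xy[m < 1].sum(axis=0) - beta
            w -= lr * grad
    return w

# Toy usage on synthetic data (assumed; not the paper's experiments):
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, (50, 2)), rng.normal(-1.0, 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
print(dca_ramp_svm(X, y))

Each outer DCA step does not increase the objective, which is the mechanism behind the finite convergence to a local minimum mentioned in the abstract; the same linearise-then-solve pattern carries over to the approximate l0 and capped l1 regularizers.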
Download the code and the data to reproduce the experiments.
The software depends on OptWok version 0.1.