Sparse regression with exact clustering
by She, Yiyuan, Ph.D., STANFORD UNIVERSITY, 2008, 113 pages; 3332922

Abstract:

This dissertation deals with three closely related topics of the lasso in addition to supplying a comprehensive overview of the rapidly growing literature in this field.

The first part aims at improving the lasso to attain smaller predictor error while simultaneously keeping the model sparsity. We propose the data-augmented weighted lasso (DAWL) to make a natural combination of the lasso and other estimators like ridge regression. We investigate the data-augmentation starting from the ridge's nature in solving the singularity problem which successfully explains the reasonability of the elastic net, and from a non-asymptotic study of the lasso's variable selection which describes the roles of different parts of the Gram matrix played in lasso estimation and selection. A robust data-dependent scaling and a 'ranged lasso' are proposed to augment both the regression matrix (nondiagonally) and the response vector. In the discussions of weights, we prove a sharp oracle inequality for the weighted lasso in the orthogonal case, and propose z-value based weights with good asymptotics. Simulations show the advantages of DAWL in test error and sparsity.

The second topic is the study of a generic sparse regression problem with a customizable sparsity pattern matrix, motivated by, but not limited to, a supervised gene clustering problem in microarray data analysis. The 'clustered lasso' method is proposed with l1-type penalties on both the coefficients and their pairwise differences. Somewhat surprisingly, it shows a quite different behavior than the lasso or the fused lasso; the granted power of the l1-penalty to approximate the l0-penalty seems specious in this situation. This leads us to a theoretical study of the power and limitations, of the l1-penalty in the genera framework of sparse regression. We then discuss how to combine data-augmentation and weights to improve the naïve l1-penalty. To attack the challenging computation problem in high-dimensional space, we successfully generalize an iterative algorithm for solving the lasso and develop a novel accelerated 'annealing' algorithm with theoretical justifications. It applies to any sparse regression like the fused/clustered lasso, and can handle a large design matrix as well as a large sparsity pattern matrix with apparent ease.

In the third part, we discuss a class of thresholding-based iterative selection procedures (TISP) for model selection and shrinkage. People have long before noticed the weakness of the convex l1-constraint (or the soft-thresholding) in wavelets and have designed many different forms of nonconvex penalties to increase model sparsity and accuracy. But for a nonorthogonal regression matrix, there is great difficulty in both investigating the performance in theory and solving the problem in computation. TISP provides a simple and efficient way to tackle this so that we successfully borrow the rich results in the orthogonal design to solve the (nonconvexly) penalized regression for a general design matrix. Our starting point is, however, the thresholding rules rather than the penalty functions. Indeed, there is a universal connection between them. But a drawback of the latter is its non-unique form, and our way greatly facilitates the computation and the analysis. In fact, we are able to build the convergence theorem and explore the theoretical properties of the selection and the estimation via TISP nonasymptotically. More importantly, a novel Hybrid-TISP is proposed based on hard-thresholding and ridge-thresholding. It provides a fusion between the l0-penalty and the l2-penalty, and adaptively achieves the right balance between shrinkage and selection in statistical modeling. In practice, Hybrid-TISP shows superior performance in test-error and is parsimonious.

 
Advisor
SchoolSTANFORD UNIVERSITY
SourceDAI/B 69-10, p. , Dec 2008
Source TypeDissertation
SubjectsStatistics; Computer science
Publication Number3332922
Adobe PDF Access the complete dissertation:
 

» Find an electronic copy at your library.
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3332922
  If your library subscribes to the ProQuest Dissertations & Theses (PQDT) database, you may be entitled to a free electronic version of this graduate work. If not, you will have the option to purchase one, and access a 24 page preview for free (if available).

About ProQuest Dissertations & Theses
With over 2.3 million records, the ProQuest Dissertations & Theses (PQDT) database is the most comprehensive collection of dissertations and theses in the world. It is the database of record for graduate research.

The database includes citations of graduate works ranging from the first U.S. dissertation, accepted in 1861, to those accepted as recently as last semester. Of the 2.3 million graduate works included in the database, ProQuest offers more than 1.9 million in full text formats. Of those, over 860,000 are available in PDF format. More than 60,000 dissertations and theses are added to the database each year.

If you have questions, please feel free to visit the ProQuest Web site - http://www.proquest.com - or call ProQuest Hotline Customer Support at 1-800-521-3042.