Learning graphical models with limited observations of high-dimensional data
by Yoruk, Erdem, Ph.D., THE JOHNS HOPKINS UNIVERSITY, 2011, 217 pages; 3483394

Abstract:

In many computational domains, the number of samples available for learning remains small relative to the growing dimensionality of the data. This situation is common in computational biology and vision, and it poses even greater challenges when combined with complex interactions among variables. Consequently, the bias-variance trade-off requires one to invest model parameters with the utmost care for robust learning. In the small-sample context, we argue for incorporating available prior knowledge and introducing carefully chosen biases to reduce variance. Motivated by this philosophy and by particular problems from biology and vision, we propose two new generative approaches within a graphical model formalism: (a) a comprehensive statistical model for analyzing cell signaling networks, and (b) a restricted family of latent variable forest models for discovering complex dependencies.

The first method is specific to protein signaling networks, which play a central role in transcriptional regulation and the etiology of many diseases. Because the molecular connections are known, our model is anchored to a predefined core signaling topology. Its complexity is limited through parameter sharing, and it uses expression data of target genes as the only observable components. Specifically, we account for cell heterogeneity and a multi-level process: signaling is represented as a Bayesian network at the cell level, measurements are modeled as ensemble averages at the tissue level, and patient-to-patient differences are incorporated at the population level. We applied our method to the RAS-RAF network using a breast cancer study. We demonstrated robust statistical inference, established reproducibility through simulations, and showed that receptor status can be recovered from available microarray data.
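
To make the multi-level idea concrete, the following is a minimal sketch and not the dissertation's model. It assumes a hypothetical three-node chain (receptor, kinase, transcription factor), binary node states, one propagation probability shared by both edges (a stand-in for parameter sharing), and made-up expression levels `expr_off` and `expr_on`. Each cell is sampled from the chain, and the tissue-level measurement is the average target-gene expression over cells; a patient-level difference could be represented by varying `p_receptor_active` from patient to patient.

```python
import random


def sample_cell(p_receptor_active, p_propagate, rng):
    """Sample one cell and return whether its transcription factor is active.

    Hypothetical chain: receptor -> kinase -> transcription factor.
    The same probability p_propagate is used on both edges.
    """
    receptor = rng.random() < p_receptor_active
    kinase = receptor and rng.random() < p_propagate
    tf_active = kinase and rng.random() < p_propagate
    return tf_active


def tissue_measurement(n_cells, p_receptor_active, p_propagate,
                       expr_off=1.0, expr_on=5.0, seed=0):
    """Average target-gene expression over n_cells independently sampled cells.

    expr_off and expr_on are illustrative expression levels, not fitted values.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_cells):
        tf_active = sample_cell(p_receptor_active, p_propagate, rng)
        total += expr_on if tf_active else expr_off
    return total / n_cells


# Two hypothetical patients that differ only in receptor activity.
patient_low = tissue_measurement(1000, p_receptor_active=0.1, p_propagate=0.9)
patient_high = tissue_measurement(1000, p_receptor_active=0.9, p_propagate=0.9)
```

In this toy setting only the averaged expression is observed, so information about the hidden receptor state has to be recovered from the tissue-level value, which is the flavor of inference the paragraph above describes.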

Our second method addresses the deeper problem of model selection. We propose a restricted family of forest-structured distributions that are Markov, with observed leaf variables regulated hierarchically by non-terminal latent variables. With a nested design, our model family allows a well-principled stepwise discovery of dependencies via sequential aggregations of pending substructures. Using particular parametric choices, we prove identifiability of our models and carry out exact inference via dynamic programming. We apply our generative approach to synthesis and classification of handwritten digits, and to phenotype prediction from microarray data, with performance comparable to state-of-the-art discriminative methods.
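
As an illustration of exact inference by dynamic programming on a forest with latent internal nodes and observed leaves, here is a minimal sketch. It is a generic upward (sum-product) pass and not the dissertation's algorithm or parametrization. It assumes binary variables, and the names `children`, `prior`, `cpt`, and `observed`, along with all numeric values, are hypothetical.

```python
def upward_message(node, children, cpt, observed):
    """Return [P(observed leaves below node | node = 0), same for node = 1]."""
    if node in observed:
        # Observed leaf: indicator of the observed value.
        return [1.0 if s == observed[node] else 0.0 for s in (0, 1)]
    msg = [1.0, 1.0]
    for child in children[node]:
        child_msg = upward_message(child, children, cpt, observed)
        for s in (0, 1):
            # cpt[child][s][t] is P(child = t | parent = s).
            msg[s] *= sum(cpt[child][s][t] * child_msg[t] for t in (0, 1))
    return msg


def forest_likelihood(roots, children, prior, cpt, observed):
    """Likelihood of the observed leaves; the trees of a forest are independent."""
    likelihood = 1.0
    for root in roots:
        msg = upward_message(root, children, cpt, observed)
        likelihood *= sum(prior[root][s] * msg[s] for s in (0, 1))
    return likelihood


# Hypothetical forest: latent "h" with leaves "x1", "x2"; latent "g" with leaf "x3".
roots = ["h", "g"]
children = {"h": ["x1", "x2"], "g": ["x3"]}
prior = {"h": [0.6, 0.4], "g": [0.5, 0.5]}
cpt = {
    "x1": [[0.9, 0.1], [0.2, 0.8]],
    "x2": [[0.7, 0.3], [0.4, 0.6]],
    "x3": [[0.8, 0.2], [0.3, 0.7]],
}
example = forest_likelihood(roots, children, prior, cpt,
                            {"x1": 1, "x2": 0, "x3": 1})
```

Each node is visited once, so the cost grows linearly with the number of nodes instead of exponentially with the number of latent variables, which is what makes inference exact and tractable on tree-structured models of this kind.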

 
Adviser: Donald Geman
School: THE JOHNS HOPKINS UNIVERSITY
Source: DAI/B 73-01, p. , Nov 2011
Source Type: Dissertation
Subjects: Applied mathematics; Biomedical engineering; Electrical engineering; Oncology
Publication Number: 3483394
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3483394