Predictive models of gene regulation
by Kundaje, Anshul Bharat, Ph.D., COLUMBIA UNIVERSITY, 2008, 163 pages; 3333379

Abstract:

The regulation of gene expression plays a central role in the development and function of a living cell. A complex network of interacting regulatory proteins binds specific sequence elements in the genome to control the amount and timing of gene expression. The abundance of genome-scale datasets from different organisms provides an opportunity to accelerate our understanding of the mechanisms of gene regulation. Developing computational tools to infer gene regulation programs from high-throughput genomic data is one of the central problems in computational biology.

In this thesis, we present a new predictive modeling framework for studying gene regulation. We formulate the problem of learning regulatory programs as a binary classification task: to accurately predict the condition-specific activation (up-regulation) and repression (down-regulation) of gene expression. The gene expression response is measured by microarray expression data. Genes are represented by various genomic regulatory sequence features. Experimental conditions are represented by the gene expression levels of various regulatory proteins. We use this combination of features to learn a prediction function for the regulatory response of genes under different experimental conditions. The core computational approach is based on boosting. Boosting algorithms allow us to learn high-accuracy, large-margin classifiers and avoid overfitting. We describe three applications of our framework to study gene regulation:

(1) In the GeneClass algorithm, we use a compendium of known transcription factor binding sites and gene expression data to learn a global context-specific regulation program that accurately predicts differential expression. GeneClass learns a prediction function in the form of an alternating decision tree, a margin-based generalization of a decision tree. We introduce a novel robust variant of boosting that improves stability and biological interpretability in the presence of correlated features. We also show how to incorporate genome-wide protein-DNA binding data from ChIP-chip experiments into the framework.

(2) In several organisms, the DNA binding sites of many transcription factors are unknown. Hence, automatic discovery of regulatory sequence motifs is required. In the MEDUSA algorithm, we integrate raw promoter sequence data and gene expression data to simultaneously discover cis regulatory motifs ab initio and learn predictive regulatory programs. MEDUSA automatically learns probabilistic representations of motifs and their corresponding target genes. We show that we are able to accurately learn the binding sites of most known transcription factors in yeast.

(3) We also design new techniques for extracting biologically and statistically significant information from the learned regulatory models. We use a margin-based score to extract global condition-specific regulomes as well as cluster-specific and gene-specific regulation programs. We develop a post-processing framework for interpreting and visualizing biological information encapsulated in our models.
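
The classification setup described above can be illustrated with a minimal sketch. It is not the thesis implementation: it swaps in plain discrete AdaBoost over decision stumps in place of the alternating decision trees used by GeneClass, and it assumes that motif presence, regulator state, and expression response have already been discretized. The function names (make_examples, adaboost, score) and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def make_examples(motifs, regulators, response):
    """Build one training example per (gene, condition) pair.

    motifs:     (n_genes, n_motifs) 0/1, motif present in the gene's promoter
    regulators: (n_conds, n_regs)   0/1, regulator active in the condition
    response:   (n_genes, n_conds)  in {-1, 0, +1}; 0 (no change) is skipped
    All three encodings are assumptions made for this sketch.
    """
    g, c = np.nonzero(response)
    # Paired feature (m, r) is 1 when motif m is present AND regulator r is active.
    X = (motifs[g][:, :, None] * regulators[c][:, None, :]).reshape(len(g), -1)
    return X, response[g, c]

def adaboost(X, y, rounds=10):
    """Discrete AdaBoost with stumps h(x) = sign * (2 * x[j] - 1)."""
    w = np.full(len(y), 1.0 / len(y))   # example weights, kept normalized
    H = 2 * X - 1                       # stump outputs in {-1, +1}
    model = []
    for _ in range(rounds):
        corr = (w * y) @ H              # weighted agreement of each feature with y
        j = int(np.argmax(np.abs(corr)))
        sign = 1.0 if corr[j] >= 0 else -1.0
        # Weighted error of the chosen stump, clipped to keep alpha finite.
        err = np.clip((1.0 - abs(corr[j])) / 2.0, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * sign * H[:, j])
        w /= w.sum()
        model.append((j, sign, alpha))
    return model

def score(model, X):
    """Real-valued prediction; its sign is the predicted up/down label."""
    H = 2 * X - 1
    return sum(alpha * sign * H[:, j] for j, sign, alpha in model)

# Toy data: a gene is up-regulated only when it carries motif 0
# and regulator 0 is active in the condition.
motifs = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
regulators = np.array([[1, 0], [0, 1]])
response = np.where(np.outer(motifs[:, 0], regulators[:, 0]) == 1, 1, -1)

X, y = make_examples(motifs, regulators, response)
model = adaboost(X, y, rounds=5)
# Normalized margin of each example: positive when the prediction is correct.
margins = y * score(model, X) / sum(alpha for _, _, alpha in model)
```

In this toy case the first stump selected is the paired feature (motif 0, regulator 0), which is the rule used to generate the labels. The normalized margin computed at the end is one simple way to rank examples or features by how confidently the model explains them; the margin-based score used in the thesis may be defined differently.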

We show the utility of our framework in analyzing several interesting biological contexts (environmental stress responses, the DNA-damage response, and the hypoxia response) in the budding yeast Saccharomyces cerevisiae. We also show that our methods can learn regulatory programs and cis regulatory motifs in higher eukaryotes such as worms and humans. Several hypotheses generated by our methods were validated by our collaborators using biochemical experiments. Experimental results demonstrate that our framework is quantitatively and qualitatively predictive: we achieve high prediction accuracy on test data and also generate specific, testable hypotheses.

 
Adviser: Christina Leslie
School: COLUMBIA UNIVERSITY
Source: DAI/B 69-10, p. , Dec 2008
Source Type: Dissertation
Subjects: Bioinformatics; Artificial intelligence; Computer science
Publication Number: 3333379
