Prediction of transcription factor binding sites using information from multiple species
by Siewert, Elizabeth Allan, Ph.D., UNIVERSITY OF COLORADO HEALTH SCIENCES CENTER, 2010, 214 pages; 3416555

Abstract:

De novo identification of transcription factor binding sites (TFBSs) is a challenging computational problem because TFBSs are relatively short sequences buried in long genomic regions. Earlier methods incorporated genome-wide expression data and promoter sequences into a linear-model framework, regressing expression on counts of putative TFBSs in promoters for a single species. More recently, it has been shown that including sequence data from multiple species improves the predictive ability of this regression model.
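The single-species regression idea can be illustrated with a minimal sketch. It assumes a candidate motif is an exact k-mer and that expression is regressed on the per-promoter count by ordinary least squares. The function names (`count_motif`, `motif_regression`) and the toy data are hypothetical and are not taken from the thesis.

```python
import numpy as np

def count_motif(promoter: str, motif: str) -> int:
    """Count occurrences of motif in promoter, allowing overlaps."""
    k = len(motif)
    return sum(promoter[i:i + k] == motif for i in range(len(promoter) - k + 1))

def motif_regression(promoters, expression, motif):
    """Fit expression_g = b0 + b1 * count_g by ordinary least squares."""
    counts = np.array([count_motif(p, motif) for p in promoters], dtype=float)
    X = np.column_stack([np.ones_like(counts), counts])  # intercept + motif count
    y = np.asarray(expression, dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta  # [intercept, slope]

# Toy data, invented for illustration only: expression equals the motif count.
promoters = ["ACGTACGT", "TTTTTTTT", "ACGTTTTT", "GGGGACGT"]
expression = [2.0, 0.0, 1.0, 1.0]
intercept, slope = motif_regression(promoters, expression, "ACGT")
```

In a motif search of this kind, each candidate motif would be scored by how much of the expression variation its counts explain. The ranking criterion is left out here because the abstract does not state it.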

In this thesis, we describe two extensions of this single-species, linear-model framework. These algorithms extend the search space to both sequence and expression information from all available genes across multiple species. Our first model uses a repeated-measures approach in which we treat the gene-expression measurements across species as repeated measurements across evolutionary time. This model imposes the phylogenetic relationships among species on the error covariance structure. Our second model uses a Bayesian hierarchical approach in which we impose the phylogenetic relationships among the species on the prior distributions of the regression coefficients. For each model, we also consider two strategies: (1) retaining all covariates in the model in a forward-selection manner, or (2) calculating the residual expression measures and using them for each subsequent regression.
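As a rough illustration of the repeated-measures idea, the sketch below fits a generalized least squares (GLS) regression in which the errors for one gene across species share a covariance matrix derived from a tree. Everything specific here is an assumption: the function name `gls_phylo`, the 4 x 4 matrix `sigma` (a made-up shared-branch-length covariance), the gene-major row ordering, and the noise-free toy data. The abstract does not specify the thesis's actual covariance form or estimation procedure.

```python
import numpy as np

def gls_phylo(X, y, sigma, n_genes):
    """GLS fit with error covariance kron(I_genes, sigma).

    Rows of X and y are ordered gene-major: all species for gene 1,
    then all species for gene 2, and so on.
    """
    V = np.kron(np.eye(n_genes), sigma)   # block-diagonal covariance
    L = np.linalg.cholesky(V)             # V = L @ L.T
    Xw = np.linalg.solve(L, X)            # whiten the design matrix
    yw = np.linalg.solve(L, y)            # whiten the response
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta

# Hypothetical covariance for four species: entries are shared branch
# lengths on an invented tree with two clades of two species each.
sigma = np.array([
    [1.0, 0.6, 0.3, 0.3],
    [0.6, 1.0, 0.3, 0.3],
    [0.3, 0.3, 1.0, 0.5],
    [0.3, 0.3, 0.5, 1.0],
])

n_genes, n_species = 5, 4
rng = np.random.default_rng(0)
counts = rng.integers(0, 4, size=n_genes * n_species).astype(float)
X = np.column_stack([np.ones_like(counts), counts])  # intercept + motif count
beta_true = np.array([0.5, 1.5])
y = X @ beta_true                                    # noise-free toy response
beta_hat = gls_phylo(X, y, sigma, n_genes)
```

Because the toy response has no noise, `beta_hat` recovers `beta_true`. With real data the covariance would change the weighting of the species relative to ordinary least squares.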

These multiple-species algorithms were developed using a data set of four yeast species grown under heat-shock conditions. Comparisons are made first to the single-species algorithm and then to each other. Using evaluations based on the information content of the predicted motifs, and comparisons to two independent data sets, we find that all multiple-species results show an improvement in the prediction of TFBSs over the single-species algorithm.
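For readers unfamiliar with information content as a motif score, the sketch below computes it for a set of aligned sites. It assumes the common definition with a uniform background over A, C, G and T, where each position contributes 2 minus its Shannon entropy in bits. The abstract does not say which variant the thesis uses, and the function name `information_content` is hypothetical.

```python
import numpy as np

def information_content(sites, pseudocount=0.0):
    """Total information content (bits) of equal-length aligned DNA sites.

    Assumes a uniform background, so each position contributes
    2 - H(position), where H is the Shannon entropy in bits.
    """
    alphabet = "ACGT"
    length = len(sites[0])
    counts = np.full((length, 4), pseudocount, dtype=float)
    for site in sites:
        for i, base in enumerate(site):
            counts[i, alphabet.index(base)] += 1
    freqs = counts / counts.sum(axis=1, keepdims=True)
    # Treat 0 * log2(0) as 0, the usual convention for entropy.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(freqs > 0, freqs * np.log2(freqs), 0.0)
    entropy = -terms.sum(axis=1)
    return float((2.0 - entropy).sum())

conserved = information_content(["ACGT", "ACGT"])      # every position fixed
uniform = information_content(["A", "C", "G", "T"])    # no preference at all
```

A fully conserved position scores 2 bits and a position with equal base frequencies scores 0, so higher totals indicate more sharply defined motifs.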

 
Adviser: Katerina J. Kechris
School: UNIVERSITY OF COLORADO HEALTH SCIENCES CENTER
Source: DAI/B 71-08, p. , Aug 2010
Source Type: Dissertation
Subjects: Biostatistics; Bioinformatics
Publication Number: 3416555
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3416555