Detecting recombination and its mechanistic association with genomic features via statistical models
by Rajaram, Misha, Ph.D., IOWA STATE UNIVERSITY, 2010, 157 pages; 3438725

Abstract:

Recombination is a powerful weapon in the evolutionary arsenal of retroviruses such as HIV. It enables the production of chimeric variants or recombinants that may confer a selective advantage to the pathogen over the host immune response. Recombinants further accentuate differences in virulence, disease progression and drug resistance mutation patterns already observed in non-recombinant variants of HIV. This thesis describes the development of a rapid genotyper for HIV sequences employing supervised learning algorithms and its application to complex HIV recombinant data, the application of a hierarchical model for detection of recombination hotspots in the HIV-1 genome and the extension of this model enabling estimation of the association between recombination probabilities and covariates of interest.

The rapid genotyper for HIV-1 explores a solution to the genotyping problem in the machine learning paradigm. Of the algorithms tested, the genotyper built using Bayesian additive regression trees (BART) was most successful in efficiently classifying complex recombinants that pose a challenge to other currently available genotyping methods. We also developed a novel method, bootSMOTE, for generating synthetic data in order to supplement insufficient training data. We found that supplementation with synthetic recombinants especially boosts identification of complex recombinants. We describe the genotyper software available for download as well as a web interface enabling rapid classification of HIV-1 sequences.

Hotspots for recombination in the HIV-1 genome are modeled using spatially smoothed changepoint processes. This hierarchical model uses a phylogenetic recombination detection model of dual changepoint processes at the lower level. The upper level applies a Gaussian Markov random field (GMRF) hyperprior to population-level recombination probabilities in order to efficiently combine the information from many individual recombination events as inferred at the lower level. Focusing on 544 unique recombinant sequences, we found a novel hotspot in the pol gene of HIV-1 while confirming the presence of high recombination activity in the env gene.

Valuable insights into the molecular mechanism of recombination may be gained by extending the GMRF model to include covariates of interest. We add a level to the hierarchical model and allow for the simultaneous inference of recombination probabilities as well their association with genomic covariates of interest. Using a set of 527 unique recombinants, we confirmed the presence of the pol hotspot. Interestingly, we found significant positive associations of spatial fluctuations in recombination probabilities with genomic regions prone to forming secondary structure as well as significant negative associations with regions that support tight RNA-DNA hybrid formation. Overall, our results support the theory that pause sites along the genome promote recombination.

 
AdvisersKarin Dorman; Dennis Lavrov
SchoolIOWA STATE UNIVERSITY
SourceDAI/B 72-03, p. , Feb 2011
Source TypeDissertation
SubjectsBioinformatics
Publication Number3438725
Adobe PDF Access the complete dissertation:
 

» Find an electronic copy at your library.
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3438725
  If your library subscribes to the ProQuest Dissertations & Theses (PQDT) database, you may be entitled to a free electronic version of this graduate work. If not, you will have the option to purchase one, and access a 24 page preview for free (if available).

About ProQuest Dissertations & Theses
With over 2.3 million records, the ProQuest Dissertations & Theses (PQDT) database is the most comprehensive collection of dissertations and theses in the world. It is the database of record for graduate research.

The database includes citations of graduate works ranging from the first U.S. dissertation, accepted in 1861, to those accepted as recently as last semester. Of the 2.3 million graduate works included in the database, ProQuest offers more than 1.9 million in full text formats. Of those, over 860,000 are available in PDF format. More than 60,000 dissertations and theses are added to the database each year.

If you have questions, please feel free to visit the ProQuest Web site - http://www.proquest.com - or call ProQuest Hotline Customer Support at 1-800-521-3042.