A quasi-alignment based framework for discovery of conserved regions and classification of DNA fragments

by Nagar, Anurag, Ph.D., SOUTHERN METHODIST UNIVERSITY, 2013, 173 pages; 3606939


The last decade has seen an exponential growth in the amount of biological and genomic data produced by the various sequencing projects and laboratories using Next Generation Sequencing. There is a critical need for algorithms and tools that can efficiently analyze this data and generate useful summary information. Traditional methods such as Sequence Alignment and its derivatives have quadratic time complexity and thus are not suited for such large-scale analysis. In this work, we use high-efficiency data stream methods to rapidly analyze and cluster sequences based on their frequency distributions. Existing methods that convert sequences to frequency vectors do so at the cost of losing all associated metadata, such as their position within the sequence. This research proposes a position-sensitive clustering algorithm that is able to retain some of the metadata and use it to uncover interesting and novel details. We also extend and further develop the theory of Quasi-Alignment by separating it into two phases: position-sensitive clustering and association discovery. Both phases are analyzed in detail and various applications are presented. Using position-sensitive clustering, it is possible to identify conserved regions across multiple sequences in linear time by completely avoiding the costly Multiple Sequence Alignment procedures. Similarly, our methods allow analysis of sequences using much larger segment sizes than traditional heuristics such as Clustal. Our methods are able to store sequence details and associated meta information in the form of compact models, referred to as GenModels. The clusters and their associated transitions can be used for scoring sequences. This idea is used in this work to classify sequences against known GenModels and thus predict the taxonomic hierarchy of the sequence.
Our experiments are conducted on shorter fragments, which are typical in Next Generation Sequencing, and the results show that, in the case of 16S rRNA sequences, our methods are able to outperform the leading classifier.
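
The idea of position-sensitive clustering over frequency vectors can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names (`kmer_profile`, `cluster_segments`), the segment size, the k-mer length, the Manhattan distance, and the threshold rule are all assumptions chosen for illustration. The sketch cuts a sequence into fixed-size segments, converts each segment to a normalized k-mer frequency vector, assigns it to the nearest existing cluster in a single pass (or opens a new cluster), and records both the segment's start position and the transitions between consecutive clusters.

```python
from collections import Counter
from itertools import product


def kmer_profile(segment, k=2):
    """Return normalized counts of all 4**k DNA k-mers in `segment`."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(segment[i:i + k] for i in range(len(segment) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_segments(sequence, segment_size=8, k=2, threshold=0.5):
    """Single-pass, threshold-based clustering of consecutive segments.

    Returns (assignments, transitions), where assignments is a list of
    (start_position, cluster_id) pairs and transitions counts how often
    cluster i is immediately followed by cluster j.
    All parameter defaults are illustrative assumptions.
    """
    centers = []             # one representative profile per cluster
    assignments = []         # keeps the position of every segment
    transitions = Counter()  # (from_cluster, to_cluster) -> count
    for start in range(0, len(sequence) - segment_size + 1, segment_size):
        profile = kmer_profile(sequence[start:start + segment_size], k)
        best = min(range(len(centers)),
                   key=lambda c: manhattan(profile, centers[c]),
                   default=None)
        # Open a new cluster when nothing is close enough.
        if best is None or manhattan(profile, centers[best]) > threshold:
            centers.append(profile)
            best = len(centers) - 1
        if assignments:
            transitions[(assignments[-1][1], best)] += 1
        assignments.append((start, best))
    return assignments, transitions


assignments, transitions = cluster_segments("ACACACAC" + "GTGTGTGT" + "ACACACAC")
print(assignments)
print(dict(transitions))
```

In this toy input, the first and third segments share a k-mer profile and fall into the same cluster, while the middle segment opens a second one. Because each segment is visited once, the pass is linear in sequence length as long as the number of clusters stays bounded. The recorded transitions are the kind of information that could, in principle, be used to score a new sequence against a stored model.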

Adviser: Michael Hahsler
Source Type: Dissertation
Subjects: Bioinformatics; Computer science
Publication Number: 3606939
