Application of statistical properties of short sequences in the analysis of 16S ribosomal RNA and the identification of bacteria
by Zhu, Dianhui, Ph.D., UNIVERSITY OF HOUSTON, 2007, 172 pages; 3279600

Abstract:

Comparisons of 16S ribosomal RNA (16S rRNA) sequences are widely used to characterize relationships between bacteria and to identify unknown bacteria. The dramatically increasing number of 16S rRNA sequences and the large number of species present numerous computational challenges, many of which are addressed here. First, to facilitate efforts to characterize the sequences of hundreds of 16S rRNAs at a time, a software tool, STITCH, was developed. STITCH automates the process of splicing sequences obtained from reverse and forward primer reads and automatically searches the resulting sequence against the NCBI online database or a local database of type strains. STITCH has been used to process over 4,000 sequences. Second, an efficient software tool known as ProkProbePicker (PPP) was developed to rapidly design probe-target n-mers for all major groupings in a known phylogenetic tree using a fast string search based on the Karp-Rabin algorithm. When parallelized, the run time for this algorithm was reduced from 87 hours to 67 minutes. Third, to rapidly characterize the similarity of large numbers of 16S rRNA sequences, alignment-independent comparisons using n-mers were examined. Three measures of distance were considered: the linear correlation coefficient, the Angle distance, and the Manhattan distance. The Angle distance measure using 6-mers gave the best correlation with standard alignment-based methods and was therefore used to identify clusters of similar 16S rRNA sequences among over 300,000 database entries. Finally, an evolutionary computing approach was used to design universal arrays of 16S rRNA target subsequences that can be used to place any unidentified bacterium in the known phylogenetic context of a group of 16S rRNA sequences that represent the major clusters found with the Angle distance measure. A target set consisting of 703 20-mers was identified that was able to place an unknown organism within five tree nodes of the correct location over half the time. A larger array of 6011 20-mers achieved accuracy that was very close to the maximum obtainable accuracy.
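
The alignment-independent comparison summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact definitions used in the dissertation, so everything here is an assumption made for illustration: the function names are hypothetical, n-mer counts are normalized to relative frequencies, the Angle distance is taken as the arccosine of the cosine similarity between frequency vectors, and the linear correlation coefficient is taken as Pearson's r.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGU"  # RNA alphabet; T is mapped to U below (an assumption)


def nmer_vector(seq, n=6):
    """Relative n-mer frequencies of seq, in a fixed order over ALPHABET."""
    seq = seq.upper().replace("T", "U")
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    keys = ["".join(p) for p in product(ALPHABET, repeat=n)]
    total = sum(counts[k] for k in keys)  # ignores n-mers with ambiguous bases
    if total == 0:
        raise ValueError("sequence contains no valid n-mers")
    return [counts[k] / total for k in keys]


def manhattan_distance(u, v):
    """Sum of absolute differences between corresponding components."""
    return sum(abs(a - b) for a, b in zip(u, v))


def angle_distance(u, v):
    """Angle in radians between two vectors (assumed definition)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, cosine)))


def correlation_coefficient(u, v):
    """Pearson linear correlation coefficient between two vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_u * var_v)


if __name__ == "__main__":
    # Toy fragments, not real 16S rRNA data; n=3 keeps the vectors small.
    a = nmer_vector("ACGUACGGUACGUUAGC", n=3)
    b = nmer_vector("ACGUACGGUACGAUAGC", n=3)
    print(manhattan_distance(a, b))
    print(angle_distance(a, b))
    print(correlation_coefficient(a, b))
```

With 6-mers each sequence maps to a vector of 4^6 = 4096 components, so comparing two sequences costs a fixed amount of work regardless of how they would align. A real pipeline would likely use sparse or array-based vectors rather than Python lists.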

 
Advisor:
School: UNIVERSITY OF HOUSTON
Source: DAI/B 68-09, p. , Dec 2007
Source Type: Dissertation
Subjects: Bioinformatics; Computer science
Publication Number: 3279600
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3279600
  If your library subscribes to the ProQuest Dissertations & Theses (PQDT) database, you may be entitled to a free electronic version of this graduate work. If not, you will have the option to purchase one, and access a 24 page preview for free (if available).
