Improving iterative similarity searches with better alignments and better statistics
by Gonzalez, Mileidy, Ph.D., UNIVERSITY OF MARYLAND, BALTIMORE COUNTY, 2010, 121 pages; 3408173

Abstract:

In this work, we evaluate the current limitations of iterative similarity searches and investigate several strategies to improve their performance. To evaluate the reliability of iterative approaches as homology inference tools, we create RefProtDom, a benchmarking dataset of diverse query domains and full-length proteins containing their homologs in a variety of architectures, to simulate searches of real proteins with complex homology relationships. RefProtDom’s homology annotations and boundaries were manually supplemented using local and semi-global searches, reciprocal searches, and structural classifications to ensure accurate alignment evaluation. Using RefProtDom, we identified a previously unrecognized source of error in PSI-BLAST (Position Specific Iterated BLAST) that is responsible for its profile corruption: Homologous Over-extension (HOE). HOE accounts for the largest fraction of the initial false positive errors (hard queries: 86%; sampled queries: 68%) and the largest fraction of false positives at iteration 5 (hard: 51-91%; sampled: 49-69%). We implement a strategy (noExt) to reduce the HOE error that increases PSI-BLAST’s specificity 4-8 fold. We also show that HOE is not a similarity-measurement or statistical error, but rather an alignment strategy error to which all iterative similarity-searching methods are susceptible. The superior scores and gap penalties afforded by the rigorous strategies of PSI-SEARCH noExt and JACKHMMER, respectively, did not provide any improvement over PSI-BLAST noExt. For the unmodified methods, rigorous strategies do provide better sensitivity in pairwise searches (e.g., SSEARCH: 6.2% hard family coverage; BLAST: 4.3%) and better specificity in profile searches (e.g., PSI-SEARCH: 0.05% hard non-homologous errors; PSI-BLAST: 2%). In the absence of a noExt modification, JACKHMMER outperforms PSI-SEARCH, and both outperform PSI-BLAST. However, HOE is so pervasive across all iterative methods that addressing it is enough to level their performance. Our noExt strategy effectively prevents the propagation of the HOE alignment pathology, but it does not prevent HOE from occurring when a homology is first identified. We conclude by outlining future directions to improve alignment accuracy, which, if addressed, should allow the sensitivity of iterative searches to more closely approach that of structural comparisons.
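The abstract does not give an operational definition of HOE. As a rough illustration only, the sketch below assumes that an over-extended alignment is one that overlaps an annotated homologous domain but continues past that domain's boundary into flanking sequence. The names `Region`, `overextension`, and `classify`, and the `tolerance` threshold of 10 residues, are hypothetical and are not taken from the dissertation or from RefProtDom.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A 1-based, inclusive coordinate range on a library sequence."""
    start: int
    end: int


def overlap_length(a: Region, b: Region) -> int:
    """Number of residues shared by two regions (0 if they are disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def overextension(alignment: Region, domain: Region) -> int:
    """Number of aligned residues that fall outside the annotated domain."""
    aligned = alignment.end - alignment.start + 1
    return aligned - overlap_length(alignment, domain)


def classify(alignment: Region, domain: Region, tolerance: int = 10) -> str:
    """Label an alignment against one annotated homologous domain.

    `tolerance` is an assumed allowance for imprecise domain boundaries;
    the value is illustrative, not a published cutoff.
    """
    if overlap_length(alignment, domain) == 0:
        return "non-homologous"   # alignment misses the domain entirely
    if overextension(alignment, domain) > tolerance:
        return "over-extended"    # starts in the domain, runs past its edge
    return "homologous"           # stays within the domain, within tolerance


if __name__ == "__main__":
    domain = Region(100, 200)
    for aln in (Region(110, 190), Region(150, 260), Region(300, 350)):
        print(aln, classify(aln, domain))
```

Under this assumed definition, an alignment is counted separately from an outright non-homologous hit whenever it shares at least one residue with the true domain, which mirrors the abstract's distinction between HOE errors and non-homologous errors.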

 
Adviser: Maricel G. Kann
School: UNIVERSITY OF MARYLAND, BALTIMORE COUNTY
Source: DAI/B 71-07, p. , Jul 2010
Source Type: Dissertation
Subjects: Bioinformatics
Publication Number: 3408173
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3408173
