Parallel algorithms for large-scale computational metagenomics
by Wu, Changjun, Ph.D., WASHINGTON STATE UNIVERSITY, 2011, 118 pages; 3460453

Abstract:

Developing high performance computing solutions for modern-day biological problems presents a unique set of challenges. The field is experiencing a data revolution driven by the rapid introduction of several disruptive experimental technologies. Consequently, computational methods that analyze biological data are being put to the test in their ability to scale to massive data sizes. Compounding this data intensiveness, the computation itself is quite different in flavor from that in other, perhaps more traditional, scientific computing fields. The problems are dominated by integer arithmetic, string matching, combinatorial space exploration, and graph-theoretic formulations that introduce irregularity in computation and communication patterns.

In this thesis, we report on our efforts to bridge the gap between biological data processing and high performance computing solutions. Specifically, we focus on the problem of clustering very large collections of protein sequences on distributed memory supercomputers. Given a set of amino acid sequences, we reduce the problem to one of constructing a sequence homology graph and subsequently detecting arbitrarily sized dense subgraphs. Our approach efficiently parallelizes this task on a distributed memory machine through a combination of divide-and-conquer and combinatorial pattern matching heuristic techniques. Preliminary tests on an arbitrary collection of 2 million protein sequences from the Global Ocean Sampling project database reveal that our new approach improves sensitivity and recruits more sequences, while considerably reducing the time to solution and the memory requirement. The algorithmic techniques developed as part of this research are more widely applicable to other problems in computational biology wherever large-scale sequence analysis is the primary bottleneck.
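
The reduction described above (build a homology graph over the sequences, then look for dense subgraphs) can be illustrated with a small serial sketch. This is not the thesis's algorithm: k-mer Jaccard similarity is an assumed stand-in for a real homology test, connected components filtered by edge density are an assumed, crude stand-in for dense subgraph detection, and every function name and threshold below is hypothetical. The all-pairs comparison is also exactly the quadratic cost that a scalable method would need to avoid.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def homology_graph(seqs, k=3, threshold=0.5):
    """Add edge (i, j) when the k-mer Jaccard similarity reaches threshold.

    Assumption: Jaccard similarity stands in for an alignment-based test.
    """
    profiles = [kmers(s, k) for s in seqs]
    edges = set()
    for i, j in combinations(range(len(seqs)), 2):
        union = profiles[i] | profiles[j]
        if union and len(profiles[i] & profiles[j]) / len(union) >= threshold:
            edges.add((i, j))
    return edges


def connected_components(n, edges):
    """Group vertices 0..n-1 into components using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


def density(members, edges):
    """Fraction of possible vertex pairs inside members that are edges."""
    m = len(members)
    if m < 2:
        return 0.0
    inside = set(members)
    present = sum(1 for i, j in edges if i in inside and j in inside)
    return present / (m * (m - 1) / 2)


def dense_clusters(seqs, k=3, threshold=0.5, min_density=0.8):
    """Keep components of size >= 2 whose edge density is high enough."""
    edges = homology_graph(seqs, k, threshold)
    comps = connected_components(len(seqs), edges)
    return [c for c in comps
            if len(c) >= 2 and density(c, edges) >= min_density]
```

Calling `dense_clusters` on a list of amino acid strings returns lists of sequence indices, one list per retained cluster. Sequences that share no sufficiently similar partner are left out.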

 
Adviser: Ananth Kalyanaraman
School: WASHINGTON STATE UNIVERSITY
Source: DAI/B 72-09, p. , Jul 2011
Source Type: Dissertation
Subjects: Bioinformatics; Computer science
Publication Number: 3460453
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3460453