Extracting signal from noise in biological data: Evaluations and applications of text mining and sequence coevolution
by Caporaso, J. Gregory, Ph.D., UNIVERSITY OF COLORADO HEALTH SCIENCES CENTER, 2009, 177 pages; 3361660

Abstract:

As the quantity of biological data continues to expand, it is the role of the computational biologist to develop new methods and tools to efficiently and accurately translate biological data into biological knowledge. Focusing on biomedical literature and biological sequences, this dissertation is about techniques for learning more from biological data.

The early chapters address biomedical text mining, and specifically the problem of automatically compiling data on protein point mutations from biomedical literature. Protein point mutations and substitutions are central in many areas of biomedical research, including human disease, biodiversity, and protein structure/function relationships. Mutation databases exist to centralize known information, but are frequently expensive to compile. An automated approach is presented for developing high-performance text mining systems and is applied to develop MutationFinder, a tool that scans text and extracts descriptions of point mutations into structured formats. Manual and automated approaches for annotating mutations are then compared, resulting in the conclusion that combining automatic and manual annotation tools may be the best approach to develop comprehensive and accurate biomedical databases.
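
As a rough illustration of what extracting point-mutation mentions into a structured format can look like, the following is a minimal sketch using regular expressions. It is not MutationFinder's actual rule set or API: the two patterns, the `THREE_TO_ONE` table, and the function name `extract_point_mutations` are assumptions made for this example, and they only cover mentions written like "A42G" or "Ala42Gly".

```python
import re

# Hypothetical lookup table for this sketch: three-letter to one-letter
# amino acid codes for the 20 standard residues.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

ONE_LETTER = "ACDEFGHIKLMNPQRSTVWY"
THREE_LETTER = "|".join(THREE_TO_ONE)

# Assumed, simplified patterns: wild-type residue, position, mutant residue.
ONE_LETTER_PATTERN = re.compile(
    rf"\b([{ONE_LETTER}])(\d+)([{ONE_LETTER}])\b"
)
THREE_LETTER_PATTERN = re.compile(
    rf"\b({THREE_LETTER})(\d+)({THREE_LETTER})\b"
)


def extract_point_mutations(text):
    """Return a set of (wild_type, position, mutant) tuples found in text.

    Residues are normalized to one-letter codes so that "A42G" and
    "Ala42Gly" map to the same structured record.
    """
    mutations = set()
    for wild, pos, mutant in ONE_LETTER_PATTERN.findall(text):
        mutations.add((wild, int(pos), mutant))
    for wild, pos, mutant in THREE_LETTER_PATTERN.findall(text):
        mutations.add((THREE_TO_ONE[wild], int(pos), THREE_TO_ONE[mutant]))
    return mutations
```

Patterns this simple miss free-text descriptions of substitutions and can match strings that are not mutations at all (cell-line or gene names, for example), which is one reason a combination of automatic and manual annotation may be attractive.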

The later chapters focus on identifying pairs of coevolving positions in proteins. Just as macroscopic structures like the bills of hummingbirds and the corolla tubes of flowering plants coevolve, it is expected that interacting positions within and between proteins also coevolve to maintain highly specific interactions. If true, coevolutionary signals should be detectable in multiple sequence alignments and may contain information on intramolecular or intermolecular interactions between amino acid residues. An analysis of coevolution algorithms leads to the surprising conclusion that algorithms that do not incorporate phylogeny can match the performance of those that do incorporate phylogeny. A coevolution algorithm is then applied to predict interactions between component proteins of the Type VI Secretion System (T6SS), leading to a new model of the T6SS.
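
To make the idea of a coevolutionary signal in a multiple sequence alignment concrete, the sketch below scores a pair of alignment columns with mutual information, a well-known measure that does not use phylogeny. The abstract does not name the specific algorithms evaluated, so this is an illustrative assumption rather than the dissertation's method, and the function names are hypothetical and are not the PyCogent coevolution module's API.

```python
from collections import Counter
from math import log2


def column(alignment, index):
    """Return the residues at one position across all aligned sequences."""
    return [sequence[index] for sequence in alignment]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.

    MI = sum over observed residue pairs (a, b) of
         p(a, b) * log2(p(a, b) / (p(a) * p(b)))
    High values mean that knowing the residue at one position tells
    you a lot about the residue at the other position.
    """
    n = len(col_a)
    counts_a = Counter(col_a)
    counts_b = Counter(col_b)
    counts_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in counts_ab.items():
        p_ab = count / n
        # p_ab / (p_a * p_b) simplifies to count * n / (count_a * count_b).
        mi += p_ab * log2(count * n / (counts_a[a] * counts_b[b]))
    return mi


# Toy alignment (made up for this example): positions 0 and 1 change
# together, while position 2 is fully conserved.
toy_alignment = ["AKL", "AKL", "GEL", "GEL"]
covarying = mutual_information(column(toy_alignment, 0), column(toy_alignment, 1))
conserved = mutual_information(column(toy_alignment, 0), column(toy_alignment, 2))
```

In the toy alignment the covarying pair scores higher than the pair that includes the conserved column. On real alignments, raw scores like this are also affected by shared ancestry among the sequences, which is the concern that phylogeny-aware algorithms aim to address.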

Additional contributions of this work include two open-source software projects, MutationFinder and the PyCogent coevolution module. These high-quality software tools allow for the reproduction, application, and expansion of the work presented in this dissertation: text mining for point mutations, and detecting coevolution of biological sequences.

 
Adviser: Lawrence Hunter
School: UNIVERSITY OF COLORADO HEALTH SCIENCES CENTER
Source: DAI/B 70-06, p. , Oct 2009
Source Type: Dissertation
Subjects: Bioinformatics
Publication Number: 3361660
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3361660