Statistical analysis of biological interactions from homologous proteins
by Xu, Qifang, Ph.D., TEMPLE UNIVERSITY, 2008, 151 pages; 3344407

Abstract:

Information fusion aims to develop intelligent approaches to integrating information from complementary sources so that a more comprehensive basis is obtained for data analysis and knowledge discovery. Our Protein Biological Unit (ProtBuD) database is the first database to integrate biological unit information from the Protein Data Bank (PDB), the Protein Quaternary Server (PQS), and the Protein Interfaces, Surfaces and Assemblies (PISA) server and to compare the three biological units side by side. Statistical analyses show significant inconsistency both within and between these databases. To resolve this inconsistency, we studied interfaces across different PDB entries in a protein family, assuming that interfaces shared by different crystal forms are likely to be biologically relevant. A novel computational method is proposed to achieve this goal. First, redundant data were removed by clustering similar crystal structures, and a representative entry was used for each cluster. Then a modified k-d tree algorithm was applied to speed up the identification of interfaces in crystals. The interface similarity functions were derived from Gaussian distributions fit to the data. Hierarchical clustering was used to group interfaces, and likely biological interfaces were defined by the number of crystal forms in a cluster. Benchmark data sets were used to determine whether the presence or absence of interfaces across multiple crystal forms can be used to predict whether a protein is an oligomer. The probability that a common interface is biological is given. An interface shared in two different crystal forms by divergent proteins is very likely to be biologically important. The interface data not only provide new interaction templates for computational modeling but also provide more accurate data for training and test sets in data-mining research to predict protein-protein interactions.
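To illustrate the interface-identification step, the following is a minimal sketch of how a k-d tree can speed up finding atom pairs in contact between two chains. It uses a plain textbook k-d tree, not the modified algorithm of the dissertation, whose details are not given in this abstract. The function names, the coordinate representation, and the cutoff value are assumptions made for illustration only.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


class KDNode:
    """One node of a 3-D k-d tree (plain version, for illustration)."""

    def __init__(self, point: Point, axis: int,
                 left: Optional["KDNode"], right: Optional["KDNode"]):
        self.point = point
        self.axis = axis
        self.left = left
        self.right = right


def build_kdtree(points: List[Point], depth: int = 0) -> Optional[KDNode]:
    """Recursively split the points at the median, cycling through x, y, z."""
    if not points:
        return None
    axis = depth % 3
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return KDNode(pts[mid], axis,
                  build_kdtree(pts[:mid], depth + 1),
                  build_kdtree(pts[mid + 1:], depth + 1))


def within_radius(node: Optional[KDNode], query: Point, radius: float,
                  found: Optional[List[Point]] = None) -> List[Point]:
    """Collect every stored point within `radius` of `query`."""
    if found is None:
        found = []
    if node is None:
        return found
    if math.dist(node.point, query) <= radius:
        found.append(node.point)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    within_radius(near, query, radius, found)
    # The far side can only hold matches if the splitting plane is in range.
    if abs(diff) <= radius:
        within_radius(far, query, radius, found)
    return found


def find_contacts(chain_a: List[Point], chain_b: List[Point],
                  cutoff: float) -> List[Tuple[Point, Point]]:
    """Return (atom_a, atom_b) pairs closer than `cutoff` (hypothetical value)."""
    tree = build_kdtree(chain_b)
    return [(a, b) for a in chain_a for b in within_radius(tree, a, cutoff)]


# Toy coordinates, not taken from any real structure.
chain_a = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
chain_b = [(1.0, 0.0, 0.0), (0.0, 3.0, 0.0), (20.0, 20.0, 20.0)]
print(len(find_contacts(chain_a, chain_b, cutoff=4.0)))
```

Building the tree once per chain means each atom query avoids most distance calculations that an all-against-all comparison would perform, which matters when many crystal neighbors must be checked.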
In summary, we developed a framework based on databases in which different biological unit information is integrated and new interface data are stored. To make the data accessible to users in the biology community, a stand-alone software program, a web site with a user-friendly graphical interface, and a web service are provided.
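The clustering step described in the abstract can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: it assumes a precomputed pairwise similarity matrix (the Gaussian-derived similarity functions are not specified here), uses average-linkage agglomerative clustering as one possible choice, and takes a hypothetical similarity threshold. All names and values are invented for the example.

```python
from typing import List, Sequence


def cluster_interfaces(sim: Sequence[Sequence[float]],
                       min_similarity: float) -> List[List[int]]:
    """Average-linkage agglomerative clustering on a similarity matrix.

    Merging stops when the best remaining pair of clusters has an average
    similarity below `min_similarity` (an assumed, hypothetical threshold).
    """
    clusters = [[i] for i in range(len(sim))]

    def avg_sim(a: List[int], b: List[int]) -> float:
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = avg_sim(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        if best[0] < min_similarity:
            break
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def crystal_forms_per_cluster(clusters: List[List[int]],
                              crystal_form: Sequence[str]) -> List[int]:
    """Count the distinct crystal forms represented in each cluster."""
    return [len({crystal_form[i] for i in c}) for c in clusters]


# Toy data: four interfaces, three of them mutually similar.
similarity = [
    [1.00, 0.90, 0.80, 0.10],
    [0.90, 1.00, 0.85, 0.10],
    [0.80, 0.85, 1.00, 0.10],
    [0.10, 0.10, 0.10, 1.00],
]
forms = ["CF1", "CF2", "CF2", "CF3"]  # hypothetical crystal-form labels

clusters = cluster_interfaces(similarity, min_similarity=0.5)
print(clusters, crystal_forms_per_cluster(clusters, forms))
```

Under the abstract's assumption, a cluster whose interfaces come from more crystal forms is a stronger candidate for a biological interface than one seen in a single crystal form.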

 
Advisers: Zoran Obradovic; Roland Dunbrack
School: TEMPLE UNIVERSITY
Source: DAI/A 70-01, p. , Apr 2009
Source Type: Dissertation
Subjects: Bioinformatics; Information science; Computer science
Publication Number: 3344407

  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3344407
