Network based prediction of protein localization using diffusion kernel
by Mondal, Ananda Mohan, Ph.D., UNIVERSITY OF SOUTH CAROLINA, 2011, 101 pages; 3489340

Abstract:

With an overwhelming amount of high-throughput biological data now available, biologists and medical researchers increasingly depend on computational algorithms for hypothesis generation and prediction. One area of bioinformatics research is the development of algorithms for predicting the subcellular localization of both monoplex and multiplex proteins. Most current localization prediction algorithms employ features derived from protein sequence data and external functional annotations such as gene ontology or physicochemical properties. However, no existing method exploits the rich localization information in a protein-protein correlation network, even though correlated proteins tend to be co-localized within the cell. Here we propose NetLoc, a novel algorithm based on a diffusion kernel and logistic regression, which predicts protein localization by exploiting protein correlation networks. NetLoc is applied to yeast protein localization prediction using four types of protein networks: physical protein-protein interaction (PPI) networks, genetic PPI networks, mixed PPI networks, and co-expressed PPI networks. Experiments showed that protein networks can provide rich information for localization prediction, achieving an AUC score of up to 0.93. We also showed that networks with high connectivity and a high percentage of co-localized PPIs lead to better prediction performance. Compared to a previous network-feature-based prediction algorithm with an AUC score of 0.52 on the yeast PPI network, NetLoc achieved significantly better overall performance, with an AUC of 0.74 on the same dataset. We also investigated how the prediction performance of NetLoc was affected by network characteristics such as the ratio of the number of co-localized PPIs (coPPI) to the number of non-co-localized PPIs (ncPPI) and the density of annotated coPPI in the network. For a given network with a specific number of proteins, NetLoc performance increases with increasing coPPI/ncPPI ratio and increasing density of annotated coPPI.
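
The abstract does not give the formulas behind NetLoc, so the following is only a minimal sketch of the general idea of pairing a diffusion kernel with logistic regression. It assumes the commonly used kernel form K = exp(-beta * L), where L is the graph Laplacian. The feature definition (summed kernel weight to annotated proteins of each location), the value of beta, the plain gradient-descent fit, and the six-protein toy network are all hypothetical choices made for illustration, not the dissertation's actual settings.

```python
import numpy as np

def diffusion_kernel(adj, beta=1.0):
    """Assumed kernel form: K = exp(-beta * L), with L = D - A the graph Laplacian."""
    adj = np.asarray(adj, dtype=float)
    laplacian = np.diag(adj.sum(axis=1)) - adj
    # L is symmetric, so its matrix exponential follows from an eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    return (eigvecs * np.exp(-beta * eigvals)) @ eigvecs.T

def location_features(kernel, labels, n_locations):
    """Feature j of protein i: total kernel weight from i to the *other*
    annotated proteins in location j (hypothetical feature definition)."""
    onehot = np.zeros((len(labels), n_locations))
    for i, loc in enumerate(labels):
        if loc is not None:
            onehot[i, loc] = 1.0
    off_diag = kernel - np.diag(np.diag(kernel))  # ignore each protein's own label
    return off_diag @ onehot

def add_bias(X):
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Binary logistic regression fitted by plain gradient descent."""
    X1 = add_bias(X)
    w = np.zeros(X1.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X1 @ w)))
        w -= lr * (X1.T @ (p - y)) / len(y)
    return w

def predict_proba(X, w):
    return 1.0 / (1.0 + np.exp(-(add_bias(X) @ w)))

# Toy network: two triangles (proteins 0-2 and 3-5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = np.zeros((6, 6))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

# Proteins 2 and 3 are unannotated query proteins.
labels = [0, 0, None, None, 1, 1]

K = diffusion_kernel(adj, beta=1.0)
X = location_features(K, labels, n_locations=2)

# One-vs-rest model for location 0, trained on the annotated proteins only.
train = [i for i, loc in enumerate(labels) if loc is not None]
y = np.array([1.0 if labels[i] == 0 else 0.0 for i in train])
w = fit_logistic(X[train], y)
probs = predict_proba(X, w)  # probs[i]: estimated probability that protein i is in location 0
```

In this toy setting, query protein 2 sits in the triangle annotated with location 0 and query protein 3 in the other triangle, so the model scores protein 2 above 0.5 and protein 3 below it. A real application would fit one such model per location and tune beta, for example by cross-validation.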

Another limitation of current protein localization algorithms is that they cannot predict multi-location proteins. NetLoc addresses this limitation by calculating probabilistic scores for all locations for each query protein. Evaluation on the yeast multi-localization protein dataset showed that the overall success rate of NetLoc is 88%, much higher than that of the existing method (73%) tested on the same dataset. Finally, we proposed and evaluated two methods for network-based localization prediction using multiple protein correlation networks. One constructs a unified protein correlation network; the other uses multiple network kernels. Experiments showed that both methods improve NetLoc performance compared to using the original individual networks.
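
The ideas in this paragraph can be illustrated with a short sketch, under assumptions the abstract does not confirm. The threshold rule for turning per-location scores into a multi-location prediction, the edge-union construction of a unified network, and the weighted-sum combination of kernels are hypothetical stand-ins; all function names and default values are illustrative only.

```python
import numpy as np

def predict_locations(scores, threshold=0.5):
    """Return every location whose score reaches the threshold; if none does,
    fall back to the single best-scoring location (hypothetical decision rule)."""
    scores = np.asarray(scores, dtype=float)
    picked = np.flatnonzero(scores >= threshold)
    if picked.size == 0:
        picked = np.array([int(np.argmax(scores))])
    return picked.tolist()

def unified_network(adjacency_matrices):
    """Option 1 (assumed form): one network whose edges are the union of the
    edges of all input networks over the same set of proteins."""
    return (np.sum(adjacency_matrices, axis=0) > 0).astype(float)

def combined_kernel(kernels, weights=None):
    """Option 2 (assumed form): weighted sum of per-network kernels. With
    non-negative weights, a sum of positive semidefinite kernels stays
    positive semidefinite."""
    kernels = np.asarray(kernels, dtype=float)
    if weights is None:
        weights = np.full(len(kernels), 1.0 / len(kernels))
    return np.tensordot(np.asarray(weights, dtype=float), kernels, axes=1)

# A query protein scored against three locations: two scores pass the threshold.
multi = predict_locations([0.8, 0.1, 0.6])
single = predict_locations([0.2, 0.3, 0.1])

# Two toy networks over the same three proteins, each contributing one edge.
net_a = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
net_b = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0]], dtype=float)
union = unified_network([net_a, net_b])

# Equal-weight combination of two toy kernels.
mixed = combined_kernel([np.eye(3), 3.0 * np.eye(3)])
```

The union keeps an interaction that appears in either network, while the kernel combination leaves each network intact and merges them only at the similarity level, which allows different networks to receive different weights.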

 
Adviser: Jianjun Hu
School: UNIVERSITY OF SOUTH CAROLINA
Source: DAI/B 73-04, p. , Jan 2012
Source Type: Dissertation
Subjects: Bioinformatics; Computer science
Publication Number: 3489340
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3489340