Functional characterization of large scale biological data
by Xie, Hongbo, Ph.D., TEMPLE UNIVERSITY, 2008, 261 pages; 3300385

Abstract:

Rapid growth in the collection of new gene- and protein-related information requires the development of appropriate data analysis techniques. Toward this goal, we propose incorporating prior knowledge of the biological functions of certain proteins into the early stages of the data analysis process.

As the first aim of our study, we propose a novel method for functionally characterizing gene expression patterns over time. Our approach is based on the analysis of functional expression profiles (FEPs), each defined as the average of the expression patterns of genes annotated with a given function. An FEP is computed only over significantly correlated patterns that also vary significantly in time. An effective clustering method is proposed to automatically discover the most informative groups of FEPs. The new method is evaluated on several important time-course gene expression datasets (including the development cycle of the malaria-related Plasmodium falciparum), where it successfully identified correlated functional expression profiles. Furthermore, clustering these functional expression profiles yielded groups of functions with similar expression patterns and close biological relevance. These results indicate that the analysis method could lead to novel biological conclusions and benefit research on various types of microarray data.

As the second aim, we propose a novel technique for incorporating knowledge of biological function into the identification of biomarker candidates. Our two-step approach for selecting genetic biomarkers from microarray data assumes that disease is characterized by deviations in the expression of genes from a limited set of functions. We start by selecting significantly differentially expressed genes using a standard statistical testing procedure. Using functional domain knowledge, we then analyze the biological functions of the selected genes to discover those that are highly overrepresented in the selection. Only the selected genes annotated with the most significant function are retained as biomarker candidates. The new method is applied to the identification of biomarkers for Chronic Fatigue Syndrome (CFS). The approach resulted in a small set of biomarkers whose functions are the most relevant to CFS, and this set was superior to a much larger set determined by the traditional one-step analysis.

We also explored the benefits of combining microarray and proteomics data for CFS identification. Using the standard procedure for preprocessing ProteinChip data, we developed a proteomics-based predictor of CFS. The results on the samples with both microarray and ProteinChip data indicate that combining predictors can improve CFS identification. Our analysis of the clinical CFS data identified factors that explain sources of CFS identification mistakes, suggesting that CFS identification could be further improved by revising the definitions of certain clinical conditions.

In the last aim, we developed a bioinformatics method to identify structurally related functions in disordered proteins. The method employs a statistical evaluation to rank the significance of identified correlations; protein sequence data redundancy and the relationship between protein length and protein structure are taken into account to ensure the quality of the statistical inference. We applied the new method to the Swiss-Prot database to identify functional keywords correlated with intrinsic disorder. This work enriches the current view of protein structure-function relationships, especially with regard to the functionalities of intrinsically disordered proteins, and provides researchers with a novel tool that could be used to improve the understanding of the relationship between protein structure and function. (Abstract shortened by UMI.)
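
The two-step biomarker selection described in the abstract can be illustrated with a minimal sketch. The abstract names neither the statistical test nor the overrepresentation measure, so this sketch assumes Welch's t statistic with a fixed cutoff for the first step and a hypergeometric tail probability for the second. The function names, the t_cutoff parameter, the gene and function labels, and the toy expression values are all hypothetical and are not taken from the dissertation.

```python
from math import comb, sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (needs nonzero variance)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def enrichment_p(k, n, K, N):
    """Hypergeometric upper tail P(X >= k): n genes selected out of N,
    of which K carry the annotation and k of those were selected."""
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


def two_step_candidates(cases, controls, annotations, t_cutoff=4.0):
    """Return (most overrepresented function, its p-value, candidate genes)."""
    universe = set(cases)
    # Step 1 (assumed test): keep genes whose |t| exceeds a fixed cutoff.
    selected = {g for g in universe
                if abs(welch_t(cases[g], controls[g])) >= t_cutoff}
    # Step 2 (assumed measure): find the function most overrepresented
    # among the selected genes.
    best_function, best_p = None, 1.0
    for function, genes in annotations.items():
        annotated = genes & universe
        k = len(annotated & selected)
        if k == 0:
            continue
        p = enrichment_p(k, len(selected), len(annotated), len(universe))
        if p < best_p:
            best_function, best_p = function, p
    if best_function is None:
        return None, 1.0, set()
    # Only selected genes annotated with that function become candidates.
    return best_function, best_p, selected & annotations[best_function]


# Toy data: expression values for four cases and four controls per gene.
cases = {
    "g1": [9.0, 9.2, 8.8, 9.1], "g2": [7.9, 8.1, 8.0, 8.2],
    "g3": [5.0, 5.2, 4.9, 5.1], "g4": [6.0, 6.1, 5.9, 6.2],
    "g5": [3.0, 3.1, 2.9, 3.2], "g6": [4.0, 4.1, 3.9, 4.2],
}
controls = {
    "g1": [5.0, 5.1, 4.9, 5.2], "g2": [4.0, 4.2, 3.9, 4.1],
    "g3": [5.1, 4.9, 5.0, 5.2], "g4": [6.1, 5.9, 6.0, 6.2],
    "g5": [6.9, 7.1, 7.0, 7.2], "g6": [4.2, 4.0, 4.1, 3.9],
}
annotations = {
    "function_A": {"g1", "g2", "g3"},
    "function_B": {"g4", "g5", "g6"},
}

function, p_value, candidates = two_step_candidates(cases, controls, annotations)
print(function, sorted(candidates))
```

In this toy run the first step selects three genes, but only the two annotated with the most overrepresented function are kept, which mirrors the abstract's point that the two-step procedure yields a smaller candidate set than a one-step analysis. With so few genes the p-value itself is not meaningful; a real analysis would also need multiple-testing correction, which the sketch omits.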

 
Adviser: Zoran Obradovic
School: TEMPLE UNIVERSITY
Source: DAI/B 69-01, p. , Apr 2008
Source Type: Dissertation
Subjects: Bioinformatics; Computer science
Publication Number: 3300385
