The correlation structure of microarray data and its statistical implications
by Chen, Linlin, Ph.D., UNIVERSITY OF ROCHESTER SCHOOL OF NURSING, 2008, 130 pages; 3333497

Abstract:

Microarray technology has become an indispensable tool as it can measure expression levels of thousands of genes simultaneously. Gene expression microarrays are widely used to construct gene networks and to find genes that are differentially expressed between two (e.g., treatment versus control) or more phenotypes. The most basic issue to be addressed in this setting is that of multiple hypothesis testing. Numerous books and papers have discussed this issue since the advent of microarray technology more than ten years ago. However, most currently practiced methods of significance testing in microarray gene expression profiling remain unstable, and the variation of the number of false discoveries is high for typical situations with Affymetrix technology. These undesirable properties are due to the fact that the number of tests is typically orders of magnitude larger than the available sample size, as well as to the presence of strong and long-range (involving thousands of genes) correlations between gene expression levels, the latter having been well documented in the literature and in our own studies.
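The instability described above can be illustrated with a small simulation. The sketch below is not the dissertation's method: it is a toy model, assuming equicorrelated standard normal test statistics under the global null, and the function name `false_discovery_counts` and all parameter values are hypothetical. It compares the spread of the number of false rejections when tests are independent versus when they share a common correlation.

```python
import random
import statistics

def false_discovery_counts(m, rho, reps, z_crit=1.96, seed=0):
    """Count false rejections among m true-null z-tests, repeated reps times.

    Each statistic is sqrt(rho) * shared + sqrt(1 - rho) * own_noise, so every
    pair of tests has correlation rho while each statistic stays standard
    normal. (Assumed toy correlation structure, not taken from the source.)
    """
    rng = random.Random(seed)
    a, b = rho ** 0.5, (1.0 - rho) ** 0.5
    counts = []
    for _ in range(reps):
        shared = rng.gauss(0.0, 1.0)  # factor common to all tests in this replicate
        rejected = sum(
            1
            for _ in range(m)
            if abs(a * shared + b * rng.gauss(0.0, 1.0)) > z_crit
        )
        counts.append(rejected)
    return counts

indep = false_discovery_counts(m=500, rho=0.0, reps=300)
corr = false_discovery_counts(m=500, rho=0.5, reps=300)

print("mean false discoveries:", statistics.mean(indep), statistics.mean(corr))
print("sd of false discoveries:", statistics.pstdev(indep), statistics.pstdev(corr))
```

Under these assumptions the average number of false rejections is about the same in both settings, but its standard deviation is several times larger when the tests are correlated, which is the kind of instability the abstract refers to.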

This dissertation focuses on possible causes of between-gene dependencies and their effects on the performance of gene selection procedures. It is commonly believed that the typically high correlations in gene pairs are attributable to technological flaws. This view, however, has been called into question by recent evidence that the random technical noise in microarray data is too low to exert a tangible effect on the sample correlation coefficients. We have constructed three new models to deepen the understanding of this question. The first extends prior work to the process of cross-hybridization, which is also believed to induce spurious correlations between gene expression signals. This stochastic model analyzes the competition of multiple probe-sets for a common transcript and finds no compelling evidence for the presence of large-scale effects of multiple targeting on the correlation structure. The goal of the second model is to explore multi-layer sources of correlation in microarray data in order to examine what kind of inference about gene-gene interactions is feasible. In particular, the model shows random effects caused by tissue heterogeneity. Although simplistic, it is consistent with some distinctive features of real data. The study is underpinned by the third, more comprehensive, mechanistically motivated model that describes the dynamics of cell populations and associated variations of gene expression in renewing tissues. Using these models, we also measure the mixture effect on the correlation structure based on real data. The new models reveal to what extent the observed high correlation can uncover the true nature of the association among genes. They show the limitation of network inference when based solely on observed sample correlation coefficients between gene expression signals.
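The mixture effect of tissue heterogeneity can be sketched in a few lines. This is not any of the three models from the dissertation: it is a minimal illustration, assuming two hypothetical cell types with fixed mean expression levels, a mixing proportion that varies from sample to sample, and independent measurement noise for each gene. The names `pearson` and `simulate_pair` and all numeric values are invented for the example.

```python
import random

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def simulate_pair(n_samples, vary_mixture, seed=0):
    """Correlation of two genes measured in a two-cell-type tissue mixture.

    Both genes are (hypothetically) high in cell type A and low in type B.
    Their noise terms are independent, so any correlation comes only from
    the sample-to-sample variation in the mixing proportion p.
    """
    rng = random.Random(seed)
    mean_a, mean_b = 10.0, 4.0  # assumed expression levels in types A and B
    g1, g2 = [], []
    for _ in range(n_samples):
        p = rng.uniform(0.2, 0.8) if vary_mixture else 0.5
        level = p * mean_a + (1.0 - p) * mean_b
        g1.append(level + rng.gauss(0.0, 0.5))
        g2.append(level + rng.gauss(0.0, 0.5))
    return pearson(g1, g2)

r_mixed = simulate_pair(2000, vary_mixture=True)
r_fixed = simulate_pair(2000, vary_mixture=False)
print("correlation with varying mixture:", round(r_mixed, 3))
print("correlation with fixed mixture:", round(r_fixed, 3))
```

In this toy setting the two genes do not interact at all, yet they appear strongly correlated once the tissue composition varies across samples. That is one way a high sample correlation can fail to reflect a direct gene-gene association.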

Another basic issue considered in this dissertation concerns improved methods of significance testing for microarray data analysis. We present a new multiple testing procedure that balances type I and type II errors in an optimal way, together with improvements to its statistical methodology based on a recently proposed delta-sequence method. The new procedure better characterizes the correlation structure of microarray data, thus adding a new tool for removing the main obstacle standing in the way of gene selection procedures and other types of microarray data analysis.

 
Advisers: Andrei Yakovlev; Anthony Almudevar
School: UNIVERSITY OF ROCHESTER SCHOOL OF NURSING
Source: DAI/B 69-10, p. , Dec 2008
Source Type: Dissertation
Subjects: Biostatistics; Bioinformatics
Publication Number: 3333497

  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3333497