The dysregulated activity of oncogenic transcription factors contributes to neoplastic transformation by promoting aberrant expression of target genes involved in regulating cell homeostasis. Characterizing the regulatory networks controlled by these transcription factors is therefore a critical objective in understanding the molecular mechanisms of cell transformation. Modern high-throughput technologies are providing the first window into regulatory processes at the genome scale. They raise the prospect that computational inference algorithms will produce models of regulatory networks that revolutionize our understanding and treatment of cancer by (1) describing how genomic alterations cause functional disruptions in the network regulating cell homeostasis, leading to aberrant cell growth and cancer, and (2) predicting therapeutic interventions in which critical components of the network are targeted to revert the cancer phenotype.
This thesis develops methods that advance the state of the art in inferring transcriptional regulatory networks from high-throughput data, with specific application to both gene expression and ChIP-on-chip data. Prior to this thesis, several methods had been proposed to infer regulatory networks from microarray data; however, because of their high computational complexity, these methods were applicable only to model organisms such as yeast. Moreover, all of them relied to some extent on assumptions that are not biologically realistic. Here, I develop a novel method, based on information theory, that overcomes these limitations: it has low computational complexity, allowing application to mammalian systems, and it makes minimal assumptions about the structure of the network or about the type of statistical interaction between genes (e.g., linear models). I apply this method to reconstruct the first genome-wide regulatory network inferred from microarray data for mammalian cells, and further demonstrate how it can be used to deduce regulatory interactions between subnetworks controlled by different oncogenes, using only microarray data. I then extend this analysis, again using the tools of information theory, to the inference of interactions involving more than two variables. To do so, I provide a rigorous definition of statistical dependency in the multivariate setting, which had not previously been available. I demonstrate that this framework effectively identifies groups of genes that interact in a pathway to jointly regulate a common set of targets.
While the microarray analysis methods are motivated by issues specific to inferring gene regulatory networks, the resulting algorithmic advances are novel from a purely mathematical and computational perspective. They should be generally applicable to reverse engineering networks from measurements of the interacting variables, a problem that arises both in other branches of systems biology (e.g., metabolic networks, neural networks) and in scientific applications outside of systems biology (e.g., social networks, electrical networks).
In the second part of the thesis, I consider the analysis of ChIP-on-chip experiments, a new technology that more directly measures transcription factor-chromatin interactions. I show that existing methods for analyzing these data are unable to assign meaningful statistical significance scores (p-values) to bound promoters, owing to a number of flawed assumptions. I then develop a data-driven method that accurately predicts the extent of TF/DNA binding and reveals an order of magnitude more interactions than previous methods. I demonstrate that this method, when combined with DNA sequence and gene expression data, can deduce regulatory networks of substantially greater complexity than previously appreciated. Moreover, I use this method to analyze the interaction between the regulatory networks controlled by two important proto-oncogenes (MYC and NOTCH1), which the gene expression-based analysis of the first part predicted to overlap to a statistically significant degree. This analysis reveals that the two networks in fact overlap almost completely, with MYC and NOTCH1 jointly regulating several thousand targets.
Much additional work must be done in this new field, both computationally and technologically, to reach the goal of building predictive models able to describe the connection between genomic alterations and malignancies such as cancer. This thesis nonetheless takes steps in this direction by developing computational methods that leverage cutting-edge genome-wide measurement technologies to understand the regulatory networks controlling cellular function and homeostasis. The resulting systems-level view of transcriptional regulation already reveals fundamentally greater complexity than previously anticipated, altering the traditional view of genetic regulatory networks.