Integrating clinical and genetic information to improve clinicians' ability to estimate an individual's disease risk is an important biomedical research challenge. This dissertation develops a "risk index" procedure that combines clinical data and genome-wide genotypes to make predictions about individuals' risk of disease.
For a set of 100 simulated datasets containing 1,000 individuals, 8 clinical covariates, 500 single nucleotide polymorphisms (SNPs), and an outcome prevalence of 30%, the average area under the receiver operating characteristic (ROC) curve (AUC) for a risk index model built with clinical covariates and SNPs was significantly higher than that for a model built with clinical covariates alone (0.846 vs. 0.832, p=0.0002). A risk index model that included the principal components accounting for 90% of the variability in the SNPs also had a significantly higher average AUC than a model built with clinical covariates alone (0.839 vs. 0.826, p=0.008). For a set of 25 simulated datasets containing 10,000 individuals, 29 clinical covariates, and 38,835 SNPs, a significant difference in average AUC was observed between the clinical+genotype and clinical models (0.939 vs. 0.926, p=0.001) when the 500 SNPs most highly associated with the outcome were used. A risk index model including the 500 largest principal components of the 38,835 SNPs did not significantly increase the mean AUC beyond that of the clinical model (0.931 vs. 0.931, p=0.98).
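The two quantities that recur in these comparisons, the AUC and the number of principal components needed to reach a given share of variance, can be illustrated with a minimal sketch. It is not the risk index procedure itself, which is not specified here; the function names `auc` and `n_components_for_variance` are hypothetical, and the inputs are assumed to be plain lists of model scores, 0/1 outcome labels, and per-component variances.

```python
from itertools import groupby


def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    Equals the probability that a randomly chosen case (label 1) scores
    higher than a randomly chosen control (label 0), with ties counted as 1/2.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one case and one control")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0])
    rank_sum_pos = 0.0
    position = 0  # number of observations already ranked
    for _, group in groupby(ranked, key=lambda pair: pair[0]):
        group = list(group)
        # Tied scores share the average of the ranks they span.
        avg_rank = position + (len(group) + 1) / 2
        rank_sum_pos += avg_rank * sum(1 for _, y in group if y == 1)
        position += len(group)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def n_components_for_variance(component_variances, target=0.90):
    """Smallest number of leading components whose variance share >= target."""
    total = sum(component_variances)
    cumulative = 0.0
    for k, var in enumerate(sorted(component_variances, reverse=True), start=1):
        cumulative += var
        if cumulative / total >= target:
            return k
    return len(component_variances)


# Toy inputs (illustrative only, not data from this study).
print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))   # 0.75
print(n_components_for_variance([60, 35, 3, 2]))  # 2
```

In the toy call, three of the four case-control pairs are ordered correctly, giving an AUC of 0.75, and the two largest components together hold 95% of the variance, so two components satisfy the 90% threshold.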
The risk index methodology was then applied to individuals from the Framingham Heart Study using 27 clinical covariates and 48,071 SNPs. Clinical+genotype risk index models built to predict ten-year incident hypertension, ten-year incident diabetes, or prevalent hypertension had AUCs of 0.475, 0.682, and 0.692, respectively, using the 500 SNPs most highly associated with the outcome, and AUCs of 0.563, 0.782, and 0.712, respectively, using the 500 largest principal components of the SNPs.
The results from these analyses suggest that the risk index methodology has utility for predicting an individual's risk of developing a chronic disease, and that using principal components of a large set of SNPs in place of a smaller selected set of associated SNPs provides better predictive performance.