Clustering by genetic ancestry using genome-wide single nucleotide polymorphisms and incorporating genetic ancestry into genetic risk prediction models
by Solovieff, Nadia, Ph.D., BOSTON UNIVERSITY, 2011, 126 pages; 3463284

Abstract:

Genome-wide association studies (GWAS) have detected disease-associated variants and increased the feasibility of building genetic risk prediction models. Population stratification (PS) causes spurious associations in GWAS; it occurs when allele frequencies of genetic markers differ between cases and controls because of ancestral differences rather than the disease. Principal components analysis (PCA) is the established approach for detecting PS and for adjusting genetic associations for stratification by including the top principal components (PCs) in the analysis. An alternative solution to PS is genetic matching of cases and controls, which, however, requires well-defined population strata from which to select cases and controls appropriately. The strata defined for matching also allow the investigator to examine cluster-specific effects, which can enhance our understanding of disease-associated variants and improve the accuracy of risk prediction models.
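
As a rough illustration of what "top principal components" means here, the sketch below computes PC scores from a genotype matrix with a plain singular value decomposition. It is not code from the thesis: the function name `top_principal_components`, the per-SNP standardization, and the simulated two-population data are all assumptions made for the example.

```python
import numpy as np

def top_principal_components(genotypes, n_components=2):
    """Return PC scores, one row per individual.

    genotypes: (n_individuals, n_snps) array of allele counts (0, 1 or 2).
    Standardizing each SNP is an assumption of this sketch.
    """
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)          # center each SNP
    std = centered.std(axis=0)
    keep = std > 0                         # drop SNPs with no variation
    scaled = centered[:, keep] / std[keep]
    # Singular values come back sorted, so the first columns are the top PCs.
    u, s, _ = np.linalg.svd(scaled, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Hypothetical toy data: two populations with different allele frequencies.
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, size=200)
freq_b = np.clip(freq_a + rng.normal(0, 0.2, size=200), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(50, 200))
pop_b = rng.binomial(2, freq_b, size=(50, 200))
pcs = top_principal_components(np.vstack([pop_a, pop_b]), n_components=2)
```

In a PC-adjusted analysis, columns of `pcs` would be added as covariates to the association model for each SNP.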

In this thesis, we propose a new approach to testing genetic associations and building genetic risk models from GWAS data in the presence of PS. We first design a novel algorithm that uses the top PCs from a PCA to cluster individuals of similar ancestry into groups within which cases and controls can be matched. We demonstrate the effectiveness of our algorithm in real and simulated data, and show that matching cases and controls substantially reduces PS bias and can be more powerful than adjustment for PCs.
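
The abstract does not describe the clustering algorithm itself, so the sketch below substitutes ordinary k-means on the PC scores purely to illustrate the general idea: group individuals by ancestry, then keep only the groups that contain both cases and controls. The names `kmeans` and `matched_strata`, the choice of k, and the simulated scores are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; a stand-in, not the thesis's algorithm."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster is empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def matched_strata(labels, is_case):
    """Map cluster -> (case indices, control indices), skipping clusters
    that lack either cases or controls."""
    labels = np.asarray(labels)
    is_case = np.asarray(is_case, dtype=bool)
    strata = {}
    for j in np.unique(labels):
        members = np.flatnonzero(labels == j)
        cases = members[is_case[members]]
        controls = members[~is_case[members]]
        if len(cases) and len(controls):
            strata[int(j)] = (cases, controls)
    return strata

# Hypothetical PC scores for two well-separated ancestry groups.
rng = np.random.default_rng(1)
pcs = np.vstack([rng.normal(-3, 0.5, size=(40, 2)),
                 rng.normal(3, 0.5, size=(40, 2))])
is_case = np.tile([True, False], 40)
labels = kmeans(pcs, k=2)
strata = matched_strata(labels, is_case)
```

Each entry of `strata` could then feed a stratified association test, so that cases are only compared with controls of similar ancestry.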

Next, we use the algorithm to examine the population substructure of African Americans with sickle cell disease and show that they are less genetically admixed than African Americans without the disease and have ancestry similar to populations from western Africa.

Finally, we propose an approach to building a genetic risk prediction model that incorporates ethnic-specific effects. We extend the framework of a Bayesian naïve classifier to include ancestry and show how a prediction can be made even when the ancestry of an individual is unknown. We compare the Bayesian classifiers to logistic regression models that include a genetic risk score. We show that incorporating ancestry improves the accuracy of prediction in both the Bayesian and logistic regression frameworks, but that the accuracy is higher for the Bayesian classifier.
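
One common way to predict with a naïve Bayes model when a conditioning variable such as ancestry is missing is to sum over its possible values, weighted by their prior probabilities. The abstract does not say that the thesis does exactly this, so the sketch below is only an assumption of how such a prediction could work; the function `posterior_case` and every probability in the toy example are made up for illustration.

```python
import numpy as np

def posterior_case(genotype, p_anc, p_case_given_anc, p_geno, ancestry=None):
    """P(case | genotype) under a naive Bayes model with ancestry.

    genotype:          (n_snps,) allele counts in {0, 1, 2}
    p_anc:             (n_anc,) prior probability of each ancestry group
    p_case_given_anc:  (n_anc,) disease prevalence within each group
    p_geno:            (n_anc, 2, n_snps, 3), P(g_i = g | D = d, A = a)
    ancestry:          index of the known group, or None if unknown
    """
    genotype = np.asarray(genotype)
    p_anc = np.asarray(p_anc, dtype=float)
    p_case = np.asarray(p_case_given_anc, dtype=float)
    snp_idx = np.arange(len(genotype))
    # SNPs are treated as independent given disease status and ancestry.
    # (For many SNPs one would stay in log space to avoid underflow.)
    lik = np.exp(np.log(p_geno[:, :, snp_idx, genotype]).sum(axis=2))
    prior_d = np.stack([1 - p_case, p_case], axis=1)       # (n_anc, 2)
    if ancestry is None:
        weight = p_anc                    # unknown: average over groups
    else:
        weight = np.zeros(len(p_anc))
        weight[ancestry] = 1.0            # known: condition on the group
    joint = weight[:, None] * prior_d * lik
    marginal = joint.sum(axis=0)          # sum out ancestry
    return marginal[1] / marginal.sum()

# Hypothetical one-SNP example: informative in group 0, not in group 1.
p_geno = np.array([
    [[[0.6, 0.3, 0.1]], [[0.2, 0.3, 0.5]]],   # group 0: control, case
    [[[0.3, 0.4, 0.3]], [[0.3, 0.4, 0.3]]],   # group 1: control, case
])
p_anc = [0.5, 0.5]
p_case_given_anc = [0.5, 0.5]
known_0 = posterior_case([2], p_anc, p_case_given_anc, p_geno, ancestry=0)
known_1 = posterior_case([2], p_anc, p_case_given_anc, p_geno, ancestry=1)
unknown = posterior_case([2], p_anc, p_case_given_anc, p_geno)
```

With unknown ancestry the result is a weighted average of the ancestry-specific predictions, so `unknown` always falls between `known_1` and `known_0`.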

 
Adviser: Paola Sebastiani
School: BOSTON UNIVERSITY
Source: DAI/B 72-09, p. , Sep 2011
Source Type: Dissertation
Subjects: Genetics; Statistics; Epidemiology
Publication Number: 3463284
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3463284