This experiment predicts the ancestry of subject 163 from the publicly available Human Genome Diversity Project (HGDP) using genetic data. The subject is represented by a collection of Single Nucleotide Polymorphism (SNP) genotypes. SNPs are variations in single nucleotides at specific loci in the human genome; they are the small fraction of positions that vary across the human population, and they can carry signatures of variation across geographic sub-populations. The experiment considers only seven broad continent-level populations, and the ancestry prediction gives the subject's estimated percentage from each of them. The verified ancestry of subject HGDP 163 is the Sindhi sub-population of the Central and South Asian population.

The algorithm deployed is a simplified version of the well-known ADMIXTURE/STRUCTURE model in population genetics. Using a pre-selected subset of ancestry-informative SNPs, the algorithm aggregates the likelihood score for each major population across all SNPs to estimate the population proportions, assuming the SNPs are independent.
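As a rough sketch of what aggregating the likelihood across independent SNPs can look like, the log-likelihood of the ancestry proportions may be written as below. The notation is introduced here for illustration and is not taken from the experiment's code: $q_k$ is the subject's proportion from population $k$ (with $K = 7$), $a_{j1}$ and $a_{j2}$ are the subject's two alleles at SNP $j$, and $f_{j,a,k}$ is the frequency of allele $a$ at SNP $j$ in population $k$. The sketch also assumes the two allele copies at a locus are independent draws.

$$
\ell(q) = \sum_{j=1}^{J} \sum_{c=1}^{2} \log\Big( \sum_{k=1}^{K} q_k \, f_{j,a_{jc},k} \Big),
\qquad q_k \ge 0, \quad \sum_{k=1}^{K} q_k = 1 .
$$

The estimated proportions are the $q$ that maximizes $\ell(q)$ with the allele frequencies held fixed.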
The overview of the experiment is shown below:

![Overview of the experiment]

It requires **two input files**:

**File 1:** The SNP data file for the subject, containing the following columns:

1. **rsid:** The ID of the SNP
2. **chr:** The chromosome containing the SNP
3. **pos:** The position of the SNP on the chromosome
4. **genotype:** The subject's allele pair for the SNP

![sample of subject's snp data input format]

**File 2:** The population allele frequency file, containing the allele frequencies in all seven populations for the 500 ancestry-informative markers. The **snpNM** column is the SNP ID, which matches the **rsid** column in the previous file. The **Allele** column contains a specific allele label, and the seven population columns contain the estimated frequency of that allele within each population.

![Population allele frequencies for seven major populations]

The allele frequencies for each SNP are computed as follows.

First, a small set of ancestry-informative SNPs is selected by analyzing the SNP data collected from the roughly one thousand subjects in the Human Genome Diversity Project study. These samples come from 52 sub-populations worldwide. The top Laplacian eigenfunctions of the subjects' adjacency matrix were used to summarize the population structure in the samples. A regularized sparse regression on the eigenfunctions was then performed to score individual SNPs and identify the 500 most ancestry-informative genetic markers. Interested readers can find details of this procedure in the published article [Ancestral Informative Marker Selection and Population Structure Visualization Using Sparse Laplacian Eigenfunctions].

Next, for each SNP in the list of informative SNPs, the frequency of each allele (i.e., "A", "G", "C" or "T") is estimated within each major population.
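The way the two input files line up can be sketched in Python as follows. This is a minimal illustration, not the experiment's own loading code: the CSV layout, the made-up rows, the two population columns `Africa` and `Europe` (the real file has seven), and the helper names `read_rows` and `align_genotypes` are all assumptions. It joins the files on **rsid** / **snpNM** and counts how many copies of each listed allele the subject carries.

```python
import csv
import io

# Made-up rows in the two layouts described above (NOT real HGDP data).
SNP_CSV = """rsid,chr,pos,genotype
rs0001,1,1000,AG
rs0002,2,2000,CC
rs0003,3,3000,TT
"""
FREQ_CSV = """snpNM,Allele,Africa,Europe
rs0001,A,0.9,0.2
rs0001,G,0.1,0.8
rs0002,C,0.5,0.7
rs0002,T,0.5,0.3
"""

def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

def align_genotypes(snp_rows, freq_rows):
    """Attach the subject's allele count to every (SNP, allele) frequency row."""
    genotype_by_rsid = {row["rsid"]: row["genotype"] for row in snp_rows}
    aligned = []
    for row in freq_rows:
        genotype = genotype_by_rsid.get(row["snpNM"])
        if genotype is None:
            continue  # marker not genotyped for this subject
        freqs = {pop: float(val) for pop, val in row.items()
                 if pop not in ("snpNM", "Allele")}
        aligned.append({
            "snp": row["snpNM"],
            "allele": row["Allele"],
            "count": genotype.count(row["Allele"]),  # 0, 1 or 2 copies
            "freqs": freqs,
        })
    return aligned

aligned = align_genotypes(read_rows(SNP_CSV), read_rows(FREQ_CSV))
for entry in aligned:
    print(entry["snp"], entry["allele"], entry["count"], entry["freqs"])
```

SNPs that appear in the subject's file but not among the informative markers (here `rs0003`) are simply ignored, since only the marker list drives the join.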
**Code**: Given the two input data files, a Python implementation of the ADMIXTURE algorithm is used to estimate the ancestry proportions.

- Step 1: Load the input genotype and allele frequency files.

  ![Python code loading the input files]

- Step 2: Assuming each individual draws varying proportions of ancestry from the seven populations, the ADMIXTURE algorithm is used to estimate these proportions. A standard multinomial probability model is constructed for each SNP locus. The overall log-likelihood is a simple sum over all SNPs, assuming the SNPs are roughly independent. Our experiment implements a simpler variant of the standard ADMIXTURE model in which the sub-population membership of the training data is given and fixed, so the population allele frequencies do not need to be estimated iteratively. For a complete reference, see the article by Alexander, Novembre and Lange: [Fast model-based estimation of ancestry in unrelated individuals].

  ![Python code for computing the likelihood at each SNP]

- Step 3: Finally, an EM algorithm estimates the ancestry proportions for a test subject from the allele frequencies learned from the training data.

  ![Python code for the EM algorithm]

**Result Output:**

![Result page giving the estimated genetic proportions from different populations]

From the result output above, one can see that Central and South Asian ancestry dominates the genome, with small percentages from Africa, Europe and the Middle East. The result roughly matches the subject's self-reported ancestry.
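To make Steps 2 and 3 more concrete, here is a minimal sketch of an EM update for ancestry proportions when the population allele frequencies are held fixed. It is not the experiment's actual code (which is shown only as images): the function names `log_likelihood` and `em_ancestry`, the toy two-population data, and the small frequency floor `eps` are all assumptions made for illustration. Each input row corresponds to one observed allele copy and lists the frequency of that allele in every population.

```python
import math

def log_likelihood(q, allele_freqs):
    """Sum over allele copies of log(sum_k q_k * f_k), treating copies as independent."""
    return sum(math.log(sum(qk * fk for qk, fk in zip(q, row)))
               for row in allele_freqs)

def em_ancestry(allele_freqs, n_iter=200, tol=1e-9, eps=1e-6):
    """Estimate mixing proportions q by EM with allele frequencies held fixed.

    allele_freqs: one row per observed allele copy; row[k] is the frequency
    of that allele in population k. Frequencies are floored at eps (an
    assumption of this sketch) so the logarithm is always defined.
    """
    rows = [[max(f, eps) for f in row] for row in allele_freqs]
    k = len(rows[0])
    q = [1.0 / k] * k                       # start from uniform proportions
    prev = log_likelihood(q, rows)
    for _ in range(n_iter):
        totals = [0.0] * k
        for row in rows:
            # E-step: posterior probability that this copy came from each population
            weights = [qk * fk for qk, fk in zip(q, row)]
            norm = sum(weights)
            for i, w in enumerate(weights):
                totals[i] += w / norm
        # M-step: new proportions are the average responsibilities
        q = [t / len(rows) for t in totals]
        cur = log_likelihood(q, rows)
        if cur - prev < tol:                # stop once the likelihood stalls
            break
        prev = cur
    return q

# Toy data: six allele copies, two hypothetical populations.
copies = [(0.9, 0.1), (0.8, 0.2), (0.9, 0.2),
          (0.7, 0.1), (0.2, 0.9), (0.85, 0.15)]
q_hat = em_ancestry(copies)
print([round(x, 3) for x in q_hat])
```

Because the frequencies are fixed, each iteration only re-weights the proportions, which is what makes this variant simpler than the full ADMIXTURE model that alternates between estimating frequencies and proportions.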
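*This experiment was created by a Microsoft employee.*

[Overview of the experiment]: https://raw.githubusercontent.com/aniu/aml/master/expView_HGDP163.PNG
[sample of subject's snp data input format]: https://raw.githubusercontent.com/aniu/aml/master/snpData.PNG
[Population allele frequencies for seven major populations]: https://raw.githubusercontent.com/aniu/aml/master/top500_allele.PNG
[Ancestral Informative Marker Selection and Population Structure Visualization Using Sparse Laplacian Eigenfunctions]: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0013734
[Python code loading the input files]: https://raw.githubusercontent.com/aniu/aml/master/pythoncode1.PNG
[Fast model-based estimation of ancestry in unrelated individuals]: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2752134/
[Python code for computing the likelihood at each SNP]: https://raw.githubusercontent.com/aniu/aml/master/Pythoncode3.PNG
[Python code for the EM algorithm]: https://raw.githubusercontent.com/aniu/aml/master/Pythoncode4.PNG
[Result page giving the estimated genetic proportions from different populations]: https://raw.githubusercontent.com/aniu/aml/master/hgdp163.PNG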