
March 10, 2012

Trying TreeMix on HapMap populations

I gave TreeMix a try on samples of 30 individuals from all the HapMap-3 populations. The following was done with ~260k SNPs and default parameters. I specified -m 4, that is, I allowed for 4 migration events, and specified YRI as the outgroup to produce this result.

I'll have to explore the intricacies of the visualization parameters a little further, and see how to extract some raw numbers from the output files, but overall, the results seem quite reasonable:

If I'm interpreting the figure correctly, there seems to be:

  • progressively more drift (left-to-right) from Africa to Mexico (the only Native American-admixed population)
  • Mexicans are related to East Asian populations (CHD, CHB and especially JPT) and have a strong migration edge from a West Eurasian source (related to CEU and TSI)
  • African Americans (ASW) have a similar (but weaker) edge from West Eurasians
  • Maasai (MKK) have an even weaker edge from what appears to be a southern Caucasoid population (represented here by Tuscans TSI)
  • Gujarati Indians (GIH) are intermediate between East Asians and Europeans, and also have input from a West Eurasian-type population.


This seems like a neat tool for exploring human (and canid ;-) population history.

March 18, 2011

Analysis of 1000 Genomes + HapMap 3 data

A reader tipped me off about the availability of data from the 1000 Genomes Project genotyped on the Illumina Omni 2.5 chip. Out of 2.5 million or so SNPs, there are about 720,000 with rs-numbers in the working dataset. There are a few new populations in the data:
  • GBR (Great Britain)
  • FIN (Finland)
  • IBS (Iberian Spanish)
  • CLM (Colombians)
  • MXL (Mexican Americans from Los Angeles)
  • PUR (Puerto Ricans)
I've been rebuilding my various datasets to account for common markers, high quality SNPs, and linkage disequilibrium, so this is based on about 133,000 markers. I also limited the number of individuals to 25 per population.

I took the HapMap-3 data to make sure that the integration was correct and ran various analytical techniques over the joint dataset of 17 populations and 425 individuals.

Multidimensional Scaling

As expected the three poles correspond to West Eurasians (top left, GBR, CEU, TSI), East Eurasians (bottom left, CHB, CHD, JPT), and Sub-Saharan Africans (YRI).

Other populations fall in between the three poles: for example, FIN is slightly removed from West Eurasians in an East Eurasian direction, Mexicans and Gujarati Indians (GIH) are in between West and East Eurasians, and African Americans (ASW) and Maasai East Africans (MKK) are in between Sub-Saharan Africans and West Eurasians.

Clusters Galore Analysis

I then used the Clusters Galore approach to cluster individuals. As I've mentioned before, individuals with quite distinct origins may overlap in the MDS representation, and the Galore approach is able to discover distinct clusters by looking at several dimensions at the same time, and using a state-of-the-art clustering algorithm, MCLUST.

As can be seen in the MDS plot, Mexicans and Gujarati Indians overlap, as well as African Americans and Maasai. Obviously these populations are completely different mixtures that happen to coincide in genomic space due to the relatedness of their ancestral components that intermixed at different times and in different continents.

Here are the results of the Galore analysis. With 20 MDS dimensions retained (the maximum I considered) there were 35 clusters in the MCLUST solution that maximized the Bayes Information Criterion.


This is quite instructive:
  • Some populations (FIN and YRI) form their own very specific clusters #2 and #35
  • Some clusters join 2 or more populations. For example White Americans (CEU) and Britons (GBR) form cluster #1
  • Latinos form several clusters, especially the Mexicans. This should've been anticipated from the MDS plot where they are shown to be widely dispersed (quite variable). In essence, Latinos are not homogeneous populations but sets of individuals possessing variable admixture proportions
Note also that some populations that are folded into a single cluster in this analysis (e.g., Spanish and Tuscans in #3) can in fact be distinguished from each other, although not so easily in the first 20 dimensions considered here, as these are dominated by more salient features of the global genetic landscape.

ADMIXTURE analysis

I then ran ADMIXTURE over the dataset for K=5.

Here are the admixture proportions corresponding to this plot:


This is quite instructive with respect to the absence of particular reference populations: Finns show East Eurasian influences in the form of "Native American" (1.5%) and "East Asian" (6.2%) elements. Clearly, we don't have to imagine Native Americans moving into Finland, and these two components are stand-ins for the Siberian ancestors of the European Finns. Similarly, Spanish show African admixture (1.6%). This is also probably due to both North and Sub-Saharan African elements, but the absence of appropriate North African references makes the distinction impossible. Finally, the Maasai show European and African admixture. This may be due to the non-emergence of a specific East African component at this level of resolution, as well as the absence of appropriate West Asian Caucasoid groups that are more likely to have influenced them. The absence of West Asian reference populations also probably affects Tuscans as their West Asian admixture may be misinterpreted as South Asian.

Here are the Fst distances between components:


This is also instructive: the South Asian component, in the absence of relatively unadmixed South Asian references is closer to Europeans than to East Asians. In fact, it is a composite of West Eurasian and indigenous South Asian population elements, the latter being distantly related to East Asians. Similarly, in the absence of Amerindian references, the Native American component (a bit of a misnomer) is equidistant to Europeans and East Asians. In fact, it is also a composite of West Eurasian and pre-Columbian American populations.


Conclusion

The Omni 2.5 data seem to work fine, and genome bloggers can anticipate good things in the future from the 1000 Genomes Project, as many more populations are in the pipeline. Clearly, the full-sequence data will probably be too much to handle for most hobbyists at the moment, but for anthropological investigations the 2.5 million SNPs will be more than enough.

The few experiments I carried out here also served to highlight the problems associated with using a limited number of reference populations. But, thankfully, this was a contrived problem aimed to make a point: there are now publicly available data for most major human populations, so the field is wide open for anyone interested in the study of human variation.

December 08, 2010

Genome-wide analysis of population structure in the Finnish Saami

The K=6 ADMIXTURE results from the supplementary material can be seen below:

This is based on ~38k SNPs.

It is unfortunate that they included Native American HGDP populations, but did not include the most relevant published data on Siberians that I first used to study population structure across north Eurasia here and here and here.

Hence, they discover a "Native American"-like component in Saami, which in all likelihood can be further resolved into Siberian-specific components utilizing the Rasmussen et al. dataset.

The "closest approximation" to the East Eurasian component in Saami in the HGDP panel are the Yakuts, but finer-scale analysis (see my previous posts) reveals that the Yakuts are made up almost entirely of an Altaic-specific component tying them to Turkic, Mongol, and Tungusic populations, while the eastern component in European Finns, Vologda Russians and Chuvashs has relationships with Central Siberians such as Kets, Selkups, and Nganasans, all of which are missing in this paper.

Hopefully this data will become publicly available online for re-analysis with the relevant populations included.

European Journal of Human Genetics advance online publication 8 December 2010; doi: 10.1038/ejhg.2010.179

A genome-wide analysis of population structure in the Finnish Saami with implications for genetic association studies

Jeroen R Huyghe et al.

The understanding of patterns of genetic variation within and among human populations is a prerequisite for successful genetic association mapping studies of complex diseases and traits. Some populations are more favorable for association mapping studies than others. The Saami from northern Scandinavia and the Kola Peninsula represent a population isolate that, among European populations, has been less extensively sampled, despite some early interest for association mapping studies. In this paper, we report the results of a first genome-wide SNP-based study of genetic population structure in the Finnish Saami. Using data from the HapMap and the human genome diversity project (HGDP-CEPH) and recently developed statistical methods, we studied individual genetic ancestry. We quantified genetic differentiation between the Saami population and the HGDP-CEPH populations by calculating pair-wise FST statistics and by characterizing identity-by-state sharing for pair-wise population comparisons. This study affirms an east Asian contribution to the predominantly European-derived Saami gene pool. Using model-based individual ancestry analysis, the median estimated percentage of the genome with east Asian ancestry was 6% (first and third quartiles: 5 and 8%, respectively). We found that genetic similarity between population pairs roughly correlated with geographic distance. Among the European HGDP-CEPH populations, FST was smallest for the comparison with the Russians (FST=0.0098), and estimates for the other population comparisons ranged from 0.0129 to 0.0263. Our analysis also revealed fine-scale substructure within the Finnish Saami and warns against the confounding effects of both hidden population structure and undocumented relatedness in genetic association studies of isolated populations.

Link

November 29, 2010

Clusters galore in HGDP panel

For background on this type of analysis, please read:

I've taken the Stanford HGDP dataset and extracted the markers common to it and to HapMap-3, Behar et al. (2010), Rasmussen et al. (2010) and the 23andMe v2 genotyping platform, or about 500k SNPs in total (I removed C/G and A/T SNPs as a precaution and flipped strand in discordant ones to the HapMap-3 standard when it differed from that of HGDP).
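The strand precautions mentioned above can be sketched in a few lines of Python. This is only an illustration of the logic, not the script I actually ran; the function names are hypothetical and alleles are assumed to be single-letter strings.

```python
# Hypothetical helpers illustrating the strand precautions (not the original script).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_strand_ambiguous(a1, a2):
    """True for A/T and C/G SNPs, whose alleles read the same on either strand."""
    return COMPLEMENT[a1] == a2

def flip(alleles):
    """Return the allele pair as read from the opposite strand."""
    return tuple(COMPLEMENT[a] for a in alleles)

def harmonize(alleles, reference):
    """Express alleles on the reference strand.

    Returns None when the SNP should be dropped: either it is strand-ambiguous,
    or a strand flip cannot reconcile it with the reference alleles.
    """
    if is_strand_ambiguous(*reference) or is_strand_ambiguous(*alleles):
        return None
    if set(alleles) == set(reference):
        return tuple(alleles)
    if set(flip(alleles)) == set(reference):
        return flip(alleles)
    return None

print(harmonize(("T", "C"), ("A", "G")))  # discordant strand, gets flipped
print(harmonize(("A", "T"), ("A", "T")))  # ambiguous, dropped
```

Dropping A/T and C/G SNPs costs some markers, but for those SNPs a strand mismatch cannot be detected from the alleles alone, which is why removing them is the cautious choice.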

I removed SNPs with less than 99% genotyping rate in any of the four data sources, and about 434k SNPs were retained. Subsequently I applied linkage disequilibrium-based pruning on the HGDP set (PLINK parameter: --indep-pairwise 50 5 0.3) resulting in a final dataset of about 177k SNPs. In all analyses of the HGDP set, I followed the recommendations of Rosenberg et al. (2006) keeping the 940 individuals in common between his 952-individual panel and the Stanford data.

Subsequently I ran multidimensional-scaling (MDS) on the 940 individual/57 population/177k SNP set in PLINK, and then I applied model-based clustering as implemented in mclust over the first 42 MDS dimensions, with a maximum number of clusters = 70. In total there were 64 clusters in the optimal solution suggested by mclust (*)

Before I give the results, it might be worth looking at the pairwise MDS scatterplots for just the first 5 dimensions:

As you can see, clusteredness emerges in different dimensions. Rather than inspecting innumerable 2D combinations visually (and indeed we should inspect 3D, etc. as well, because clusters might emerge in 3D and higher subspaces that are not discernible in 2D projections), we let mclust iterate over k, the number of clusters, and different shapes, orientations, and volumes of clusters, using the well-known EM algorithm together with the Bayes Information Criterion to choose a good solution that maximizes detail without sacrificing parsimony.

Below you can see how many individuals are assigned to each of the 64 clusters from each of the 57 populations:


This is rather astonishing. There are many clusters with 100% correspondence to HGDP populations. A few populations, mostly from regions with high levels of inbreeding are split into multiple sub-clusters, perhaps reflecting some type of tribal affinity. And, there are a few populations, such as Tuscans and North Italians that are not split. But, the fact that this was inferred from unlabeled individuals is remarkable.

I remember reading Rosenberg et al. (2002), "The genetic structure of human populations" (pdf) which used structure, a model-based algorithm on raw genetic data to infer the existence of 6 clusters corresponding to continental populations. How is it that so much more detail can be achieved today?

There are three reasons: First, dense genotyping data are much better than the few hundreds of microsatellites used by Rosenberg et al. (2002). Second, the use of dimensionality reduction in the form of MDS allowed us to remove most of the "noise" in the genotyping data and focus on dimensions capturing a lot of distinctions. Third, the use of a sophisticated clustering algorithm such as mclust which can adapt to clusters of different shape, size, and orientation without human input was able to produce this result. mclust is computationally expensive, but it works like a charm (in a few minutes) with a few dozen dimensions and about a thousand individuals, producing a clustering of obviously good value.

How to repeat the experiment

If anyone wants to repeat this experiment they can do it easily. After you've managed to put the HGDP data into PLINK ped/map format, say in files HGDP.ped and HGDP.map (or any other data for that matter), just run

> plink --cluster --mds-plot d --file HGDP

Where d is the number of dimensions you want to retain. This produces a plink.mds file with a header line, followed by one line per individual; the individual's coordinates in the first d dimensions are in columns 4 to d+3.
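For anyone who prefers to inspect the file outside R first, here is a minimal Python sketch of reading that layout. It assumes whitespace-separated columns with three identifier columns before the coordinates; the header names and values in the example are made up for illustration, and read_mds is a hypothetical helper.

```python
def read_mds(lines, d):
    """Parse the lines of a plink.mds-style file into (ids, coords).

    Assumes whitespace-separated columns: three identifier columns followed
    by the d MDS coordinates (columns 4 to d+3, counting from 1).
    """
    rows = [line.split() for line in lines if line.strip()]
    body = rows[1:]                                   # skip the header line
    ids = [(r[0], r[1]) for r in body]                # first two identifier columns
    coords = [[float(x) for x in r[3:3 + d]] for r in body]
    return ids, coords

# Tiny made-up example with d = 2 (illustrative values, not real data).
example = """FID IID SOL C1 C2
fam1 ind1 0 0.0123 -0.0456
fam2 ind2 0 -0.0311 0.0222
""".splitlines()

ids, coords = read_mds(example, d=2)
print(ids)
print(coords)
```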

Then, in R, after you install and load the mclust package (see the MCLUST page for limitations on its use and licensing information), you just run:

> library(mclust)
> d <- 42  # the number of dimensions given to --mds-plot (42 in this post)
> MDS <- read.table("plink.mds", header=TRUE)
> maxclust <- 70
> MCLUST <- Mclust(MDS[, 4:(d+3)], G=1:maxclust)

where maxclust is the maximum number of clusters you want to consider.

Then, if you run:

> MCLUST$z

you will see a table in which each row corresponds to an individual and the i-th column gives the probability that the individual belongs to the i-th cluster.
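To turn such a table into a hard assignment, one takes the most probable cluster for each individual, and one minus that probability is a natural measure of how uncertain the assignment is. As far as I know, mclust exposes the same quantities as MCLUST$classification and MCLUST$uncertainty. The Python sketch below, with a hypothetical classify function and made-up probabilities, just spells out the arithmetic.

```python
def classify(z):
    """For each row of posterior probabilities, return (cluster, uncertainty).

    cluster is the 1-based index of the most probable cluster;
    uncertainty is 1 minus that highest probability.
    """
    result = []
    for row in z:
        best = max(range(len(row)), key=lambda i: row[i])
        result.append((best + 1, 1.0 - row[best]))
    return result

# Made-up posterior rows for three individuals and three clusters.
z = [
    [0.99, 0.01, 0.00],
    [0.36, 0.00, 0.64],   # a borderline individual
    [0.00, 1.00, 0.00],
]
labels = classify(z)
for cluster, uncertainty in labels:
    print(cluster, round(uncertainty, 2))
```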

There's much more that you can do in R with the mclust package, but this is enough for anyone wanting to repeat the experiment in its basic form.

(*) The number of clusters in the optimal solution varied between 11 with 2 dimensions retained and 64 with 42 dimensions retained. There was a secondary maximum of 60 clusters with 30 dimensions retained; choosing more dimensions than 42 (up to 50 that I examined), also resulted in a very high number of clusters, but I've decided to keep the one with 42 dimensions and 64 clusters as it is enough to serve the purpose of this post.

Human effective sex ratio: different at different time scales

The authors manage to harmonize the seemingly contradictory results of Keinan et al. and Hammer et al.

From the paper:
Recently, two studies estimated Q in order to detect sex biases in similar human populations [16,17] and found seemingly contradictory conclusions [25]. Using SNP data from the International HapMap Project [26], Keinan et al. found evidence for a male bias during the dispersal of modern humans out of Africa (Figure 1A) [17]. Hammer and colleagues, however, found evidence for a female bias throughout human history in six populations from the Human Genome Diversity Panel (HGDP) (Figure 1A) [16].

This figure from the paper shows the model inferred by the authors which resolves the seeming contradiction.

They write:
Long-term sex-biased processes, such as polygyny or higher female dispersal rates in ancestral human populations, likely caused the Qπ estimates found by Hammer et al.
but:
The male bias detected by Keinan et al. can be explained by a recent event associated with the out-of-Africa dispersal, as initially proposed by the authors. The Q ratios detected by Keinan et al. suggest a very strong male bias for the entire portion of the non-African lineage before the split of Asians from Europeans.
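For readers unfamiliar with this kind of statistic, here is a small Python sketch of why an X-to-autosome comparison is informative about sex bias. It assumes that Q stands for the ratio of X-chromosomal to autosomal effective population size and uses the standard textbook formulas for a population with given numbers of breeding males and females; the function name and the example numbers are mine, not the paper's.

```python
def effective_size_ratio(n_males, n_females):
    """Expected ratio of X-chromosomal to autosomal effective population size
    (textbook formulas; an idealized model, not the paper's estimators)."""
    ne_autosomes = 4 * n_males * n_females / (n_males + n_females)
    ne_x = 9 * n_males * n_females / (4 * n_males + 2 * n_females)
    return ne_x / ne_autosomes

equal = effective_size_ratio(1000, 1000)          # no sex bias
female_biased = effective_size_ratio(500, 1500)   # more breeding females
male_biased = effective_size_ratio(1500, 500)     # more breeding males

print(equal, female_biased, male_biased)
```

With equal numbers the ratio is three quarters; an excess of breeding females pushes it above that value and an excess of males pushes it below, which is the sense in which the two studies' estimates point in opposite directions.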

I am not entirely convinced of this explanation. The authors' model suggests a higher male/female ratio in Eurasians than in Africans due to male bias in the Eurasian lineage against an ancestral background of high female/male ratio (due to polygyny).

But, an alternative explanation is that the higher female/male ratio in Africans is due to the fact that they are descended from a relatively small number of males who overwhelmed the pre-existing African gene pool.

There are reasons to believe this is the case: Africa has the deepest lineages in the human Y-chromosome phylogeny (A and B), but the balance is made up entirely of haplogroup E chromosomes, the sister clade of Eurasian D. The extremely diverse Eurasian haplogroup F is represented only by some subclades in Africa, due to back-migration.

So, while Eurasian males are descended from the expansion of F and DE males, African males are largely descended from the expansion of E males. These are the Afrasians I've often spoken of, the common ancestors of Eurasians and Africans. In Africa, the Afrasians could take the women of the Paleo-Africans, but Eurasia was largely empty land, and the Eurasians could only take the women they've brought with them.


The American Journal of Human Genetics, 24 November 2010
doi:10.1016/j.ajhg.2010.10.021

Estimators of the Human Effective Sex Ratio Detect Sex Biases on Different Timescales

Leslie S. Emery

Determining historical sex ratios throughout human evolution can provide insight into patterns of genomic variation, the structure and composition of ancient populations, and the cultural factors that influence the sex ratio (e.g., sex-specific migration rates). Although numerous studies have suggested that unequal sex ratios have existed in human evolutionary history, a coherent picture of sex-biased processes has yet to emerge. For example, two recent studies compared human X chromosome to autosomal variation to make inferences about historical sex ratios but reached seemingly contradictory conclusions, with one study finding evidence for a male bias and the other study identifying a female bias. Here, we show that a large part of this discrepancy can be explained by methodological differences. Specifically, through reanalysis of empirical data, derivation of explicit analytical formulae, and extensive simulations we demonstrate that two estimators of the effective sex ratio based on population structure and nucleotide diversity preferentially detect biases that have occurred on different timescales. Our results clarify apparently contradictory evidence on the role of sex-biased processes in human evolutionary history and show that extant patterns of human genomic variation are consistent with both a recent male bias and an earlier, persistent female bias.

Link

November 27, 2010

Clusters galore: extremely fine-scale ancestry inference

By way of introduction, here is the command that literally made me jump from my seat:
> MCLUST <- Mclust(X,G=1:36)
Warning messages:
1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
best model occurs at the min or max # of components considered
2: In Mclust(X, G = 1:36) :
optimal number of clusters occurs at max choice
It may look like gibberish, but this is what happened when I tried to apply model-based clustering, as implemented in the R package mclust, to the first few dimensions of a Multidimensional Scaling (MDS) of the standard 36-population, 692-individual dataset I have been using in the Dodecad Project.

But, let's take the story from the beginning...

The basic idea

When we look at an MDS or PCA plot, like the following MDS plot of the 11 HapMap-3 populations, it is obvious that individuals form clusters.

Here are dimensions 1 and 2:
West and East Eurasians each form a cluster, and Africans form an elongated cluster towards West Eurasians. Gujaratis and Mexicans overlap between West and East Eurasians.

Here are dimensions 2 and 3:
Here, the Gujarati are shown to be quite different from the Mexicans.

We can use a standard clustering algorithm such as k-means to infer the existence of these clusters. This has two benefits:
  • We don't have to visually inspect a large number of 2D scatterplots (one for every pair of dimensions)
  • We can put some actual numbers on our visual impression of the existence of clusters
Actually, k-means is not a very good way to find clusters, for two reasons:
  • You have to specify k. But how can you know which k is supported by the data, unless you look at a large number of 2D scatterplots?
  • k-means, using the Euclidean distance measure, prefers "spherical" clusters. But, as you can see, some populations, especially recently admixed ones, form elongated clusters, stretched towards their two (or more) ancestral populations.
I had previously used mclust, a model-based clustering algorithm to infer the existence of 14 different clusters in a standard worldwide craniometric dataset. This was 6 years ago, and only recently have geneticists been able to reach that level of resolution with genomic data.

But, for assessing ancestry, genomic data are obviously much better than craniometric ones: the latter reflect both genes and environmental/developmental factors.

So, while 6 years ago I had neither the computing power nor the data to push the envelope of fine-scale ancestry inference, today that's possible.

What mclust does (in a nutshell)

mclust has many bells and whistles for anyone willing to study it, but the basic idea is this: the program iterates between different k and different "forms" of clusters (e.g., spherical or ellipsoidal) and finds the best one.

Best is defined as the one that maximizes the Bayes Information Criterion. Without getting too technical, this tries to balance the "detail" of the model (how many parameters it has, e.g., k) with its parsimony (how conservative it is in inferring the existence of phantom clusters).
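To make the trade-off concrete, here is a toy Python sketch: a one-dimensional Gaussian mixture fitted by a bare-bones EM loop, scored with the criterion in its "larger is better" form, 2 log L minus (number of parameters) times log n, which is the sign convention mclust reports as far as I know. It is nothing like the real mclust implementation; the data are made up and all function names are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_loglik(xs, weights, means, variances):
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in xs
    )

def fit_mixture(data, k, iters=50):
    """Fit a 1D Gaussian mixture with k components by EM; return its log-likelihood."""
    xs = sorted(data)
    n = len(xs)
    # Start from k equal-sized chunks of the sorted data.
    chunks = [xs[i * n // k:(i + 1) * n // k] for i in range(k)]
    weights = [len(c) / n for c in chunks]
    means = [sum(c) / len(c) for c in chunks]
    variances = [sum((x - m) ** 2 for x in c) / len(c) for c, m in zip(chunks, means)]
    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([q / total for q in p])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
    return mixture_loglik(xs, weights, means, variances)

def bic(loglik, n_params, n):
    """Criterion in the 'larger is better' form: fit reward minus complexity penalty."""
    return 2 * loglik - n_params * math.log(n)

# Made-up data: two well-separated groups of six values each.
data = [-5.2, -5.1, -4.9, -5.0, -4.8, -5.3, 4.9, 5.1, 5.0, 5.2, 4.8, 5.3]
# A k-component 1D mixture has k means, k variances and k-1 free weights.
scores = {k: bic(fit_mixture(data, k), 3 * k - 1, len(data)) for k in (1, 2)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The two-cluster model pays a larger penalty (five parameters instead of two), but its fit is so much better on these data that it still wins; a cluster that did not improve the fit enough would be rejected by the same penalty.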

How to combine mclust with PCA or MDS

mclust does not work on 0/1 binary SNP data; it needs scalar data such as skull measurements. However, that's not a problem, because you can convert 0/1 (or ACGT) SNP data into scalar variables using either MDS or PCA.

From a few hundred thousand SNPs, representing each individual, you get a few dozen numerical values placing the individual along each of the first few dimensions of MDS or PCA.

You can then run mclust over that reduced-dimensional representation. This is exactly what I attempted to do.
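The first step of that conversion is a matrix of pairwise genetic distances between individuals, which MDS then embeds in a few dimensions. The Python sketch below shows one simple allele-sharing distance on genotypes coded as allele counts; it is an assumption-laden illustration (PLINK's own computation may differ in its details), the function names are hypothetical, and the genotypes are made up.

```python
def ibs_distance(g1, g2):
    """Average allele-sharing distance between two individuals.

    Genotypes are counts (0, 1 or 2) of one allele per SNP; None marks a
    missing call. Per SNP the distance is |g1 - g2| / 2, so it is 0 when
    both alleles are shared and 1 when none are.
    """
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    return sum(abs(a - b) for a, b in pairs) / (2 * len(pairs))

def distance_matrix(genotypes):
    """Symmetric matrix of pairwise distances, the input an MDS routine needs."""
    n = len(genotypes)
    return [[ibs_distance(genotypes[i], genotypes[j]) for j in range(n)] for i in range(n)]

# Made-up genotypes for three individuals at five SNPs.
genotypes = [
    [0, 1, 2, 0, 1],
    [0, 1, 2, 1, 1],
    [2, 1, 0, None, 0],
]
D = distance_matrix(genotypes)
for row in D:
    print(row)
```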

Clusters galore in HapMap-3 populations

I had previously used ADMIXTURE to infer admixture in the HapMap-3 populations, reaching K=9. So, naturally, I wanted to see whether the approach I just described could do as well as ADMIXTURE.

I used about 177k SNPs after quality control and linkage-disequilibrium-based pruning and ran MDS as implemented in PLINK over a set of 275 individuals, 25 from each of the 11 HapMap-3 populations. I kept 11 dimensions, equal to the number of populations.

MDS took a few minutes to complete. Subsequently I ran mclust on the 275 individuals, allowing k to be as high as 11. Thus, if there were as many clusters as populations, I wanted mclust to find them. mclust finished running in a second. Here are the results (population averages):
The software essentially rediscovered the existence of 10 different populations in the data, but was unable to split the Denver Chinese from the Beijing Chinese. Notice also a mysterious low-frequency component in the Maasai reminiscent of that which appeared in the previous ADMIXTURE experiment.

A question might arise: why do most of these populations look completely unadmixed? Even the Mexicans and African Americans get their own cluster. This is due to mclust's ability to use clusters of different shape. In particular, the "best" model was the one called "VVI", which allows for diagonal clusters of varying volume and shape. In short, the software detected the presence of the elongated clusters associated with the admixed groups.
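A toy Python example may help show why cluster shape matters. With made-up numbers, a point lying along the long axis of a stretched cluster is nearer, in plain Euclidean distance, to the center of a compact neighbouring cluster, yet a Gaussian with a diagonal covariance matrix (the kind of cluster a "diagonal" model allows) assigns it a much higher density under the stretched cluster. The function and variable names are hypothetical.

```python
import math

def euclidean(x, center):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, center)))

def diag_log_density(x, mean, variances):
    """Log-density of a Gaussian with a diagonal covariance matrix."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (a - m) ** 2 / (2 * v)
        for a, m, v in zip(x, mean, variances)
    )

# Two made-up clusters in 2D: A is compact, B is stretched along the first axis.
mean_a, var_a = (0.0, 0.0), (1.0, 1.0)
mean_b, var_b = (10.0, 0.0), (25.0, 0.25)

point = (4.0, 0.0)  # on B's long axis, but nearer to A's center

dist_a, dist_b = euclidean(point, mean_a), euclidean(point, mean_b)
logp_a = diag_log_density(point, mean_a, var_a)
logp_b = diag_log_density(point, mean_b, var_b)

print("nearest center:", "A" if dist_a < dist_b else "B")
print("most likely cluster:", "A" if logp_a > logp_b else "B")
```

A nearest-center rule in the spirit of k-means would hand this point to the compact cluster, while the shape-aware density keeps it with the elongated one, which is how an admixed group stretched between its ancestral populations can still be recovered as a single cluster.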

Indeed, the approach I am describing is not really measuring admixture. It is quantifying the probability that a sample is drawn from each of a set of inferred populations. Hence it is not really suitable for recently admixed individuals, but works like a charm in guessing the population labels of unlabeled individuals.

Clusters galore in Eurasia

Let's now see what clusters are inferred in the 36-population 692-individual dataset I commonly use in the Dodecad Project. This is done with 177k SNPs, 36 MDS dimensions retained, and allowing k to be as high as 36. This is what made me jump from my seat, and since I don't have enough colors to represent it, I'll put it in tabular form:

I could hardly believe this when I saw it, but the conclusion is inescapable: dozens of distinct populations can be inferred from unlabeled individuals, and these largely correspond, on a posteriori inspection, to the individuals' population labels.

UPDATE: The above table has the average probabilities for the 36 clusters, but a better way might be to look at how many individuals are assigned to each cluster from each population:


For example, out of the 28 French individuals, 23 are assigned to cluster #1 (the French-CEU cluster), and 5 to cluster #3 (the North-Italian/Spanish/Tuscan cluster).
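Building such a table is just a matter of counting (population, cluster) pairs once every individual has been given its most probable cluster. Here is a minimal Python sketch with made-up labels that reproduce the French example; crosstab is a hypothetical helper.

```python
from collections import Counter

def crosstab(populations, clusters):
    """Count individuals per (population, cluster) pair."""
    return Counter(zip(populations, clusters))

# Made-up labels reproducing the French example: 23 in cluster 1, 5 in cluster 3.
populations = ["French"] * 28
clusters = [1] * 23 + [3] * 5

table = crosstab(populations, clusters)
for (pop, cluster), n in sorted(table.items()):
    print(pop, cluster, n)
```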

Some interesting observations:
  1. Some populations (e.g., CEU and French, or Belorussians and Lithuanians) remain unsplit even at K=36.
  2. Some populations are split into multiple components (e.g., Sardinians into 2)
  3. Some mini-clusters emerge (e.g., 4 clusters in Maasai, each of them corresponding to 8% of 25 = 2 individuals). These may correspond to pairs of relatives or very genetically close individuals.

Quantifying uncertainty

Naturally, we want to be able to assess how good a particular classification is. Fortunately, this is easy to do with mclust and its uncertainty feature. Looking at my 692-individual dataset, 687 have an uncertainty level of less than 5%, and 682 of less than 1%. I did not inspect the rest fully, but some of them are "borderline" individuals who might belong to several components, e.g., a Frenchman who could either go to the CEU-French cluster #1 (36% probability) or the North/Central Italian-Spanish cluster #3 (64% probability).

Here is a dendrogram of the 36 components:



What does it all mean?

What this means, in short, is that the day of extremely fine-scale ancestry inference has arrived. We already had premonitions of this in the ability of researchers to place individuals within a few hundred km of their place of birth in Europe. Now, it is clear that model-based clustering + MDS/PCA can infer ethnic/national identity, or something quite close to it.

This is obviously just the beginning. I allowed K to vary from 1 to 36, not really hoping that the optimal number of clusters would be 36. This raises the question: more than 36?

...

UPDATE:

I have followed up on this exciting new technique in the Dodecad Project blog:

November 26, 2010

ADMIXTURE on the shores of the Indian Ocean

I have applied Multidimensional Scaling and ADMIXTURE on a dataset of 15 populations:
Cambodians, Papuan, NAN_Melanesian, Gujarati, Malayan, Paniya, North_Kannadi, Sakilli, Singaporean Indians, Singaporean Chinese, Singaporean Malay, Yemenese, Saudis, Maasai, Ethiopians
These were collected from HGDP, Behar et al. (2010), HapMap-3, and the Singapore Genome Variation Project. There are 423 individuals in total (I've used samples of 25 individuals from the HapMap populations).

Here is the MDS plot:



At the bottom are the Papuans, relatively unadmixed Australoids. Close to them, but deviating towards East Eurasians are the NAN Melanesians; these are the Nasioi, Papuan speakers from Bougainville, which they inhabit together with Austronesian speakers.

At the top left are the Singaporean Chinese (CHS) who are Mongoloids. Deviating from them towards Indians are the Cambodians, a Southeast Asian group which according to physical anthropology is a basically Mongoloid population, but admixed with a pre-Mongoloid southern population element similar to that which has been preserved in India. Similar to them are the Singaporean Malay (MAS), another population that is basically Mongoloid but has absorbed Indian-like population elements.

The Singaporean Indians (INS), the North Kannada, the Sakilli and the Gujarati (GIH25) form the third population element in the region of interest.

The other two are the Caucasoids, represented here by the Saudis (with the Yemenese spread toward Africa), and the East Africans, represented by the more Caucasoid-admixed Ethiopians and the relatively unadmixed Maasai (MKK25).

These are the main population elements of our region of interest: Ethiopids and Australoids framing the Ocean on the west and east; the South Asians occupying India, and the Mongoloids occupying Southeast Asia, having absorbed the Indian-like former inhabitants of the region.

Here is a blowup of the middle part of the MDS plot, focusing on the Indians:
It's fairly clear that North Kannada and Sakilli (South Indians) occupy a place that is furthest from Caucasoids, while Gujarati and Singaporean Indians are positioned towards Caucasoids (to the top-right).

Let's now turn to ADMIXTURE to confirm the visual impression from the MDS:

Notice the following components:
  1. Light blue, Indian
  2. Dark blue, East African
  3. Light green, Southeast Asian
  4. Dark green, Chinese Mongoloid
  5. Pink, Arabian Caucasoid
  6. Red, Australoid
Finally, here is the table of Fst distances between these 6 inferred components:

Notice the small distance (0.023) between Chinese and Southeast Asian Mongoloids. The Indian component is equidistant between Caucasoids and Mongoloids, but as the MDS plot makes clear, and as the study of Y-chromosome and mtDNA polymorphisms has shown, the distinctive component in Indians is sui generis and not the result of admixture between Caucasoids and Mongoloids. And, finally, the Australoid component is clearly distant from all of the above.

November 25, 2010

Some Indians as genetically diverse as Africans, recent Out of Africa in serious trouble?

Razib alerts me to a very interesting new paper, which discovered that some Indian populations are more diverse than Africans in a sequenced 100kb region. I will have much more to say about this once I digest it fully, but as I said in my recent review of the Oceania paper, I don't believe in long "interludes" of humans leaving Africa and then spending tens of thousands of years camping in one place before starting to expand again. I don't believe that there is evidence for Neandertal admixture in Eurasians either; the title of my post hints at what I believe. Update to follow.

Thanks to the Jorde Lab for making their genotype data easily accessible online! They're an example for others to follow.

UPDATE:

Here is the crux of the paper:
As previously observed, heterozygosity (a measure of genetic diversity) decreases with distance from East Africa (represented here by the Luhya LWK HapMap population). The only trouble is that this pattern disappears once the Indian populations are included in the analysis.

Things are even worse though: the Luhya are an admixed population. Thus, their level of heterozygosity is inflated because of their relatively recent admixture associated with the spread of Bantu languages. Remove them, and it's clear the pattern of diminution of genetic diversity from East Africa completely disappears. Indeed, I am convinced that this pattern may be completely due to the admixed status of East Africans; the Maasai are another HapMap population from East Africa that seems to be missing from this analysis, and it is less heterozygous than the Luhya.

Even though the pattern of diminution of genetic diversity from East Africa (in autosomal genes, at least) may be largely due to the admixed status of East Africans, the same could be true for the Indian groups, who are largely composed of an indigenous "Ancestral South Indian", and an invasive "Ancestral North Indian" component. But, the point is that these two groups must have been substantially differentiated to produce a larger level of heterozygosity than in the Africans.
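To make the admixture argument concrete, here is a minimal sketch (my own toy illustration, not the paper's method) of expected heterozygosity at a single biallelic SNP, 2p(1-p), in two source populations and in their mixture. The allele frequencies and the 50/50 mixing proportion are made-up values chosen only for illustration.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1-p) at a biallelic locus under Hardy-Weinberg."""
    return 2.0 * p * (1.0 - p)


def admixed_frequency(p1, p2, alpha):
    """Allele frequency in a population drawing a fraction alpha from source 1."""
    return alpha * p1 + (1.0 - alpha) * p2


# Hypothetical allele frequencies in two differentiated source populations.
p_source1, p_source2 = 0.1, 0.9
p_mix = admixed_frequency(p_source1, p_source2, alpha=0.5)

print("source 1:", expected_heterozygosity(p_source1))
print("source 2:", expected_heterozygosity(p_source2))
print("admixed: ", expected_heterozygosity(p_mix))
```

The more the two sources differ in allele frequency, the larger the boost in the mixture; if they have identical frequencies, mixing them changes nothing.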

A caveat should be registered: genetic diversity in African hunter-gatherers (Bushmen and Pygmies) may be even higher than in the Yoruba and Luhya. Also, the mtDNA phylogeny is pretty unambiguous about the matrilineal origin of humanity being in Africa. And, the earliest known fossils of anatomically modern humans are in Africa. Thus, some kind of Out-of-Africa scenario still finds support in the data.

What no longer finds support in the data is the idea of a recent Out-of-Africa exodus 40-60 thousand years ago. The authors of the current paper write:
the divergence time between African and the ancestral Eurasian population (88-112 kya, CIs: 63-150 kya) is much older than the divergence time among the Eurasian groups (27-39 kya, CI: 20-59 kya).
A divergence between Africans and Eurasians 100 kya is consistent with the paleoanthropological finds from the Levant and China, showing the presence of anatomically modern humans thousands of kilometers apart at that time outside Africa. If there was an Out of Africa, it happened 100 thousand years ago.

The second important point concerns the supposed maintenance of a Eurasian population outside Africa, in the Levant, for tens of thousands of years before the breakup of the Eurasians:

There are serious reasons to doubt this hiatus:

First, the presence of AMH in the Levant and China 100,000 years ago is hardly consistent with the maintenance of a geographically circumscribed population of Eurasians in the Levant until 40,000 years ago.

Second, it is hardly parsimonious that such a population would maintain itself in a geographically circumscribed area for so long. If they moved from East Africa to the Near East, why on earth would they stop there?

In my opinion two underappreciated factors should be considered:
  • Gene flow within Eurasia reduces divergence times between Eurasians; West, South, and East Eurasians did not branch out from a common ancestor; there were episodes of gene flow between them, some of them very recent, some of them beyond any record. Such lateral gene flow did not abolish differences between them, but it would have reduced the inferred divergence time.
  • Gene flow between Afrasians (i.e., Eurasians' unadmixed ancestors in East Africa) and other Palaeoafricans inhabiting other parts of the continent would have increased the inferred divergence time between Africans and Eurasians.
These two factors might suffice to explain the observed pattern, without invoking a long hiatus.

Genome Biology 2010, 11:R113 doi:10.1186/gb-2010-11-11-r113

Genetic diversity in India and the inference of Eurasian population expansion

Jinchuan Xing et al.

Abstract (provisional)
Background

Genetic studies of populations from the Indian subcontinent are of great interest because of India's large population size, complex demographic history, and unique social structure. Despite recent large-scale efforts in discovering human genetic variation, India's vast reservoir of genetic diversity remains largely unexplored.

Results
To analyze an unbiased sample of genetic diversity in India and to investigate human migration history in Eurasia, we resequenced one 100 kb ENCODE region in 92 samples collected from three castes and one tribal group from the state of Andhra Pradesh in south India. Analyses of the four Indian populations, along with eight HapMap populations (692 samples), showed that 30% of all SNPs in the south Indian populations are not seen in HapMap populations. Several Indian populations, such as the Yadava, Mala/Madiga, and Irula, have nucleotide diversity levels as high as those of HapMap African populations. Using unbiased allele-frequency spectra, we investigated the expansion of human populations into Eurasia. The divergence time estimates among the major population groups suggest that Eurasian populations in this study diverged from Africans during the same time frame (approximately 90-110 thousand years ago). The divergence among different Eurasian populations occurred more than 40,000 years after their divergence with Africans.

Conclusions
Our results show that Indian populations harbor large amounts of genetic variation that have not been surveyed adequately by public SNP discovery efforts. Our data also support a delayed expansion hypothesis in which an ancestral Eurasian founding population remained isolated long after the out-of-Africa diaspora, before expanding throughout Eurasia.

Link

October 10, 2010

ADMIXTURE on African HapMap populations

Here is the result of running ADMIXTURE on the three African HapMap-3 populations, using about 440K SNPs, including Tuscans as a non-African group.

The Tuscans are in purple and show no trace of African admixture. All the other populations are separated: red: Luhya (Bantu); green: Maasai (Nilotes); turquoise: Yoruba (Niger-Congo).

The two east African groups show asymmetrical affinities: the Maasai have some Luhya red, whereas the Luhya have little Maasai green but substantial West African turquoise, consistent with the origin of their Bantu language.

October 08, 2010

ADMIXTURE on HapMap 3 populations

Here is the result of running ADMIXTURE on about 50K markers from the HapMap 3 populations. I'll annotate the final K=9 run; the rest are given at the end:

I list the distinctive colors for the populations, Left-to-Right. The minor components are easy enough to pick up and as expected:

ASW (A): African ancestry in Southwest USA [Sub-Saharan blue]

CEU (C): Utah residents with Northern and Western European ancestry from the CEPH collection [European yellow]

CHB (H): Han Chinese in Beijing, China [East Asian orange]
CHD (D): Chinese in Metropolitan Denver, Colorado [East Asian orange]

GIH (G): Gujarati Indians in Houston, Texas [South Asian purple]

JPT (J): Japanese in Tokyo, Japan [East Asian light green]

LWK (L): Luhya in Webuye, Kenya [East African bright green]

MEX (M): Mexican ancestry in Los Angeles, California [Amerindian pink]

MKK (K): Maasai in Kinyawa, Kenya [East African light blue]

TSI (T): Tuscan in Italy [European yellow]

YRI (Y): Yoruban in Ibadan, Nigeria (West Africa) [Sub-Saharan blue]

The rest of the runs for K=3 to K=8 are below:

K=3: Notice: Asian red+Caucasoid blue on Gujarati Indians and Mexicans. At this level of resolution, these two populations look similar. Notice presence of blue in East Africans but not Yorubans (green at the end).
K=4: East Africans get their own (yellow) cluster. Notice the diminution of the Sub-Saharan (purple) element, relative to the previous figure. This is due to the fact that the East African element is intermediate between Caucasoids and Sub-Saharans and "eats up" the other two elements, although residual Caucasoid red and Sub-Saharan purple remain. The tripartite origin of Mexicans is especially visible in this plot, with components being in order of European, East Eurasian, Sub-Saharan.
K=5: Gujarati Indians now get their own (purple) cluster. Here the difference between Luhya (mostly Sub-Saharan blue + East African yellow) and Maasai (the reverse) is quite striking. The former are Bantu speakers and thus not indigenous to east Africa.
K=6: Mexicans get their own cluster (blue) reflecting their Amerindian, rather than East Asian ancestry which could not be resolved in the previous figures.
K=7: the Luhya get their own cluster, splitting off from the Maasai. There are now 3 components in Africa centered on Yorubans (light blue), Luhya (very light green) and Maasai (red).
K=8: A low-frequency element appears in Maasai that is hard to interpret; it is preserved in the next and final K=9 run (shown at the beginning of the post), in which the Japanese and Chinese are split off.

March 04, 2009

Geographical affinities of HapMap samples

PLoS ONE doi:10.1371/journal.pone.0004684

Geographical Affinities of the HapMap Samples

Miao He et al.

Abstract

Background

The HapMap samples were collected for medical-genetic studies, but are also widely used in population-genetic and evolutionary investigations. Yet the ascertainment of the samples differs from most population-genetic studies which collect individuals who live in the same local region as their ancestors. What effects could this non-standard ascertainment have on the interpretation of HapMap results?

Methodology/Principal Findings

We compared the HapMap samples with more conventionally-ascertained samples used in population- and forensic-genetic studies, including the HGDP-CEPH panel, making use of published genome-wide autosomal SNP data and Y-STR haplotypes, as well as producing new Y-STR data. We found that the HapMap samples were representative of their broad geographical regions of ancestry according to all tests applied. The YRI and JPT were indistinguishable from independent samples of Yoruba and Japanese in all ways investigated. However, both the CHB and the CEU were distinguishable from all other HGDP-CEPH populations with autosomal markers, and both showed Y-STR similarities to unusually large numbers of populations, perhaps reflecting their admixed origins.

Conclusions/Significance

The CHB and JPT are readily distinguished from one another with both autosomal and Y-chromosomal markers, and results obtained after combining them into a single sample should be interpreted with caution. The CEU are better described as being of Western European ancestry than of Northern European ancestry as often reported. Both the CHB and CEU show subtle but detectable signs of admixture. Thus the YRI and JPT samples are well-suited to standard population-genetic studies, but the CHB and CEU less so.

Link

February 17, 2009

Y-chromosomes of Mormon founders and HapMap Utahns

Feel free to post in the comments any information you can deduce from these haplotypes (haplogroup, origin, etc.).
Am J Hum Genet. 2009 Feb;84(2): 251-8

Inferential genotyping of Y chromosomes in Latter-Day Saints founders and comparison to Utah samples in the HapMap project.

Gitschier J.

One concern in human genetics research is maintaining the privacy of study participants. The growth in genealogical registries may contribute to loss of privacy, given that genotypic information is accessible online to facilitate discovery of genetic relationships. Through iterative use of two such web archives, FamilySearch and Sorenson Molecular Genealogy Foundation, I was able to discern the likely haplotypes for the Y chromosomes of two men, Joseph Smith and Brigham Young, who were instrumental in the founding of the Latter-Day Saints Church. I then determined whether any of the Utahns who contributed to the HapMap project (the "CEU" set) is related to either man, on the basis of haplotype analysis of the Y chromosome. Although none of the CEU contributors appear to be a male-line relative, I discovered that predictions could be made for the surnames of the CEU participants by a similar process. For 20 of the 30 unrelated CEU samples, at least one exact match was revealed, and for 17 of these, a potential ancestor from Utah or a neighboring state could be identified. For the remaining ten samples, a match was nearly perfect, typically deviating by only one marker repeat unit. The same query performed in two other large databases revealed fewer individual matches and helped to clarify which surname predictions are more likely to be correct. Because large data sets of genotypes from both consenting research subjects and individuals pursuing genetic genealogy will be accessible online, this type of triangulation between databases may compromise the privacy of research subjects.

Link

December 06, 2008

Genetic structure in East Asia using 200K SNPs

The table of paired Fst values for East Asian populations is here. The PCA plots are seen on the left.

PLoS ONE 3(12): e3862. doi:10.1371/journal.pone.0003862

Analysis of East Asia Genetic Substructure Using Genome-Wide SNP Arrays

Chao Tian et al.

Abstract

Accounting for population genetic substructure is important in reducing type 1 errors in genetic studies of complex disease. As efforts to understand complex genetic disease are expanded to different continental populations the understanding of genetic substructure within these continents will be useful in design and execution of association tests. In this study, population differentiation (Fst) and Principal Components Analyses (PCA) are examined using >200 K genotypes from multiple populations of East Asian ancestry. The population groups included those from the Human Genome Diversity Panel [Cambodian, Yi, Daur, Mongolian, Lahu, Dai, Hezhen, Miaozu, Naxi, Oroqen, She, Tu, Tujia, Naxi, Xibo, and Yakut], HapMap [ Han Chinese (CHB) and Japanese (JPT)], and East Asian or East Asian American subjects of Vietnamese, Korean, Filipino and Chinese ancestry. Paired Fst (Weir and Cockerham) showed close relationships between CHB and several large East Asian population groups (CHB/Korean, 0.0019; CHB/JPT, 0.00651; CHB/Vietnamese, 0.0065) with larger separation with Filipino (CHB/Filipino, 0.014). Low levels of differentiation were also observed between Dai and Vietnamese (0.0045) and between Vietnamese and Cambodian (0.0062). Similarly, small Fst's were observed among different presumed Han Chinese populations originating in different regions of mainland of China and Taiwan (Fst's less than 0.0025 with CHB). For PCA, the first two PC's showed a pattern of relationships that closely followed the geographic distribution of the different East Asian populations. PCA showed substructure both between different East Asian groups and within the Han Chinese population. These studies have also identified a subset of East Asian substructure ancestry informative markers (EASTASAIMS) that may be useful for future complex genetic disease association studies in reducing type 1 errors and in identifying homogeneous groups that may increase the power of such studies.
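For readers unfamiliar with Fst, here is a rough sketch of what the statistic measures. This is the simple textbook form, (HT - HS)/HT, for two equally weighted populations. It is not the Weir and Cockerham estimator used in the paper, which also corrects for sample size, and the allele frequencies below are invented.

```python
def fst_two_populations(freqs_a, freqs_b):
    """Wright's Fst from per-SNP allele frequencies in two populations.

    Uses a ratio of sums across SNPs: sum(HT - HS) / sum(HT), weighting
    the two populations equally. No sample-size correction is applied.
    """
    if len(freqs_a) != len(freqs_b):
        raise ValueError("frequency lists must have the same length")
    numerator = 0.0
    denominator = 0.0
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_bar = (p_a + p_b) / 2.0
        # Heterozygosity expected if the two populations were pooled.
        h_total = 2.0 * p_bar * (1.0 - p_bar)
        # Average heterozygosity expected within each population.
        h_within = (2.0 * p_a * (1.0 - p_a) + 2.0 * p_b * (1.0 - p_b)) / 2.0
        numerator += h_total - h_within
        denominator += h_total
    if denominator == 0.0:
        raise ValueError("no polymorphic SNPs")
    return numerator / denominator


# Hypothetical frequencies for three SNPs in two closely related populations.
print(fst_two_populations([0.40, 0.55, 0.10], [0.45, 0.50, 0.12]))
```

Values near zero, like those reported between CHB and its neighbours, mean that almost all of the variation is found within populations rather than between them.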

Link

October 24, 2008

Genetic structure in Northern Europe with 250K SNPs

A new study on genetic structure in Northern Europeans has appeared in PLoS ONE. Below are the STRUCTURE results for the Northern European populations. At K=2, the two clusters are centered on Eastern Finns and Germans-Brits, with Western Finns being intermediate and Swedes closer to the German-Brit (green) cluster. At K=3 Nordic populations are split into two clusters centering on Swedes and Eastern Finns. The pattern for K=4 is less distinct, except that the German-Brit cluster is split into blue and green components that don't appear to have any population specificity.


Interestingly, the researchers also carried out an analysis of the populations which included the HapMap populations:
When data from HapMap Han Chinese+Japanese and Yoruba individuals was included in the analysis, the MDS plot of IBS formed a triangle of the three continents in the first two dimensions, with the third dimension separating the European populations clinally from each other (Fig. S3). In the histograms of IBS between the five European populations and each HapMap population (Fig. 4a), the studied populations were most similar with the CEU and least similar with YRI. Interestingly, the similarity with the Asians varied between populations, being higher for Eastern Finns, Western Finns and Swedes than for the Germans and British (p less than 10−14 for all comparisons except for GER and BRI whose distributions did not differ). The same pattern was also observed when comparing the allele frequencies in the study populations and in CEU and CHB+JPT: the Eastern Finns had the largest proportion of SNPs deviating towards the Asian frequencies (Table S2; p less than 10−5), also when markers with smallest differences were excluded (data not shown).
They were able to differentiate between the effects of genetic drift and eastern influence by looking at the direction of the divergence:
To study the extent of eastern influence, we counted in each of the five European populations the number of markers where the population's allele frequency and the CHB+JPT allele frequency deviated from the CEU allele frequency to the same direction, and the number of markers where the allele frequencies deviated in opposite directions. We then compared the numbers to the null hypothesis that all the five populations stem from the same proto-European population (approximated by the CEU frequencies) from which they have subsequently diverged via genetic drift in the absence of admixture. In such a case, one would expect the number of markers drifting into a given direction (e.g. towards the Asian frequencies) to be similar across the populations, whereas a varying degree of eastern admixture in each population would result in disparate marker proportions. Using the number of deviating markers instead of the absolute size of the deviations should even out some of the effects of differing extent of drift in the populations.
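The counting procedure described in that passage can be sketched as follows. This is my reading of the description, not the authors' code; the function name and the toy allele frequencies are invented, and the paper's significance testing is not reproduced.

```python
def count_deviation_directions(pop_freqs, asian_freqs, ref_freqs):
    """Count markers deviating from the reference in the same or opposite direction.

    pop_freqs   -- allele frequencies in the study population
    asian_freqs -- allele frequencies in the Asian sample (CHB+JPT in the paper)
    ref_freqs   -- allele frequencies in the reference (CEU in the paper)
    Markers with no deviation in either comparison are skipped.
    """
    same = 0
    opposite = 0
    for f_pop, f_asian, f_ref in zip(pop_freqs, asian_freqs, ref_freqs):
        d_pop = f_pop - f_ref
        d_asian = f_asian - f_ref
        if d_pop == 0 or d_asian == 0:
            continue  # no direction to compare
        if (d_pop > 0) == (d_asian > 0):
            same += 1
        else:
            opposite += 1
    return same, opposite


# Hypothetical frequencies at four markers.
pop = [0.30, 0.52, 0.25, 0.40]
asian = [0.60, 0.70, 0.05, 0.10]
ceu = [0.20, 0.50, 0.20, 0.40]
print(count_deviation_directions(pop, asian, ceu))
```

Under pure drift the proportion of "same direction" markers should be similar across the study populations; an excess in one population is what hints at eastern admixture.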

This parallels my comment on an earlier study:
Under the theory of "drift due to isolation", the Finns might be distant from other Europeans, but not specifically at an East Eurasian direction.
The paper also discusses genetic structure within Finland:
The information about the grandparental birthplaces of the Finnish samples enabled a more detailed analysis of population structure within Finland. In the multidimensional scaling plot of IBS within Finland (Fig. 2c,d, Fig. S1b), the first dimension showed the division to Eastern and Western Finland; the Häme samples settled between the clusters. The second dimension showed a north-south gradient within Eastern and the third dimension within Western Finland. Here the Swedish-speaking Ostrobothnians showed no separation from their Finnish-speaking neighbours, whereas in the MDS plot of the European populations, the Finnish samples closest to the Swedes were almost exclusively Swedish-speakers (data not shown), and in the Structure analysis the Swedish-speaking Finns showed twice as large an admixture with the Sweden-dominated cluster as the other Western Finnish samples did (48.9% versus 24.6%, data not shown).


PLoS ONE doi: 10.1371/journal.pone.0003519

Genome-Wide Analysis of Single Nucleotide Polymorphisms Uncovers Population Structure in Northern Europe

Elina Salmela et al.

Abstract

Background
Genome-wide data provide a powerful tool for inferring patterns of genetic variation and structure of human populations.

Principal Findings
In this study, we analysed almost 250,000 SNPs from a total of 945 samples from Eastern and Western Finland, Sweden, Northern Germany and Great Britain complemented with HapMap data. Small but statistically significant differences were observed between the European populations (FST = 0.0040, p less than 10−4), also between Eastern and Western Finland (FST = 0.0032, p less than 10−3). The latter indicated the existence of a relatively strong autosomal substructure within the country, similar to that observed earlier with smaller numbers of markers. The Germans and British were less differentiated than the Swedes, Western Finns and especially the Eastern Finns who also showed other signs of genetic drift. This is likely caused by the later founding of the northern populations, together with subsequent founder and bottleneck effects, and a smaller population size. Furthermore, our data suggest a small eastern contribution among the Finns, consistent with the historical and linguistic background of the population.

Significance
Our results warn against a priori assumptions of homogeneity among Finns and other seemingly isolated populations. Thus, in association studies in such populations, additional caution for population structure may be necessary. Our results illustrate that population history is often important for patterns of genetic variation, and that the analysis of hundreds of thousands of SNPs provides high resolution also for population genetics.

Link

September 28, 2008

Integrated detection of SNPs and Copy number variation

While SNPs are single-letter changes in the genetic code, copy number variation (CNV) involves the multiplication (or deletion) of entire chunks of DNA. While in a SNP, the allele is a single letter (e.g., C or T), in CNVs, the allele is an integer number of how many copies of the particular chunk of DNA an individual has. What this paper shows is that most human CNVs don't appear to be "fresh" changes but rather old "frozen" changes that are linked to specific SNPs or combinations of SNPs. Practically, this means that a CNV allele can be inferred fairly accurately by looking at SNPs in the region of the chromosome where it occurs.
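As a rough illustration of what "linked to specific SNPs" means, the sketch below computes r², the usual measure of linkage disequilibrium, between a SNP and a two-allele CNV coded on the same set of haplotypes. The 0/1 coding and the toy haplotypes are assumptions of mine for illustration; the paper's actual genotyping and tagging analysis is more involved.

```python
def r_squared(snp_alleles, cnv_alleles):
    """Squared correlation (r^2) between two biallelic sites on phased haplotypes.

    Each list holds one 0/1 value per haplotype: for the SNP, 1 marks one of
    its two alleles; for the CNV, 1 marks (say) the duplicated or deleted form.
    """
    if len(snp_alleles) != len(cnv_alleles) or not snp_alleles:
        raise ValueError("need two non-empty lists of equal length")
    n = len(snp_alleles)
    p_snp = sum(snp_alleles) / n
    p_cnv = sum(cnv_alleles) / n
    # Frequency of haplotypes carrying allele 1 at both sites.
    p_both = sum(1 for s, c in zip(snp_alleles, cnv_alleles) if s == 1 and c == 1) / n
    d = p_both - p_snp * p_cnv
    denominator = p_snp * (1 - p_snp) * p_cnv * (1 - p_cnv)
    if denominator == 0:
        raise ValueError("both sites must be polymorphic")
    return d * d / denominator


# Hypothetical haplotypes: the CNV allele always rides with SNP allele 1.
snp = [1, 1, 0, 0, 1, 0]
cnv = [1, 1, 0, 0, 1, 0]
print(r_squared(snp, cnv))
```

When r² is close to 1, knowing the SNP allele tells you the CNV allele almost without error, which is why a SNP chip can stand in for direct copy-number measurement at such sites.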

Nature Genetics 40, 1166 - 1174 (2008)

Integrated detection and population-genetic analysis of SNPs and copy number variation

Steven A McCarroll et al.

Abstract

Dissecting the genetic basis of disease risk requires measuring all forms of genetic variation, including SNPs and copy number variants (CNVs), and is enabled by accurate maps of their locations, frequencies and population-genetic properties. We designed a hybrid genotyping array (Affymetrix SNP 6.0) to simultaneously measure 906,600 SNPs and copy number at 1.8 million genomic locations. By characterizing 270 HapMap samples, we developed a map of human CNV (at 2-kb breakpoint resolution) informed by integer genotypes for 1,320 copy number polymorphisms (CNPs) that segregate at an allele frequency >1%. More than 80% of the sequence in previously reported CNV regions fell outside our estimated CNV boundaries, indicating that large (>100 kb) CNVs affect much less of the genome than initially reported. Approximately 80% of observed copy number differences between pairs of individuals were due to common CNPs with an allele frequency >5%, and more than 99% derived from inheritance rather than new mutation. Most common, diallelic CNPs were in strong linkage disequilibrium with SNPs, and most low-frequency CNVs segregated on specific SNP haplotypes.

Link

September 27, 2008

More ASHG 2008 abstracts

The previous batch is here.

Analysis of East Asia Genetic Substructure: Population Differentiation and PCA Clusters Correlate with Geographic Distribution
Accounting for genetic substructure within European populations has been important in reducing type 1 errors in genetic studies of complex disease. As efforts to understand complex genetic disease are expanded to other continental populations an understanding of genetic substructure within these continents will be useful in design and execution of association tests. In this study, population differentiation(Fst) and Principal Components Analyses(PCA) are examined using >200K genotypes from multiple populations of East Asian ancestry(total 298 subjects). The population groups included those from the Human Genome Diversity Panel[Cambodian(CAMB), Yi, Daur, Mongolian(MGL), Lahu, Dai, Hezhen, Miaozu, Naxi, Oroqen, She, Tu, Tujia, Naxi, and Xibo], HapMap(CHB and JPT), and East Asian or East Asian American subjects of Vietnamese(VIET), Korean(KOR), Filipino(FIL) and Chinese ancestry. Paired Fst(Weir and Cockerham) showed close relationships between CHB and several large East Asian population groups(CHB/KOR, 0.0019; CHB/JPT, 0.00651; CHB/VIET, 0.0065) with larger separation with FIL(CHB/FIL, 0.014). Low levels of differentiation were also observed between DAI and VIET(0.0045) and between VIET and CAMB(0.0062). Similarly, small Fsts were observed among different presumed Han Chinese populations originating in different regions of mainland of China and Taiwan. For PCA, the first two PCs showed a pattern of relationships that closely followed the geographic distribution of the different East Asian populations. For example, the four corner groups were JPT, FIL, CAMB and MGL with the CHB forming the center group, and KOR was between CHB and JPT. Other small ethnic groups were also in rough geographic correlation with their putative origins.
These studies have also enabled the selection of a subset of East Asian substructure ancestry informative markers(EASTASAIMS) that may be useful for future genetic association studies in reducing type 1 errors and in identifying homogeneous groups.

Worldwide Population Structure using SNP Microarray Genotyping
We genotyped 348 individuals sampled from 24 populations world-wide using the Affymetrix 250k NspI microarray chip. For context, we added matching genotypes from 210 HapMap individuals for a total of 250,823 loci genotyped in 543 individuals from 28 populations. We included populations from India and Daghestan to provide detail between the genetic poles of Western Europe, East Asia, and sub-Sahara Africa. With so many markers, principal components analyses reveal genetic differentiation between almost all identified populations in our sample. Northern and southern European populations (FST = 0.004, p <0.01) are statistically distinguishable, as are upper and lower caste groups in India (FST = 0.005, p <0.01). All individuals are accurately classified into continental groups, and even between closely-related populations, genetic- and self-classifications conflict for only a minority of individuals (e.g. ~2% between upper and lower Indian castes; k-means clustering.) As expected, the HapMap CHB+JPT, CEU, and YRI samples are most similar to our east Asian, west European, and African samples, respectively. The HapMap CEU samples and our northern European ancestry samples were both collected from Utah. Although individual samples cannot be reliably classified into their collection of origin, the groups are statistically distinguishable despite their high similarity (FST = 0.0005, n.s.). Our Japanese group is also statistically distinguishable from the HapMap JPT group (FST = 0.006, p <0.01), and in this comparison, most samples can be correctly classified. With such large numbers of genotypes, significant differences can be found even between very similar population samplings. Our results provide guidelines for researchers in selecting suitable control populations for case-control studies.


Frequency distribution and selection in 4 pigmentation genes in Europe
Pigmentation is one of the more obvious forms of variation in humans, particularly in Europeans where one sees more within group variation in hair and eye pigmentation than in the rest of the world. We studied 4 genes (SLC24A5, SLC45A2, OCA2 and MC1R) that are believed to contribute to the pigment phenotypes in Europeans. SLC24A5 has a single functional variant that leads to lighter skin pigmentation. Data on 83 populations worldwide (including 55 from our lab) show the variant (at rs1426654) has almost reached fixation in Europe, Southwest Asia, and North Africa, has moderate to high frequencies (.2-.9) throughout Central Asia, and has frequencies of .1-.3 in East and South Africa. The variant is essentially absent elsewhere. SLC45A2 also has a single functional variant (at rs16891982) associated with light skin pigmentation in Europe. Data on 84 populations worldwide show the light skin allele is nearly fixed in Northern Europe but has lower frequencies in Southern Europe, the Middle East and Northern Africa. In Central Asia the frequency of the SLC45A2 variant declines more quickly than the SLC24A5 variant. It is absent in both East and South Africa. In OCA2 we typed 4 SNPs (rs4778138, rs4778241, rs7495174, rs12913832) with a haplotype associated with blue eyes in Europeans. This haplotype shows a Southeastern to Northwestern pattern in Europe with frequencies of .25 (.05 homozygous) in the Adygei to .85 (.75 homozygous) in the Danes. In MC1R we typed 5 SNPs (rs3212345, rs3212357, rs3212363, C_25958294_10, rs7191944) that cover the entire MC1R gene and found a predominantly European haplotype that ranges in frequency from .35 to .65 in Europe, reaching its highest levels in Southwest Asia and Northwestern Europe. Extended Haplotype Heterozygosity (EHH) and normalized Haplosimilarity (nHS) show evidence of selection at SLC24A5 in not only our European and Southwest Asian populations but also our East African populations. 
Neither SLC45A2 nor OCA2 showed evidence of selection in either test. MC1R did not show evidence of selection for our European specific haplotype but we did see some evidence both upstream and downstream in our nHS test in Europe.

Using principal components analysis to identify candidate genes for natural selection.
Genetic markers that differentiate populations are excellent candidates for natural selection due to local adaptation, and may shed light into physiological pathways that underlie disorders with varying frequencies around the world. Principal Components Analysis (PCA) has emerged as a powerful tool for the characterization and analysis of the structure of genomewide datasets. In prior work, we described an algorithm that can be used to select small subsets of genetic markers (SNPs) that correlate well with population structure, as captured by PCA. Our method can be used to detect SNPs that differentiate individuals from different geographic regions, or even neighboring subpopulations. We set out to explore the nature and properties of the genes where population-differentiating SNPs reside, by analyzing the publicly available Human Genome Diversity Panel dataset (650,000 SNPs for 1,043 individuals, 51 populations). Applying our SNP selection algorithms, we chose small subsets of SNPs that almost perfectly reproduce worldwide population structure as identified by PCA. We determined SNP panels both for population differentiation within seven geographic regions, as well as around the globe. We then explored the hypothesis that the selected SNPs attained their current worldwide allele frequency patterns as a response to the pressure of natural selection. Comparing our lists to recently published reports, we found a significant overlap with other genomewide scans for selection, thus validating our hypothesis. For example, EDAR (involved in the development of hair follicles) harbors the most differentiating SNPs in our world-wide panels. SNPs located in genes that are involved in skin and eye pigmentation (OCA2, MYO5C, HERC1, HERC2) are also among the top population differentiating markers. 
In East Asia, SNPs residing at the ADH cluster appear among the most important SNPs for population structure, while, in Europe, the same is true for genes that are involved in immune response to pathogens (CR1, DUOX2, TLR, and HLA). Finally, a comprehensive gene ontology analysis is presented.