Showing posts with label 23andMe. Show all posts

October 20, 2014

Ancestry Composition preprint

This is one of the main ancestry tools of 23andMe, so it is nice to see its methodology described in detail.

bioRxiv http://dx.doi.org/10.1101/010512

Ancestry Composition: A Novel, Efficient Pipeline for Ancestry Deconvolution

Eric Y Durand et al.

Ancestry deconvolution, the task of identifying the ancestral origin of chromosomal segments in admixed individuals, has important implications, from mapping disease genes to identifying candidate loci under natural selection. To date, however, most existing methods for ancestry deconvolution are typically limited to two or three ancestral populations, and cannot resolve contributions from populations related at a sub-continental scale. We describe Ancestry Composition, a modular three-stage pipeline that efficiently and accurately identifies the ancestral origin of chromosomal segments in admixed individuals. It assumes the genotype data have been phased. In the first stage, a support vector machine classifier assigns tentative ancestry labels to short local phased genomic regions. In the second stage, an autoregressive pair hidden Markov model simultaneously corrects phasing errors and produces reconciled local ancestry estimates and confidence scores based on the tentative ancestry labels. In the third stage, confidence estimates are recalibrated using isotonic regression. We compiled a reference panel of almost 10,000 individuals of homogeneous ancestry, derived from a combination of several publicly available datasets and over 8,000 individuals reporting four grandparents with the same country-of-origin from the member database of the personal genetics company, 23andMe, Inc., and excluding outliers identified through principal components analysis (PCA). In cross-validation experiments, Ancestry Composition achieves high precision and recall for labeling chromosomal segments across over 25 different populations worldwide.
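The third stage of the pipeline (recalibrating confidence scores with isotonic regression) can be illustrated with a minimal pool-adjacent-violators (PAV) sketch. This is a toy implementation of the general technique for illustration, not 23andMe's actual code; the function name and data are my own.

```python
def pav(y, w=None):
    """Pool Adjacent Violators: least-squares non-decreasing (isotonic) fit
    to the sequence y, with optional weights w."""
    n = len(y)
    w = w or [1.0] * n
    # each block holds [weighted sum, total weight, number of points]
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi * wi, wi, 1])
        # merge while the previous block's mean exceeds the new block's mean
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            b = blocks.pop()
            blocks[-1][0] += b[0]
            blocks[-1][1] += b[1]
            blocks[-1][2] += b[2]
    fitted = []
    for s, wsum, cnt in blocks:
        fitted.extend([s / wsum] * cnt)
    return fitted
```

To recalibrate, one would sort (raw confidence, was-the-call-correct) pairs by confidence and run PAV on the 0/1 correctness indicators; the fitted values are then calibrated probabilities that are guaranteed to be monotone in the raw scores.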

Link

September 18, 2014

23andMe mega-study on different American groups

It's great to see that the massive dataset of 23andMe was used for a study like this that seeks to capture the landscape of ancestry of different American groups.

First, distribution of ancestry in African Americans:


The higher fraction of African ancestry in the south and of European ancestry in the north shouldn't be very surprising. There are some interesting loci of higher "Native American" ancestry; most African Americans don't seem to have much of this ancestry, but some apparently do.

Second, distribution of ancestry in "Latinos":


To my eye, there seems to be more African ancestry in the eastern parts (presumably from Caribbean-type Latinos?) and more Native American ancestry in the west.

Third, distribution of ancestry in European Americans:


Overall, it seems that relatively few European Americans (less than 5%) have more than 2% either African or Native American ancestry in any of the states, so the breakdown of European ancestry into various subgroups is perhaps more interesting.

The distribution of African ancestry in European and African Americans is also interesting:


The existence of "African Americans" with virtually no African ancestry and of "European Americans" with as much as half African ancestry is probably due to either misreporting or some quite strange self-perception issues. The bulk of the African ancestry in European Americans seems to be in the sub-10% range (equivalent to less than one great-grandparent). Many of these individuals might not even be aware of the existence of such ancestors.

bioRxiv doi: http://dx.doi.org/10.1101/009340

The genetic ancestry of African, Latino, and European Americans across the United States.

Katarzyna Bryc, Eric Durand, J Michael Macpherson, David Reich, Joanna Mountain

Over the past 500 years, North America has been the site of ongoing mixing of Native Americans, European settlers, and Africans brought largely by the Trans-Atlantic slave trade, shaping the early history of what became the United States. We studied the genetic ancestry of 5,269 self-described African Americans, 8,663 Latinos, and 148,789 European Americans who are 23andMe customers and show that the legacy of these historical interactions is visible in the genetic ancestry of present-day Americans. We document pervasive mixed ancestry and asymmetrical male and female ancestry contributions in all groups studied. We show that regional ancestry differences reflect historical events, such as early Spanish colonization, waves of immigration from many regions of Europe, and forced relocation of Native Americans within the US. This study sheds light on the fine-scale differences in ancestry within and across the United States, and informs our understanding of the relationship between racial and ethnic identities and genetic ancestry.

Link

March 04, 2014

Admixture in US populations

An interesting blog post from 23andMe:
In an update to that work, our researcher Kasia Bryc found that about 4 percent of whites have at least 1 percent or more African ancestry.

Although it is a relatively small percentage, the percentage indicates that an individual with at least 1 percent African ancestry had an African ancestor within the last six generations, or in the last 200 years. This data also suggests that individuals with mixed parentage at some point were absorbed into the white population.

Looking a little more deeply into the data, Kasia also found that the percentage of whites with hidden African ancestry differed significantly from state-to-state. Southern states with the highest African American populations tended to have the highest percentages of hidden African ancestry. In South Carolina at least 13 percent of self-identified whites have 1 percent or more African ancestry, while in Louisiana the number is a little more than 12 percent. In Georgia and Alabama the number is about 9 percent. The differences perhaps point to different social and cultural histories within the south.
and:
Previous published studies estimate that on average African Americans had about 82 percent African ancestry and about 18 percent European ancestry. But in self-identified African Americans in 23andMe’s database, Kasia found the average amount of African ancestry was closer to 73 percent.
I don't think that is necessarily the average percentage in the general African American population as the subset of African Americans who take 23andMe tests may not be representative (e.g., it may come more from cities where African Americans may have more opportunity to admix with European Americans).

and:
On average Latinos had about 70 percent European ancestry, 14 percent Native American ancestry and 6 percent African ancestry. The remainder ancestry is difficult to assign because the DNA is either shared by a number of different populations around the world, or because it’s from understudied populations, such as Native Americans. Obviously that large “unassigned” percentage means that those “averages” could be higher. As with African Americans, looking at the regional and state-to-state numbers for self-identified Latinos, the differences are striking. 
...

For example, some Latinos have no discernible Native American ancestry, while others have as much as 50 percent Native American ancestry. Latinos in states in the Southwest, bordering Mexico — New Mexico, Texas, California and Arizona — have the greatest percentage of Native American ancestry. Latinos in states with the largest proportion of African Americans in their population — South Carolina, Louisiana and Alabama — have the highest percentage of African ancestry.
23andMe may have a couple of orders of magnitude more sampled individuals than anything that appears in most published studies and it's great to see this being put to good use.

It'd be great if someone at 23andMe did some more analyses over their huge database. I can only imagine what a flashPCA with half a million individuals from around the world would look like; even if it told us nothing new about human history it would be quite a cool picture to look at.

November 06, 2013

Dealing with false positive IBD segments

False positive IBD segments are a real problem for those who wish to use genotype data to establish family connections with distant relatives. Traditionally, this involves finding shared IBD segments, and then comparing genealogies to find potential common ancestors from whom these segments could be inherited. IBD is also used in population genetics (e.g., Coop & Ralph 2013). There is an obvious tradeoff, since sloppy IBD detection may enable more genealogical links to be established but adds to the burden of establishing the validity of these links (the infamous "ignoring contact requests from potential genetic cousins" issue). It would be nice if this technology found its way to the end users who stand to benefit most from it.

arXiv:1311.1120 [q-bio.PE]

Reducing pervasive false positive identical-by-descent segments detected by large-scale pedigree analysis

Eric Y. Durand, Nicholas Eriksson, Cory Y. McLean

(Submitted on 5 Nov 2013)

Analysis of genomic segments shared identical-by-descent (IBD) between individuals is fundamental to many genetic applications, but IBD detection accuracy in non-simulated data is largely unknown. Using 25,432 genotyped European individuals, and exploiting known familial relationships in 2,952 father-mother-child trios contained therein, we identify a false positive rate over 67% for short (2-4 centiMorgan) segments. We introduce a novel, computationally-efficient, haplotype-based metric that enables accurate IBD detection on population-scale datasets.

Link

January 26, 2013

Ancestry Composition to be fixed

From the explanation at the relevant thread:
Ancestry Composition (AC) works by learning (training) a set of useful features from reference individuals with known ancestry (the training set) and then using these features to predict the ancestry of our customers.

Our set of reference individuals consists in part of customers who reported their 4 grandparents were born in the same country. Remember that we also remove the outliers, or people whose genetic ancestry doesn't match their survey answers. From this set, AC learns to associate certain haplotypes with their geographical origin. AC is then able to recognize similar haplotypes and thus to predict the ancestry of other customers.

However, when predicting the ancestry of reference individuals, AC suffers from overfitting, a problem common to many supervised learning methods. As a consequence, AC predicts the ancestry of most reference individuals as being 100% from their grandparents’ birthplace.

We addressed this issue using a method inspired from cross-validation. We divided the training set into 5 folds, each containing 20% of the reference individuals. We then trained 5 AC models in which each fold in turn is excluded from the set of reference individuals. So each of these models is learned using 80% of the reference individuals. Additionally, we retain the model that was trained using all the reference individuals. From this process, we end up with 6 different models from which we can predict the ancestry of our customers.

Now, when predicting the ancestry of a customer, we start by figuring out if he/she is a reference individual. If yes, we identify the fold in which the customer belongs, and we use the corresponding model for prediction. If not, we use the fold containing all of the reference data. This way, we ensure that AC was never trained using the haplotypes of the individual it tries to predict.
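The model-selection logic described above can be sketched as follows. The function names, the hashing scheme for fold assignment, and the model indexing are my own assumptions for illustration, not 23andMe's implementation.

```python
import hashlib

N_FOLDS = 5

def assign_fold(sample_id: str) -> int:
    """Deterministically assign a reference individual to one of N_FOLDS folds."""
    digest = hashlib.md5(sample_id.encode()).hexdigest()
    return int(digest, 16) % N_FOLDS

def model_index(sample_id: str, reference_ids: set) -> int:
    """Choose which of the N_FOLDS + 1 trained models to use for prediction.

    Models 0..N_FOLDS-1 were each trained with one fold held out;
    model N_FOLDS was trained on all reference individuals.
    A reference individual gets the model that never saw their haplotypes."""
    if sample_id in reference_ids:
        return assign_fold(sample_id)
    return N_FOLDS
```

The key property is that fold assignment is deterministic, so a reference individual is always routed to the one model whose training set excluded them.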
I had proposed basically the same solution about a month ago, and it's great that the issue is being addressed so soon after it first appeared. If any of the people who had written to me/commented on the topic get their new updated results and want to comment, feel free to do so in this post.

I am not sure how 23andMe plans to handle their Ancestry Composition feature in the future, but I would suggest that they periodically re-update it as they get more samples. According to a recent estimate, there are over 180,000 people in their database at the moment, a fraction of whom meet the twin requirements of: (i) having 4 grandparents from the same country, and (ii) not being an outlier. As this number increases over time, it might be a good idea to occasionally re-partition the sample and re-calculate participants' ancestry composition results.

The fact that they are ready to roll out their updated results so soon after the initial ones tells me that they do have the computing power to do so, and it might be a good idea to update Ancestry Composition periodically, say on a quarterly basis or when a certain increase in the training set (say, 10%) is achieved. Eventually the admixture estimates may stabilize, in which case the way forward may involve rethinking the choice of ancestral populations currently in use.

December 10, 2012

How to fix 23andMe's Ancestry Composition

There have been enough public reports by now that reinforce my initial suggestion that Ancestry Composition overfits to the training data (aka people with four grandparents from a reference population who filled their ancestry survey). The result of this is that such people get 99-100% of their ancestry assigned to a particular population, and the test essentially returns the customer-supplied population label instead of returning the person's ancestry based on his actual DNA.

Now, this is not a problem for the majority of 23andMe customers, who either don't have 4 grandparents from the same country or have 4 grandparents from a colonial country such as the United States.

But the problem cannot be overlooked for the rest of the 23andMe community: it is significant for people from the non-colonial countries that make up the reference populations. Ironically, the people who are actually making this type of analysis possible (the people who dutifully filled in their ancestry survey) are the ones getting the raw end of the deal.

I have seen talk of people retracting their ancestry survey answers in the hope of getting some accurate results! I don't think that's the way to go, as that would lead to a race to the bottom: people might retract or change their ancestry survey answers in the hope of improving their results, but, if enough people do this, the training sample will be shrunk and distorted, so the results will be worse for everybody!

How to solve the problem

In a world with infinite computing resources and a large number of samples, the problem could be solved optimally by leave-one-out: for each of the N training samples, rebuild the ancestry predictive model using the remaining N-1 samples, and then apply it to the left-out sample.

Naturally, this would have the effect of increasing the computational complexity of ancestry estimation approximately N-fold, so it does not seem practical.

An alternative approach would be to build the model only once (using all N training samples) and incrementally update it for each training individual. This depends on the feasibility of such an incremental update, which would incur a minor cost per individual: adapting the parameters of the model by "virtually" taking out that individual. My suspicion is that it would be extremely difficult to do this type of incremental update for the fairly complex model used by 23andMe in their Ancestry Composition.

So, what would be a practical solution?

Partition the N training samples into G groups, each of which will have N/G individuals. Now, rebuild the model G times, each time using N-N/G individuals, i.e., leaving one group out. Note that the initially proposed solution (i.e., leave-one-out) is a special case of the above with G=N.

The computational cost of this solution will be something less than G times the cost of building the full model with all N training samples. This is due to the fact that you are building the model G times, but over a slightly smaller dataset (of N-N/G individuals).
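To make "something less than G times" concrete: if training cost scales roughly linearly with the number of training samples (an assumption on my part; the true scaling of 23andMe's pipeline is unknown), each of the G models costs a fraction (G-1)/G of the full model, so the total comes to about G-1 times the full cost:

```python
def retrain_cost_factor(G: int, n_samples: int = 10000) -> float:
    """Total cost of building G leave-one-group-out models, relative to one
    full-model build, assuming cost is linear in training-set size.
    Each of the G models is trained on n_samples - n_samples/G individuals."""
    per_model = (n_samples - n_samples / G) / n_samples  # fraction of full cost
    return G * per_model                                 # = G - 1 under linearity
```

So under the linearity assumption, G=10 groups cost about 9 full builds, not 10.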

Practically, G=10 would be a reasonable number of groups, which would, however, require the model to be built ten times. Whether or not this is practical for 23andMe, I don't know, but since they have to periodically update their model anyway, I think they ought to try this approach. If they already have idle CPU cycles, that's a great way to occupy them, and if they don't, then investing in processing power would be a good idea.

December 07, 2012

23andMe Ancestry Composition

23andMe has launched its new Ancestry Composition feature, the workings of which are summarized (at a very high level) on this page.

I have already received some feedback from customers who also happen to be part of my Dodecad Project and who appear to be perplexed by their results. It is unfortunate that my own rules preclude me from discussing the details of these reports. I encourage people who want to discuss their ancestry composition to do so in the comments.

Without going into details, I would first advise that 23andMe make transparent the way in which 23andMe participants were selected as part of their training data. This is explained in their writeup with the following paragraph:
Most of the reference dataset comes from 23andMe members just like you. When someone tells us that they have four grandparents all born in the same country, and the country isn't a colonial nation like the US, Canada or Australia, they become candidates for inclusion in the reference dataset. We filter out all but one of any set of closely-related people, since they can distort the results. And we remove "outliers," people whose genetic ancestry doesn't seem to match up with their survey answers.
23andMe takes a "birthplace of grandparents" approach rather than an "ethnic origin" approach. This may be reasonable when the two tend to coincide but not appropriate at all when ethnic groups of different origins co-exist in a given territory. Contrary to the implicit belief expressed in the above paragraph, ethnic complexity is not limited to "colonial nations", and an approach that disregards ethnicity, language, and religion, and limits itself to "birthplace of grandparents" is bound to miss it.

The problem with supervised learning is that the end product is only as good as the labels. If the labels aren't good, or they're ambiguous, then you end up with a mess.

Let's take the example of an individual who reports "4 grandparents from Turkey." This may mean anything from a Mesopotamian Kurd within the boundaries of Turkey, a Central Anatolian Turk, a Cappadocian Greek, a Turkocretan, an Armenian from Cilicia, or an ethnic Greek from European Turkey, to a Turkish-speaking Muslim from Skopje or Bulgaria. Some of these may interpret "Turkey" geographically; others ethnically. The label "Turkey" is polysemous: it can be interpreted either geographically or ethnically, and in both senses it has not been time-invariant.

I don't know how 23andMe built their reference populations, but I am ~100% sure that 4 grandparents from Turkey = "Middle Eastern" in their terminology. I am also fairly sure that their "Balkan" sample consists of individuals as different as Croats and Greeks. So what do these meta-population labels mean? Your guess is as good as mine: a balance of samples of different origins and different interpretations of these origins in whatever training set 23andMe assembled.

In my own project, I never include a priori labels of individuals in the inference of ancestral components. I deal with genotypes and individuals, not self-reported ancestral origins and labelled sets of individuals (populations). Components emerge from unsupervised learning over a set of individual genotypes, and it is only a posteriori that labels are assigned to the inferred components, by observation. Indeed, one could forego the assignment of labels altogether!

My amicable advice to 23andMe is to drop supervised learning altogether. It will only get worse as new customers (aka new test data) join in.

December 06, 2012

Sneak peek at new version of 23andMe ancestry analysis via Jeff Probst

You can watch a 9min clip here.

Paternal haplogroup ("traces to France and Germany"):


Anyone care to speculate what that is? The foci in eastern India and absence in parts of NW Europe and the Balkans throw me off.

And what of his maternal haplogroup ("Northern Africa" "pastoralists" "Berbers"):


I would have guessed U6 or M1, but the focus east of the Caspian throws me off again. 23andMe has potentially very large sample sizes, so their frequency maps may be even better than those published in the literature; I'm genuinely curious what this might be.

Various other info: his dad "top 1% Neandertal", no evidence of "Asian" or "Jewish".

Anyway, onto the main course, i.e., the new Ancestry composition:


One thing that I like about this is the assignment of a portion of ancestry to a "Nonspecific Northern European" group, which is a feature I haven't seen before. I am told that this feature will launch very soon, so it will be interesting to see how well it works across many individuals.

April 04, 2012

Cryptic distant relatives are common in genetic samples

Table S1 has some statistics on different populations.

PLoS ONE 7(4): e34267. doi:10.1371/journal.pone.0034267

Cryptic Distant Relatives Are Common in Both Isolated and Cosmopolitan Genetic Samples

Brenna M. Henn et al.

Although a few hundred single nucleotide polymorphisms (SNPs) suffice to infer close familial relationships, high density genome-wide SNP data make possible the inference of more distant relationships such as 2nd to 9th cousinships. In order to characterize the relationship between genetic similarity and degree of kinship given a timeframe of 100–300 years, we analyzed the sharing of DNA inferred to be identical by descent (IBD) in a subset of individuals from the 23andMe customer database (n = 22,757) and from the Human Genome Diversity Panel (HGDP-CEPH, n = 952). With data from 121 populations, we show that the average amount of DNA shared IBD in most ethnolinguistically-defined populations, for example Native American groups, Finns and Ashkenazi Jews, differs from continentally-defined populations by several orders of magnitude. Via extensive pedigree-based simulations, we determined bounds for predicted degrees of relationship given the amount of genomic IBD sharing in both endogamous and ‘unrelated’ population samples. Using these bounds as a guide, we detected tens of thousands of 2nd to 9th degree cousin pairs within a heterogeneous set of 5,000 Europeans. The ubiquity of distant relatives, detected via IBD segments, in both ethnolinguistic populations and in large ‘unrelated’ population samples has important implications for genetic genealogy, forensics and genotype/phenotype mapping studies.

Link

March 16, 2012

Neandertal/Denisovan admixture using PCA

Eric Durand at 23andMe developed a test of Neandertal ancestry based on PCA. The main idea is to map modern human samples onto a PCA space of Chimpanzee, Neandertal, and Denisova, and see if there are any shifts in direction towards the two archaic hominins.

I tried my hand at replicating this experiment using the Harvard HGDP set. In particular, I used the panel4 set of SNPs, ascertained in a San individual. Below is the broad-view PCA:

blue (top left) = Neandertal
green (bottom left) = Denisova
red (right) = Chimp

Now, here is a blowup of the middle portion of the figure, on human populations.

As expected, Eurasians deviate from Africans in a Neandertal direction, while Papuans deviate from Eurasians in a Denisova direction.
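Since a PCA computed from only three genomes (Chimpanzee, Neandertal, Denisova) spans the plane through those three points, the projection of a modern sample onto that PCA space can be sketched geometrically in plain Python. This is my own illustrative reconstruction of the idea, not Durand's code; all function names and the toy coordinates are assumptions.

```python
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def sub(a, b): return [x - y for x, y in zip(a, b)]
def scale(a, s): return [x * s for x in a]

def project_onto_plane(sample, refs):
    """Project a genotype vector onto the plane spanned by three reference
    genomes (e.g. Chimp, Neandertal, Denisova) via Gram-Schmidt.
    Returns 2D coordinates relative to the centroid of the references."""
    centroid = [sum(col) / len(refs) for col in zip(*refs)]
    u = sub(refs[1], refs[0])
    v = sub(refs[2], refs[0])
    # build an orthonormal basis (e1, e2) of the reference plane
    e1 = scale(u, 1 / dot(u, u) ** 0.5)
    v2 = sub(v, scale(e1, dot(v, e1)))
    e2 = scale(v2, 1 / dot(v2, v2) ** 0.5)
    d = sub(sample, centroid)
    return dot(d, e1), dot(d, e2)
```

Shifts of projected human populations towards the Neandertal or Denisova corner of this plane are then what the figures above display.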

December 19, 2011

Neandertal admixture: why I remain skeptical

The announcement by 23andMe of a Neandertal admixture feature of their commercial test gives me an opportunity to revisit the question of Neandertal admixture in general. At the outset, let me state that I'm still on the fence on whether there has been such admixture in Eurasians. The evidence that has appeared since the publication of Green et al. (2010) provides arguments both in favor and against the Neandertal admixture hypothesis.

Let's begin by examining the case for Neandertal introgression into Eurasians. This case boils down to the fact that modern Eurasians are more similar to Neandertals than modern Africans are. If Neandertals were an irrelevant outgroup, this would be an unexpected finding.

The above statement in bold is the fact. But this fact does not admit of only one interpretation, namely Neandertal admixture in the ancestors of Eurasians.

At the very beginning, I suggested that this fact could be explained by archaic admixture in Africans. Both genetics and paleoanthropology have furnished evidence in favor of my idea. It is no longer tenable to propose that Eurasians are shifted towards Neandertals only because of Neandertal admixture: in fact, some of the shift may be due to Africans being shifted away from Neandertals because of admixture with archaic African hominins. Any future work on the issue must take this possibility into account.
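The "more similar to Neandertals" comparison is typically quantified with the D-statistic (the ABBA-BABA test). A minimal counting sketch over biallelic sites coded 0/1 follows; this is my own toy version, whereas the published implementations also compute standard errors via a block jackknife.

```python
def d_statistic(p1, p2, p3, outgroup):
    """ABBA-BABA D-statistic over aligned biallelic sites coded 0/1,
    for the tree (((P1, P2), P3), Outgroup).
    D > 0 suggests excess allele sharing between P2 and P3
    (e.g. Eurasian and Neandertal); D ~ 0 is consistent with P3
    being a clean outgroup to P1 and P2."""
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if a != b and c != o:      # only the informative site patterns
            if b == c:             # ABBA: P2 carries the P3 allele
                abba += 1
            elif a == c:           # BABA: P1 carries the P3 allele
                baba += 1
    return (abba - baba) / (abba + baba) if abba + baba else 0.0
```

With P1 = African, P2 = Eurasian, P3 = Neandertal and chimp as outgroup, a positive D is the observation at issue; the point of the post is that archaic admixture in Africans can push D positive without any Neandertal-to-Eurasian gene flow.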

A different pitfall concerns the direction of gene flow: whether Neandertals donated genes to the Eurasian gene pool or vice versa. Again, I have contended that it is more likely for a successful expanding species to donate to a contracting species, rather than the opposite. However, Green et al. proposed an ingenious argument against that direction of gene flow:

The main idea is the following:

- Yoruba Nigerians are closer to Eurasians than San are.
- If the Neandertal genome is Proto-Eurasian-admixed, then it should be shifted towards Yoruba relative to San
- It does not appear to be, hence, on balance, gene flow was from Neandertals to modern humans, rather than the opposite.

The idea is fleshed out in the supplement of the Green et al. paper. It exploits the fact that modern human populations are not equidistant from each other, to show that an archaic hominin admixed with a subset of modern humans (Eurasians) would not only be shifted towards that population, but would also appear closer to populations close to Eurasians (=Yoruba) than to those that are not (=San).

All this depends, of course, on the idea that the people who interbred with Neandertals were Proto-Eurasians, i.e., a subset of Africans who left the continent and went on to become modern Eurasians.


This idea is not as secure as it formerly appeared to be. The recognition of the real possibility of Out-of-Arabia means that the people who admixed with Neandertals may not have been Proto-Eurasians, but rather undifferentiated Proto-Humans. In other words, they were not necessarily closer to modern Eurasians than to modern Africans, but rather common ancestors of both.

In conclusion:
  • The inference of Neandertal admixture in modern Eurasians in terms of the D-statistic is a simplification that ignores archaic admixture in Africa.
  • The inference of Neandertal-to-modern admixture is based on the assumption that the moderns admixing with Neandertals were already Eurasian-like, but the mounting evidence for a major human expansion Out-of-Arabia may mean that they were not.
Many mysteries about human origins will be solved thanks to the advent of full genome sequencing. Hammer et al. found archaic admixture in Africans using just 61 genomic regions, each about ~20kb in length.

I'm willing to bet that once scientists turn their attention to full genomes, they will have substantial and indisputable evidence of genetic divergence between stretches of human DNA that is simply too deep to be explained in a conventional Out-of-Africa timeframe.

If there was substantial archaic admixture in Africa c. 35ka, according to Hammer et al.'s estimate, and coinciding with the (intrusive?) appearance of Upper Paleolithic modern humans such as Hofmeyr, then full genome sequencing will provide the smoking gun evidence for it. Such an event would simultaneously solve many mysteries about the African population, such as its apparent higher effective population size, greater allele diversity, and recombination rate.

It may very well be that some level of Neandertal admixture will remain part of the story. We shall see.

October 23, 2011

Citizen genetics

Razib posted his analysis of the 23andMe data of a Betsileo individual from Madagascar. This analysis was made possible by a number of different people:
  1. Razib, who took the initiative and carried out the analysis
  2. Scientists who wrote the software used
  3. 23andMe, who sold a product providing genotype data
  4. Donors, who paid for the test
  5. The actual individual who contributed his/her DNA for science
Yesterday, I tweeted in exasperation that Ötzi's genome, which must have been available in at least some sort of draft form since at least the beginning of this year, has been under lock and key, presumably because of the need to make a big splash with the simultaneous Bolzano conference, TV special, likely imminent journal publication, and all the media stories that will follow.

Scientific progress = active brains using tools to examine data and produce new knowledge

In the grand scheme of things it may be a trifle that the Betsileo are Bantu+Malay. But, the fact that ordinary people can band together and produce new knowledge within a few months is anything but a trifle. An equivalent academic study would have taken years: drafting proposals, dealing with funding agencies, getting consent forms, navigating institutional review boards, dealing with bureaucrats, convincing reviewers and editors, and finally producing an article that might end up hidden behind a paywall for the profit of some publishing company.

Would the end product be better? Perhaps, but the whole point of citizen science is that you can do it better, and you can tell the whole world if it's bad.

Citizen science is no longer a sideshow, and traditional science must take her into account, lest she be reduced to a sideshow before long.

There are, of course, things that citizen science can't, or won't do.

It won't do the kind of meaningless navel-gazing work that scientists in some disciplines are able to get away with at the public expense. It won't inveigle in the hope of being printed in the pages of a high-status publication. It won't split and bundle itself into least publishable units. It won't sit in a drawer afraid of having its data and ideas co-opted by the competition.

But, it can't do the type of highly sophisticated, technical and expensive work that requires a concerted interdisciplinary effort and substantial human and monetary resources.

There is plenty of common ground between traditional and citizen science, so let's hope that their noble competition and ability to learn from each other will benefit the lady Science herself.

November 24, 2010

23andMe $99 sale

Just an alert for anyone that might be interested (it will bring me more samples for the Dodecad Project). There doesn't seem to be an official announcement, but I've tested it, and it seems to work as of this writing. It's $99+shipping+1 year of mandatory Personal Genome Service at $5/month.

I won't be ordering myself, however. The price is not bad, but the mandatory $5/month for a year subscription to a Personal Genome Service is not my cup of tea.

I am all for choice, and for people being able to choose what they want. Personally, I want a few hundred thousand SNPs without any of the trimmings.

I don't want some intermediary to provide me with Relative Finder matches only until I stop paying them a fee. In fact, I'm not interested in finding relatives at all; I already know who they are. I am not interested in personalized health reports, because, frankly, the amount of health information you can get from a personal genome test is tiny. And I am not interested in 23andMe's ancestry analysis, because mine and that produced by other dilettantes are more cutting edge.

So, hopefully, a company will realize that there's money to be made by people who want to get their DNA genotyped and interpret it themselves using freeware community resources and networking. Until that happens, feel free to use every available option, including the great 23andMe sale, but count me out.

October 11, 2010

Running EURO-DNA-CALC on GenomesUnzipped

(Last Update Oct 23)

Genomes Unzipped is a new initiative to put the data of personal genomics customers online. It's a great idea, and the data will be quite useful to many people.

I downloaded the available data and ran EURO-DNA-CALC on them. Of course it is meant to be used for European or West Eurasian people, which all of them seem to be.

Here are the results for the 12 people whose data were online as of this writing. In bold are components whose confidence intervals do not intersect 0.


Most of them seem to be of NW European descent as their names suggest, and a couple seem to be partly or significantly of Jewish descent.

If you are one of these people, feel free to write to me or leave a comment to tell me if I'm right or dead wrong!

UPDATE (Oct 14): A more detailed analysis of DBV001 and VXP001 in this post.
UPDATE (Oct 23): A much more detailed analysis of all Genomes Unzipped individuals in the context of western Eurasia.
UPDATE (Nov 1): Joe Pickrell discovers Jewish great-grandparent

October 06, 2010

Eurasian ADMIXTURE (a precursor to Eurasian-DNA-Calc?)

(Last Update: Oct 8; K=7 added)

I took the 540,814 markers from the HGDP dataset that are also included in the 23andMe personal genomics test, and that have less than 1% no-call rate.
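A minimal sketch of this kind of marker filtering, with hypothetical data structures (the post does not describe the actual pipeline): keep only HGDP markers that are also on the 23andMe chip and have a no-call rate under 1%.

```python
# Sketch of the marker-filtering step, with hypothetical data structures
# (the post does not describe the actual pipeline): keep HGDP markers
# that are also on the 23andMe chip and have a no-call rate under 1%.

def filter_markers(hgdp_calls, chip_ids, max_nocall=0.01):
    """hgdp_calls maps marker ID -> list of genotype calls, with '--'
    denoting a no-call; chip_ids is the set of marker IDs on the chip."""
    kept = []
    for rsid, calls in hgdp_calls.items():
        if rsid not in chip_ids:
            continue  # not typed by 23andMe
        nocall_rate = sum(1 for c in calls if c == '--') / len(calls)
        if nocall_rate < max_nocall:
            kept.append(rsid)
    return kept
```

A marker missing from the chip, or with too many failed calls, is silently dropped; only the surviving marker IDs are passed on to the ADMIXTURE run.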

I ran ADMIXTURE on all the West- (and some mainly Caucasoid Central-) Eurasian populations, including Yoruba and Han Chinese to account for non-Caucasoid admixture in parts of Eurasia.

The populations are (left-to-right): Tuscan, North Italian, Sardinian, French, French Basque, Orcadian, Russian, Adygei, Palestinian, Bedouin, Druze, Mozabite, Pathan, Sindhi, Balochi, Brahui, Burusho, Yoruba, Han Chinese.

Here are the admixture proportions corresponding to this experiment:

This seems like a good starting point for the new EURASIAN-DNA-CALC I have in the works.

Relative to the existing EURO-DNA-CALC, doubling the number of ancestral populations (from 3 to 6), and increasing the number of SNPs (by 3 orders of magnitude) introduces some obvious computational problems. I have some ideas on how to resolve them, so stay tuned.
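To illustrate one of the obvious computational problems: a genotype likelihood is a product of one term per SNP, and half a million probabilities multiplied together underflow double precision almost immediately, so the computation has to move to log space. Below is a sketch under a Hardy-Weinberg genotype model (an assumption for illustration, not necessarily what EURO-DNA-CALC uses):

```python
import math

# With ~500,000 SNPs, multiplying per-SNP probabilities underflows
# floating point; summing log-likelihoods instead is the standard fix.
# The Hardy-Weinberg model here is an illustrative assumption.

def log_likelihood(genotype, freqs):
    """genotype: per-SNP counts (0, 1 or 2) of a reference allele.
    freqs: per-SNP frequency of that allele in one candidate population."""
    ll = 0.0
    for g, p in zip(genotype, freqs):
        if g == 2:
            ll += 2 * math.log(p)
        elif g == 0:
            ll += 2 * math.log(1 - p)
        else:  # heterozygote
            ll += math.log(2 * p * (1 - p))
    return ll
```

Doubling the number of candidate populations then just means evaluating this sum under more sets of frequencies; the cost is linear in both SNPs and populations.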

APPENDIX

For the sake of completeness here are the ADMIXTURE runs for K=3 to K=5.

At K=3, the three major races (Caucasoid: green, Mongoloid: red, Negroid: blue) emerge.
At K=4, the Caucasoids are split into West Eurasians (red) and Central Asians (purple).
At K=5, the West Eurasians are split into Europeans (yellow) and West Asians (blue).
PS: I will probably do some ADMIXTURE runs for K=7 and higher in the next few days; the results will be posted in this blog post as an update.

UPDATE (K=7)
The Druze get their own cluster (pink), with an average membership of 65.4% among Druze individuals.

November 02, 2008

23andme's advanced global similarity tool

UPDATE: I am told that this tool is currently in alpha version, so it's not clear when it will be fully ready for 23andme customers. As per my comments below, I think this is a great initiative to tie individual customers' genetic data to the many new genetic studies showing genomic-geographic correlations. I am sure that 23andme's blog, the Spittoon, will cover this when it is ready for public release, including any features that I may have overlooked. I will be following this story closely. [end update]

23andme has added a new advanced global similarity tool to their website (you need to register in order to play with it). This tool places a customer, as well as other customers he is "connected" with, on a map of the first two principal components, like the ones recently published in several papers.

The tool allows one to look at the PC map at the global, continental, or subcontinental level.


This is quite useful, and a right step in the direction I pointed out earlier. However, there are some points of criticism.
  • The axes are labeled North/South Migration and East/West Migration. While the pattern in the first two principal components does correspond roughly with longitude and latitude, it is erroneous to label these principal components as "North/South" and "East/West". It is even more erroneous to label them as "Migration", since a geographical cline is not necessarily produced by a migration event.
  • The "Take a Tour" feature presents a simplistic and misleading account of human prehistory in terms of "migrations". This account is a simple branching pattern, e.g., Africa -> Near East -> Europe, or Africa -> Near East -> Central Asia -> East Asia. The observed pattern did not emerge in this manner. For example, Central Asian people such as the Uyghur are intermediate between Western Eurasians (Caucasoids) and Eastern Eurasians (Mongoloids) because of a later admixture event; they can't be thought of as "ancestors" of the East Eurasians.
  • Partitioning human variation into this hierarchical set of groups is not the best way to satisfy customers' needs. For example, a Hispanic person may wish to see himself on a PC map which includes "Southern European" and "Native American" groups, an African American person may wish to see himself on a PC map which includes "Northern European" and "West African" groups, an Ethiopian on a Sub-Saharan/Near Eastern map, and a European Jew on a European/Near Eastern map. Of course, there is a combinatorial number of possible combinations, but there is no reason why some of the more common ones (customer feedback may play a role here) may not be supported.
  • Why should this tool be limited to the first two principal components? Of course, additional components do not have such a strong geographical correspondence, but they -nonetheless- will separate populations in different ways, and allow individuals to place themselves more fully in context.
  • The tool could offer much more information. On mouse hover over an individual, a small label identifying it (e.g., origin and HGDP code) and listing its PC coordinates could appear. This is especially useful for power users. A pretty, uncluttered picture is no substitute for as much information as possible.

September 16, 2008

Why I won't be testing with 23andMe (yet)

Since the announcement of the price reduction of the 23andMe service, I had a couple of readers e-mail me, asking whether I thought it was worth it. My response was to do their own research to figure out whether the information they would learn would be worth the asking price.

Here is why, despite the much lowered price, I am not convinced.

What is a 23andMe test worth to you?

The 23andMe test gives you four types of information: the frivolous, the obvious, the medical, and the ancestral.

The frivolous kind is perhaps a good and safe topic for conversation (as opposed, e.g., to schizophrenia), but its value is obviously not $400.

The obvious kind tells you something that you already know about yourself; it only confirms that you have the genes for your known phenotype. Again, this is not useful information, unless perhaps your spouse orders a test too: then, you could sometimes calculate the chances of your children presenting a phenotype for simple quasi-Mendelian traits. 
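For a simple quasi-Mendelian trait, the calculation mentioned above is just a Punnett square. A toy sketch (the 'A'/'a' locus is hypothetical):

```python
from fractions import Fraction
from itertools import product

# Toy Punnett-square calculation for a hypothetical single-locus trait.

def child_genotype_probs(parent1, parent2):
    """Each parent is a two-character genotype, e.g. 'Aa'. Returns a
    dict mapping child genotype (alleles sorted) to its probability,
    assuming each parent transmits either allele with probability 1/2."""
    probs = {}
    for a1, a2 in product(parent1, parent2):
        child = ''.join(sorted(a1 + a2))
        probs[child] = probs.get(child, Fraction(0)) + Fraction(1, 4)
    return probs
```

Two 'Aa' carriers, for instance, have a 1-in-4 chance of an 'aa' child.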

The medical kind is potentially useful. It is for this kind of information that you should research the traits on offer, to see if (i) they interest you, and (ii) there are strong associations for them. The truth, however, as I pointed out recently, is that for all their hype, genome-wide association studies have produced meagre results. Your family medical history and lifestyle are much more likely to affect your health than the common variants genotyped in the current generation of microarrays. But, by all means, do your own research, since a trait that interests you may be one for which strong associations have been found.

Finally, the ancestral information is potentially useful, since recent papers have shown that multi-100K SNP genome-wide testing can produce fine-scale ancestral inference, even in relatively homogeneous populations such as Europeans. However, as of this writing, you won't get this type of information, but rather a simple 4-group global admixture estimate, as well as a measure of your similarity to the Human Genome Diversity Panel populations. For anyone whose known genealogical ancestry isn't complex, this information will not be novel.

23andMe have also apparently included some fairly detailed SNP typing that places you in the Y-chromosome (if you are male) and mtDNA phylogeny, giving you a detailed haplogroup assessment. Even if this information wasn't known to me, this part of the test would not be tempting, since haplogroup assignment reveals only very coarse information. Almost certainly, you would want to follow up with a Y-STR panel or mtDNA full-genome sequence, which would allow you to meaningfully compare yourself with the many others that have tested so far; but that is not an option with 23andMe.

What is a 23andMe test worth to 23andMe?

23andMe have definitely emphasized the "coolness" factor of getting DNA tested. The test has generated a lot of publicity, and getting tested is marketed as a social prestige marker.

However, the "coolness" factor isn't enough to sell tests. It is not known how many have been sold so far, but there are indications that "not many" is close to the truth:
  • No one cuts the price by 60% if there is solid demand for a product. Prices for Y-STR tests haven't dropped by nearly as much in 5+ years, since there is a real demand for them by genetic genealogists.
  • People have a Gattaca-view of genetic testing, and are afraid of the information it may reveal. It has been said that 23andMe has downplayed the significance of the associations they report to keep regulators off their backs, but a more significant reason may be to entice reluctant consumers to test.

One would think that a company selling genetic tests would stress the importance of the genetic information, but this is clearly not the case. To its credit, 23andMe doesn't oversell the actual science, which as I mentioned above, isn't all that revealing. 

This NY Times story gives a useful hint:
Ms. Wojcicki and Linda Avey, the company’s other founder, say their chief goal is to advance science by compiling a database of genetic information that medical researchers can tap (while protecting customers’ anonymity). Customers cannot opt out of having their information anonymously shared, but they can refuse to participate in surveys focusing on specific traits.
This is also why the Coriell Institute is offering genotyping for free.

It is simply a problem of body count. Progress has been hampered by small sample sizes. Collecting and genotyping large samples is hampered both by the strict ethics rules that publicly funded medical researchers must adhere to, and by the actual cost of sample collection and genotyping.

23andMe's goal isn't to be a service provider to consumers. Their goal is to entice regular people to pay the cost of genotyping, while at the same time helping it build its huge "database of genetic information."

Just as a TV station doesn't exist for providing news and entertainment to an audience, but provides news and entertainment so it can sell ads and make money, so 23andMe doesn't exist to sell genetic information to consumers, but sells genetic information to build and exploit its genetic database.

What it would take to make me a customer

There is nothing wrong with 23andMe's strategy: its goal is to advance science (and make money doing it) by offering genetic information to consumers. This is a potentially win-win situation for both. However, as I explained above, the information one can expect from the test today isn't all that special.

Since I don't want to be a naysayer, here is what it would take to convince me to test with 23andMe:

-- Fine-scale ancestry analysis --

The science has progressed to a point where this is possible. We now know that distinct subgroups can be discerned within all the major continental races. The latest research revealed an additional insight: it doesn't take large samples to place individuals on the map fairly accurately. Even countries with fewer than 5 sampled individuals found their way to the "correct" geographical spot.

The corollary of the above observation is that a global-level sampling of human genetic diversity can be done at a reasonable cost. Leveraging existing samples, and adding mini-samples from many unstudied locations, can lead to powerful ancestry analysis, which can be gradually and continuously refined as more samples are collected (both in the field and using customers of unmixed localized heritage).
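Part of the reason small reference samples suffice is that placing a new individual on an existing PC map is cheap: once the SNP loadings of the first two components have been computed from the reference panel, projecting anyone onto the map is just two dot products. A toy sketch (the mean and loadings are made-up values standing in for quantities computed once from the reference samples):

```python
# Toy sketch: placing an individual on an existing PC map. The mean and
# PC loadings would in practice come from PCA on the reference panel;
# the values used in any call here are made up for illustration.

def project(genotype, mean, pc1_loadings, pc2_loadings):
    """Project a 0/1/2 allele-count vector onto two precomputed
    principal components: center against the reference-panel mean,
    then take one dot product per component."""
    centered = [g - m for g, m in zip(genotype, mean)]
    x = sum(c * w for c, w in zip(centered, pc1_loadings))
    y = sum(c * w for c, w in zip(centered, pc2_loadings))
    return (x, y)
```

Crucially, adding a new individual does not require re-running PCA, which is why mini-samples from unstudied locations can be folded in gradually.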

23andMe's partnership with ancestry.com suggests that they take this segment of their market seriously. So, when and if they offer a test which provides ancestral information beyond the obvious, I will be tempted to test with them.

September 09, 2008

Price reduction of 23andme service

The Spittoon blogs about a great reduction in the price of the 23andme service.
With the introduction of v2, our next-generation analytical platform, 23andMe customers will have access to an even more powerful set of SNPs we use to probe their unique genetic composition. And thanks to advances by Illumina, the provider of our genetic analysis technology, that information will now be available at the reduced price of $399.
I am skeptical that Illumina has made any great "advances" in the less than one year since the launch of the 23andme service. From the press release:
The price reduction is largely made possible following technological advancements by Illumina — a leading provider of genetic analysis technologies — to its DNA Analysis Beadchips, which 23andMe uses to genotype customers. The new Beadchip, called the HumanHap550-Quad+, makes use of a four-sample format. 23andMe also has added improved custom content to the new Beadchip, which will include a broader range of Single Nucleotide Polymorphism (SNP) variations and rare mutations not found on the previous Beadchip, thereby providing more relevant data on published associations, as well as maternal and paternal ancestry.
I think that this is part of a strategy (by either Illumina to sell more chips, or 23andme to sell more tests, or both). The previous $1,000 price tag made the test pretty much a luxury item, as it reveals generally mild associations, which are important at the population level, but not so much for the individual. The new pricing probably aims to tempt more "regular people" to the personal genomics market.

23andme has put a lot of emphasis on 23andwe, which aims to discover new associations in the company's customers by correlating the results of "customer-surveys" with the genomic data. 23andwe is probably very important in the long-term as a source of revenue, since the discovered associations may have commercial utility. But, to discover them, the company needs to expand its customer base, so that statistically significant associations will emerge.

Market economics dictates that more customers will now opt for the test, and I hope some of them will try my EURO-DNA-CALC.

June 21, 2008

EURO-DNA-CALC 1.1 released

UPDATE (20 Feb 2011): EURO-DNA-CALC has been superseded by the Dodecad Ancestry Project.

The current version is 1.1.1. Download it if you obtained the previous version (1.1), which had a bug for deCODEme data.

A new version of the EURO-DNA-CALC has been released. The previous version (1.0) was described here.

Like version 1.0, version 1.1 uses deCODEme/23andMe data for an individual, but it now outputs a maximum likelihood estimate of his admixture proportions from three groups: NW European, SE European, and Ashkenazi Jewish.

More comments on this dna-forums thread (registration required).

Below are the admixture estimates for Greg and Lily Mendel of 23andMe. An examination of the confidence intervals output by the program suggests that non-NW European admixture in Greg is a strong possibility, while Lily could conceivably have none. On the other hand, the imbalance between Lily's 10% Ashkenazi and only 1% SE European admixture is interesting. As with all genetic tools, the results should be interpreted in the light of other information about a person's ethnic origin and genealogy.

June 16, 2008

EURO-DNA-CALC 1.0 released

Get the newer version 1.1 here.

I have created a version of my classifier for 23andMe/deCODEme genotype data that I talked about before. It calculates the probability that an individual is "Northwestern European", "Southeastern European", or "Ashkenazi Jewish".

You can download it here.

If you use it, feel free to leave a comment or send me an e-mail.

Comments from some users who tried it can also be found in this dna-forums thread (registration required).

(clarification, June 17): This is not an admixture test, but a "guess his/her origin out of these three groups" test, a classifier. Here is a way to look at it: the classifier compares these three events:

How likely is this genotype to be NW European?
How likely is this genotype to be SE European?
How likely is this genotype to be Ashkenazi Jewish?

An admixture test compares the likelihood of all possible events:

How likely is this genotype to be (x% NWE, y% SEE, z% AJ), for all x,y,z>0 such that x+y+z=100?

Admixture analysis is a harder task than classification, because it looks at a much bigger realm of possibilities (all x,y,z combinations). That is why people have been getting generally accurate assessments of their main ancestry component from various testing services over the years, but sprinkled with widely varying and often puzzling estimates of minor admixture.
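The distinction can be made concrete with a toy numerical example. The allele frequencies below are invented for illustration, and the grid search is only one simple way to maximize over mixing proportions; EURO-DNA-CALC's actual model and optimizer may differ.

```python
import math

# Invented frequencies of one allele at 4 illustrative SNPs, for three
# source populations. Not real data.
FREQS = {
    'NWE': [0.8, 0.7, 0.2, 0.9],
    'SEE': [0.5, 0.4, 0.5, 0.6],
    'AJ':  [0.3, 0.6, 0.7, 0.4],
}

def log_lik(genotype, freqs):
    """Log-likelihood of 0/1/2 allele counts under Hardy-Weinberg."""
    ll = 0.0
    for g, p in zip(genotype, freqs):
        if g == 2:
            ll += 2 * math.log(p)
        elif g == 0:
            ll += 2 * math.log(1 - p)
        else:
            ll += math.log(2 * p * (1 - p))
    return ll

def classify(genotype):
    """The classifier: pick the single best of the three hypotheses."""
    return max(FREQS, key=lambda pop: log_lik(genotype, FREQS[pop]))

def best_admixture(genotype, step=0.05):
    """The admixture test: grid-search all mixing proportions (x, y, z)."""
    pops = list(FREQS)
    best_w, best_ll = None, -math.inf
    n = round(1 / step)
    for i in range(n + 1):
        for j in range(n - i + 1):
            w = [i * step, j * step, 1 - (i + j) * step]
            # Mixed population: per-SNP frequency is the weighted average.
            mixed = [sum(wk * FREQS[p][s] for wk, p in zip(w, pops))
                     for s in range(len(genotype))]
            ll = log_lik(genotype, mixed)
            if ll > best_ll:
                best_w, best_ll = w, ll
    return dict(zip(pops, best_w))
```

The classifier evaluates exactly three likelihoods; the admixture test evaluates one likelihood per point of the (x, y, z) grid, which is why it is both more informative and more fragile for minor components.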