23andMe / yhaplo

Identifying Y-chromosome haplogroups in arbitrarily large samples of sequenced or genotyped men

Updating ISOGG reference

teepean opened this issue

Hello!

I tried updating the ISOGG database to the latest version but got a lot of errors caused by conflicting SNPs. What should be done about those, and how do I determine which one should be used?

An example:

ERROR! Conlicting SNPs:
FGC29577 I2a1b2a 15472184 C->G
Y10712 I2a1b2a 15472184 G->G

In database:

FGC29577	I2a1b2a	Y10718	rs7892998	15472184	C->G
Y10712	I2a1b2a	FGC29614	rs7892998	15472184	G->G

The ISOGG database is a great resource, but it's just a starting point. I imagine it would be a pretty big job to validate the SNPs that have been added since the snapshot currently used by yhaplo. You could use 1000 Genomes data to start. For example, you could identify 1000 Genomes lineages carrying other SNPs in the clade of interest and then assess the allelic distribution for the SNPs of interest. But of course this will only cover SNPs on lineages present in 1000 Genomes.
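As a rough illustration of the "allelic distribution" check described above, here is a minimal sketch. It assumes you have already extracted, for one position, the observed allele per sample and the set of samples known to belong to the clade of interest (from other SNPs). The function name, sample IDs, and alleles are hypothetical toy values, not real 1000 Genomes data.

```python
from collections import Counter


def allele_distribution(alleles, clade_samples):
    """Count alleles at one position, split by clade membership.

    alleles: dict mapping sample ID -> observed allele at the position
    clade_samples: set of sample IDs assigned to the clade via other SNPs
    """
    inside, outside = Counter(), Counter()
    for sample, allele in alleles.items():
        target = inside if sample in clade_samples else outside
        target[allele] += 1
    return inside, outside


# Hypothetical toy data for a single position.
alleles = {"s1": "G", "s2": "G", "s3": "C", "s4": "C", "s5": "C"}
clade = {"s1", "s2"}

inside, outside = allele_distribution(alleles, clade)
print(dict(inside))   # {'G': 2}
print(dict(outside))  # {'C': 3}
```

In this toy case the clade carries only G and everyone else carries C, which is the pattern you would expect for a C->G mutation defining that clade. Real data will be noisier, so treat the counts as evidence to weigh rather than a verdict.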

In your specific example, it's pretty clear that G->G is not a valid "ancestral->derived" mutation, so you could either remove that line from your input file or add a line to a blacklist input file (input/isogg.omit.*.txt).
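If there are many such errors, a small script can triage them before you edit anything by hand. The sketch below is not part of yhaplo. It assumes tab-separated rows laid out like the two quoted in the question, with the position in the second-to-last column and the mutation in the last, and the helper names are made up for illustration. It flags rows whose ancestral and derived alleles are identical and lists positions where rows disagree about the mutation.

```python
import re
from collections import defaultdict

# Matches mutations written like "C->G".
MUTATION_RE = re.compile(r"^([ACGT])->([ACGT])$")


def parse_row(line):
    """Return (name, haplogroup, position, ancestral, derived) for one row."""
    fields = line.rstrip("\n").split("\t")
    name, haplogroup = fields[0], fields[1]
    position, mutation = int(fields[-2]), fields[-1]
    match = MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation: {mutation!r}")
    return name, haplogroup, position, match.group(1), match.group(2)


def find_problems(lines):
    """Return (names with ancestral == derived, {position: conflicting names})."""
    invalid = []
    by_position = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name, _haplogroup, position, ancestral, derived = parse_row(line)
        if ancestral == derived:
            invalid.append(name)
        by_position[position].append((name, ancestral, derived))

    conflicts = {}
    for position, rows in by_position.items():
        # More than one distinct mutation at the same position is a conflict.
        if len({(anc, der) for _, anc, der in rows}) > 1:
            conflicts[position] = [name for name, _, _ in rows]
    return invalid, conflicts


# The two rows from the question.
rows = [
    "FGC29577\tI2a1b2a\tY10718\trs7892998\t15472184\tC->G",
    "Y10712\tI2a1b2a\tFGC29614\trs7892998\t15472184\tG->G",
]
invalid, conflicts = find_problems(rows)
print(invalid)    # ['Y10712']
print(conflicts)  # {15472184: ['FGC29577', 'Y10712']}
```

Rows reported as invalid are the easy cases to drop or blacklist. Conflicts where both rows have distinct, plausible mutations still need to be checked against actual data.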

The ISOGG version yhaplo currently uses should be sufficient to classify haplogroups at a pretty fine granularity. If you need greater resolution for a particular subclade of interest, you could build a phylogeny from your own sequences. See these references for more details:
http://science.sciencemag.org/content/341/6145/562
https://www.nature.com/articles/ng.3559

Hope that helps.