Updating ISOGG reference
teepean opened this issue
Hello!
I tried updating the ISOGG database to the latest version but got a lot of errors caused by conflicting SNPs. What should be done about those, and how do I determine which one should be used?
An example:
ERROR! Conlicting SNPs:
FGC29577 I2a1b2a 15472184 C->G
Y10712 I2a1b2a 15472184 G->G
In database:
FGC29577 I2a1b2a Y10718 rs7892998 15472184 C->G
Y10712 I2a1b2a FGC29614 rs7892998 15472184 G->G
The ISOGG database is a great resource, but it's just a starting point. I imagine it would be a pretty big job to validate the SNPs that have been added since the snapshot currently used by `yhaplo`. You could use 1000 Genomes data to start. For example, you could identify 1000 Genomes lineages carrying other SNPs in the clade of interest and then assess the allelic distribution for the SNPs of interest. But of course this will only cover SNPs on lineages present in 1000 Genomes.
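To make the allelic-distribution check concrete, here is a rough Python sketch. It is not part of `yhaplo`; the function name `allele_distribution`, the sample IDs, and the toy genotypes are all made up for illustration. It assumes you have already extracted one allele call per sample at the position of interest and already know which samples fall in the clade based on other SNPs.

```python
from collections import Counter


def allele_distribution(genotypes, clade_samples):
    """Count alleles inside and outside a clade (hypothetical helper).

    genotypes: dict mapping sample ID -> allele called at the position
    clade_samples: set of sample IDs assigned to the clade via other SNPs
    """
    inside = Counter()
    outside = Counter()
    for sample, allele in genotypes.items():
        if sample in clade_samples:
            inside[allele] += 1
        else:
            outside[allele] += 1
    return inside, outside


# Toy data, invented for illustration only.
genotypes = {"s1": "G", "s2": "G", "s3": "C", "s4": "C", "s5": "C"}
clade_samples = {"s1", "s2"}

inside, outside = allele_distribution(genotypes, clade_samples)
print("inside clade:", dict(inside))
print("outside clade:", dict(outside))
```

If one allele dominates inside the clade and the other dominates outside it, that pattern is consistent with the first being the derived allele for that clade. Mixed counts on either side would suggest the SNP needs a closer look.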
In your specific example, it's pretty clear that `G->G` is not a valid "ancestral->derived" mutation, so you could either remove that line from your input file or add a line to a blacklist input file (`input/isogg.omit.*.txt`).
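If there are many such entries, a small script can flag them for you. The sketch below is only an illustration, not `yhaplo` code: the helper names are hypothetical, and it assumes whitespace-separated lines whose last two fields are the position and the `ancestral->derived` mutation, as in the lines quoted above. What to write to the omit file is left out, since its format is not covered here.

```python
def parse_mutation(line):
    """Split a SNP line into (name, position, ancestral, derived).

    Assumes the last two whitespace-separated fields are the position
    and a mutation string such as "C->G"; the first field is the SNP name.
    """
    fields = line.split()
    name = fields[0]
    position = int(fields[-2])
    ancestral, derived = fields[-1].split("->")
    return name, position, ancestral, derived


def is_valid_mutation(ancestral, derived):
    """A mutation is only meaningful if the two alleles differ."""
    return ancestral != derived


lines = [
    "FGC29577 I2a1b2a 15472184 C->G",
    "Y10712 I2a1b2a 15472184 G->G",
]

# Collect the names of SNPs whose ancestral and derived alleles are identical.
invalid = []
for line in lines:
    name, position, ancestral, derived = parse_mutation(line)
    if not is_valid_mutation(ancestral, derived):
        invalid.append(name)

print(invalid)
```

This only catches the trivially invalid case. When two SNP names at the same position both have plausible mutations, you would still need the kind of validation against sequence data described above.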
The ISOGG version `yhaplo` currently uses should be sufficient to classify haplogroups to a pretty deep granularity. If you need greater resolution for any particular subclade of interest, you could build a phylogeny from your sequences. See these references for more details:
http://science.sciencemag.org/content/341/6145/562
https://www.nature.com/articles/ng.3559
Hope that helps.