brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"

Sites file for methylation data

GWW opened this issue

commented

Hi,

I was examining the hg38 sites file provided from the downloads and noticed a few issues when using the file with methylation data.

  1. There are a small number of sites with a cytosine as the reference allele (N = 142). The manuscript mentions that these sites were supposed to be excluded.

  2. There are a number of guanosine reference bases in a CpG context (N = 2,713). These could be problematic because the reverse strand can be methylated.

I am not sure whether these will be problematic for downstream sample matching, or whether I should remove the sites from the VCF file.

Thanks so much

Hi, yes, those are sites that were NOT C reference in hg19, but in hg38 became C reference.
There are only 42 of these:

zgrep -v '^#' sites.hg38.vcf.gz | awk '$4 == "C"' | grep -Pv "(chrX|chrY)"

Note that C alleles on the X and Y chromosomes are not used for relatedness, only for sex QC.

I suspect that it wouldn't make much difference to remove these alleles, but would be interested in your findings.
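For anyone who wants to try the comparison, a filter along the following lines might be a starting point. It is only a sketch run on a made-up four-site VCF: the file names and the `tmpdir` variable are placeholders and nothing here ships with somalier. It keeps header lines and all chrX/chrY sites (mirroring the command above, which leaves them out of the count) and drops autosomal sites whose REF allele is C.

```shell
# Toy stand-in for the sites file (hypothetical data, tab-separated VCF columns).
tmpdir=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tT\t.\t.\t.\n'
  printf 'chr1\t200\t.\tC\tT\t.\t.\t.\n'
  printf 'chr2\t300\t.\tG\tA\t.\t.\t.\n'
  printf 'chrX\t400\t.\tC\tA\t.\t.\t.\n'
} > "$tmpdir/toy_sites.vcf"

# Keep header lines, every chrX/chrY site, and autosomal sites whose REF is not C.
awk -F'\t' '/^#/ || $1 ~ /^chr[XY]$/ || $4 != "C"' \
  "$tmpdir/toy_sites.vcf" > "$tmpdir/toy_sites.noC.vcf"

# Count the remaining (non-header) sites.
kept=$(grep -vc '^#' "$tmpdir/toy_sites.noC.vcf")
echo "sites kept: $kept"
```

On the real file the same awk filter could presumably be fed from `zcat sites.hg38.vcf.gz`. Whether the chrX/chrY C sites should also be dropped for methylation data is left open here.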

> There are a bunch of guanosine bases in a CpG context (N = 2,713), these could be problematic as the reverse strand can be methylated.

I don't think this is an issue as the evaluation is always done on the forward strand, but again, would be interested to note the differences if you try this on your data.
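To compare results with and without those sites, they first have to be identified: a G reference site is in a CpG context when the reference base immediately before it is C. The sketch below illustrates that check on a made-up nine-base sequence and a made-up site list (`toy_seq` and `sites` are hypothetical). With a real genome, the preceding base would have to be looked up in the reference FASTA, for example with `samtools faidx`.

```shell
# Toy single-contig reference (hypothetical); positions are 1-based.
# pos: 1=A 2=A 3=C 4=G 5=T 6=A 7=G 8=G 9=A
toy_seq="AACGTAGGA"

# Toy sites as "POS REF" pairs (hypothetical).
sites="4 G
5 T
7 G
8 G"

# Report G-reference sites whose preceding reference base is C (CpG context).
cpg_g_sites=$(printf '%s\n' "$sites" | awk -v seq="$toy_seq" '
  $2 == "G" && $1 > 1 && substr(seq, $1 - 1, 1) == "C" { print $1 }
')
echo "G sites in CpG context: $cpg_g_sites"
```

Only position 4 follows a C in this toy sequence, so it is the only site flagged. The G sites at positions 7 and 8 follow A and G respectively and are left alone.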

commented

Thanks so much for the quick reply. I will certainly try both and let you know if I find some differences. I suspect there won't be any if you are only using the forward strand.