brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"

known mouse sites

igordot opened this issue · comments

commented

This is probably more of a question for other users than for the developers. Known polymorphic sites are provided for the human hg19 and hg38 genomes. Is there a version available for mouse? If not, is there somewhere one can get a population VCF with all the info that find-sites needs?

Hi, did you try dbSNP? If that doesn't work, I can try to help update find-sites so that it will.

commented

I tried dbSNP (ftp://ftp.ncbi.nih.gov/snp/organisms/archive/mouse_10090/VCF/00-All.vcf.gz) and EVA RefSNP (ftp://ftp.ebi.ac.uk/pub/databases/eva/rs_releases/release_2/by_species/mus_musculus/GRCm38.p4/GCA_000001635.6_current_ids.vcf.gz).

It runs without errors, but also without results:

somalier version: 0.2.15
on chrom:1
on chrom:10
on chrom:11
on chrom:12
on chrom:13
on chrom:14
on chrom:15
on chrom:16
on chrom:17
on chrom:18
on chrom:19
on chrom:2
on chrom:3
on chrom:4
on chrom:5
on chrom:6
on chrom:7
on chrom:8
on chrom:9
on chrom:MT
on chrom:X
on chrom:Y
0 candidate variants
sorted and filtered to 0 variants. now dropping INFOs and writing
[somalier] wrote 0 variants to:sites.vcf.gz

It looks like find-sites currently requires AF. You could post-process the 00-All.vcf.gz to add AF from the bitfield, as described here: https://www.biostars.org/p/3877/#107953
The other VCF you linked doesn't have AF, encoded or otherwise, so it's probably not a good one to use.

Maybe you could also use this: https://ftp.ncbi.nih.gov/snp/organisms/archive/mouse_10090/VCF/genotype/SC_MOUSE_GENOMES.genotype.vcf.gz
and compute AF from the genotypes on each line.

The relatedness calculation won't work very well without heterozygotes, but you should still see clear clustering of samples based on IBS0 and IBS2.
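
To make the IBS0/IBS2 point concrete, here is a minimal sketch of counting both between two samples. It assumes the usual definitions (IBS0: one sample is hom-ref and the other hom-alt; IBS2: both samples have the same genotype) and genotypes already encoded as ALT allele counts. The function name `ibs_counts` is hypothetical and this is not somalier's own implementation.

```python
def ibs_counts(gts_a, gts_b):
    """Count IBS0 and IBS2 sites between two samples.

    Genotypes are ALT allele counts per site: 0 (hom-ref), 1 (het),
    2 (hom-alt), or None when the genotype is unknown.
    """
    ibs0 = ibs2 = 0
    for a, b in zip(gts_a, gts_b):
        if a is None or b is None:
            continue  # skip sites where either sample has no call
        if a == b:
            ibs2 += 1  # identical genotype
        elif abs(a - b) == 2:
            ibs0 += 1  # opposite homozygotes share no allele
    return ibs0, ibs2


# Two mostly homozygous samples that differ at two sites.
sample_a = [0, 2, 2, 0, 1, None]
sample_b = [0, 2, 0, 2, 0, 2]
print(ibs_counts(sample_a, sample_b))
```

With largely homozygous samples, two samples from the same strain should have IBS0 close to zero, while samples from different strains should accumulate many IBS0 sites, which is what makes the clustering visible.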

commented

Thank you for following up. That's an interesting idea about combining with the genotypes VCF.

I don't usually work with VCFs, let alone write them. Do you know if there is an easier way of doing that than parsing each line manually?
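
For reference, parsing each line manually does not take much code. The following is a rough sketch using only the Python standard library, assuming the genotype VCF has standard sample columns with GT as the first FORMAT key. The function names (`alt_allele_freqs`, `add_af`, `annotate`) are hypothetical, and whether the resulting AF values suit find-sites has not been verified here.

```python
import gzip


def alt_allele_freqs(alt_field, sample_fields):
    """Return one frequency per ALT allele, or None if no genotype is called."""
    n_alt = len(alt_field.split(","))
    counts = [0] * n_alt
    total = 0
    for field in sample_fields:
        gt = field.split(":", 1)[0]  # assumes GT is the first FORMAT key
        for allele in gt.replace("|", "/").split("/"):
            if not allele.isdigit():
                continue  # missing allele such as "."
            total += 1
            idx = int(allele)
            if 1 <= idx <= n_alt:
                counts[idx - 1] += 1
    if total == 0:
        return None
    return [c / total for c in counts]


def add_af(line):
    """Return a VCF data line with AF appended to the INFO column."""
    cols = line.rstrip("\n").split("\t")
    freqs = alt_allele_freqs(cols[4], cols[9:])
    if freqs is None:
        return line  # no called genotypes, leave the line unchanged
    af = "AF=" + ",".join(f"{f:.4g}" for f in freqs)
    cols[7] = af if cols[7] in (".", "") else cols[7] + ";" + af
    return "\t".join(cols) + "\n"


def annotate(in_path, out_path):
    """Copy a gzipped VCF, declaring AF in the header and adding it per line."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line in src:
            if line.startswith("##"):
                dst.write(line)
            elif line.startswith("#"):
                # Declare AF just before the #CHROM column header line.
                dst.write('##INFO=<ID=AF,Number=A,Type=Float,'
                          'Description="ALT allele frequency from genotypes">\n')
                dst.write(line)
            else:
                dst.write(add_af(line))
```

Calling `annotate("genotypes.vcf.gz", "genotypes.af.vcf.gz")` (placeholder file names) would write the annotated copy. The output is plain gzip rather than bgzip, so it would need recompressing with bgzip before indexing. The sketch also does not check whether INFO already contains an AF entry. As an alternative to hand parsing, the bcftools `fill-tags` plugin can compute AF from genotypes.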