brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"


Issue extracting variants with VCF

holtjma opened this issue

Hello,

I was testing somalier on a new beta pipeline for long reads and encountered an issue I can't explain. The data we're using is PacBio HiFi, and we're currently using Sentieon DNAscope for variant calling. This produces a VCF file (specifically, not a gVCF), so I knew we would need to use --unknown for somalier. However, known relationships between GIAB samples weren't matching.

After looking into the logs, we're getting far fewer variants extracted than I would expect:

[somalier] found 1314 sites

In a comparable VCF for the same sample, we get well over 10k extracted.

The sample is passing benchmarking with flying colors, and I did a quick bcftools isec between the VCF and the sites files and got 10850 variants matching, which is approximately what I would expect.
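For anyone wanting to reproduce that kind of overlap check without bcftools, here is a minimal Python sketch. It is a simplified stand-in for `bcftools isec`, not an equivalent: it matches records on exact (chrom, pos, ref, alt) only, does no normalization, and works on uncompressed VCF text. The function names and the toy `SITES` / `SAMPLE` data are hypothetical and only there for illustration.

```python
def load_keys(vcf_text):
    """Collect (chrom, pos, ref, alt) keys from uncompressed VCF text.

    Assumption: records are tab-separated and already normalized;
    multi-allelic ALT values are split into one key per allele.
    """
    keys = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def count_shared(sample_vcf_text, sites_vcf_text):
    """Count keys present in both the sample VCF and the sites VCF."""
    return len(load_keys(sample_vcf_text) & load_keys(sites_vcf_text))


def make_vcf(rows):
    """Build tiny VCF text from (chrom, pos, ref, alt) tuples (toy data only)."""
    lines = ["##fileformat=VCFv4.2", "#CHROM\tPOS\tID\tREF\tALT"]
    lines += ["\t".join([c, str(p), ".", r, a]) for c, p, r, a in rows]
    return "\n".join(lines) + "\n"


# Hypothetical toy inputs, not real somalier sites.
SITES = make_vcf([("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A")])
SAMPLE = make_vcf([("1", 100, "A", "G"), ("1", 200, "C", "G,T"), ("3", 10, "T", "C")])

if __name__ == "__main__":
    print(count_shared(SAMPLE, SITES))
```

A count like this only says how many site alleles appear in the VCF at all; it says nothing about which of them somalier would go on to use.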

Long story short, I'm at a loss as to why somalier seems to be missing a bunch of variants. Happy to share one of the VCF files somehow (I think I have an email from previous issues somewhere...) if that's the best path forward.

One more note: if I run somalier directly on the BAM files, I get the expected results downstream. So it seems to be a disconnect between this specific VCF format and somalier (i.e., I'm pretty sure this is not a sample issue).

Yeah, I think I should document that bam/cram is always preferred. If that's not possible, then GVCF is next best, and if a multi-sample VCF is available, that will work great (for that cohort against itself).
I tried to properly support the case where not enough info is available, but it's just too error-prone.

If it has lower depth at those sites, they could be excluded.
But I just added a note to the readme about this.

Ah okay, if BAM is recommended then I should probably just go with that approach here.

As for the VCF, it seems like it doesn't have an AD field, so I don't think sites would be getting excluded on depth there?
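A quick way to confirm which FORMAT fields a VCF declares is to read the `##FORMAT` header lines. The sketch below does that in plain Python on uncompressed header text. The `format_fields` helper and the `HEADER` sample are hypothetical, and the result only shows what the header declares; it does not tell you how somalier treats a VCF that lacks AD.

```python
import re

# Matches the ID of a FORMAT declaration, e.g. ##FORMAT=<ID=AD,Number=R,...>
FORMAT_ID = re.compile(r"^##FORMAT=<ID=([^,>]+)")


def format_fields(vcf_text):
    """Return the FORMAT IDs declared in the header, in file order."""
    ids = []
    for line in vcf_text.splitlines():
        if line.startswith("#CHROM"):
            break  # the header ends at the column line
        match = FORMAT_ID.match(line)
        if match:
            ids.append(match.group(1))
    return ids


# Hypothetical header for illustration; not taken from the VCF in this issue.
HEADER = "\n".join([
    "##fileformat=VCFv4.2",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">',
    '##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
])

if __name__ == "__main__":
    fields = format_fields(HEADER)
    print(fields)
    print("AD declared:", "AD" in fields)
```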

Closing because of the BAM workaround; the issue with the VCF is unresolved but no longer particularly relevant.