brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Error with find-sites

ashotmarg opened this issue · comments

Hi Brent,

Thanks for Somalier, it's quite fast and straightforward to use!
At the moment I am also trying to extract sites more tailored to my dataset with "find-sites". While it worked fine for one of my vcf files, it failed for another, which was produced from a different pipeline with slightly different INFO and FORMAT fields.

For the failed one I am getting the error below: I am running somalier from it's singularity image and this is just a chr21 extract of my main vcf.


$ singularity exec ~/bin/somalier.sif somalier find-sites --min-AN=2 --min-AF=0.2 --snp-dist=10 test.vcf
INFO: Converting SIF file to temporary sandbox...
somalier version: 0.2.15
on chrom:chr21
290 candidate variants
tables.nim(262) []
Error: unhandled exception: key not found: chr21 [KeyError]
INFO: Cleaning up image...


Could you elaborate for which fields in the vcf is "find-sites" looking for? My vcf seems to have the relevant ones such as AC, AF, AN...
E.g., INFO and FORMAT fields:
AC=2;AF=0.333;AN=6;DP=16;ExcessHet=1.0474;FS=0;MLEAC=1;MLEAF=0.167;QD=22;SOR=2.303 GT:AD:DP:GQ:PL

Thanks
Ashot

Hi, thanks for reporting. Will you give it a try with this binary? It will give us more information about exactly where the error is occuring

somalier_dbg.gz

My apologies. I didn't overwrite a previous somalier_dbg I had. Please try this one.
somalier_dbg.gz

Thanks for the prompt feedback! Here is what I get with the whole vcf file.

$ ./somalier_dbg find-sites batch1.Sentieon.biSNPs.vcf.gz
somalier version: 0.2.15
on chrom:chr1
on chrom:chr2
on chrom:chr3
on chrom:chr4
on chrom:chr5
on chrom:chr6
on chrom:chr7
on chrom:chr8
on chrom:chr9
on chrom:chr10
on chrom:chr11
on chrom:chr12
on chrom:chr13
on chrom:chr14
on chrom:chr15
on chrom:chr16
on chrom:chr17
on chrom:chr18
on chrom:chr19
on chrom:chr20
on chrom:chr21
on chrom:chr22
on chrom:chrX
on chrom:chrY
on chrom:chrM
160251 candidate variants
/home/brentp/src/somalier/src/somalier.nim(266) somalier
/home/brentp/src/somalier/src/somalier.nim(253) main
/home/brentp/src/somalier/src/somalierpkg/findsites.nim(249) findsites_main
/nim/lib/pure/collections/tables.nim(858) []
/nim/lib/pure/collections/tables.nim(262) []
Error: unhandled exception: key not found: chrX [KeyError]


Thanks in advance!

I see the problem. It is fixed in this binary
somalier_dbg2.gz

I was assuming that the table would have entries for each chromosome which is obviously not going to happen in every case, including yours.

So now the error message is gone but not sure it's working properly, most likely these things were also present in my original run, I just didn't check thoroughly. I seem to have SNPs which are not far (10000 bp) apart from each other after giving "--snp-dist=10000", there are many cases like this below. In addition I don't have any SNPs from e.g., chr8, and I know that there are many SNPs matching the parameters --min-AN=1 --min-AF=0.4 --snp-dist=10000
Also, in the sites.vcf.gz the program is still giving output from chrX/Y as well but they are not filtered...

chr7 4620726 . A C 100 . AC=3;AF=0.5
chr7 4620753 . T G 100 . AC=3;AF=0.5

chr7 4625241 . A G 100 . AC=3;AF=0.5
chrX 2781514 . C A 100 . AC=3;AF=0.5
chrX 2781635 . G A 100 . AC=1;AF=0.167
chrX 2781927 . A G 100 . AC=3;AF=0.5
chrX 2781986 . T C 100 . AC=3;AF=0.5
chrX 2782161 . A G 100 . AC=1;AF=0.167
chrX 2782261 . T A 100 . AC=1;AF=0.167

chrX 2782567 . C T 100 . AC=2;AF=0.333
chrX 2782572 . A G 100 . AC=3;AF=0.5

Here is the run command/output:


$ ./somalier_dbg2 find-sites --min-AN=1 --min-AF=0.4 --snp-dist=10000 batch1.Sentieon.biSNPs.vcf.gz
somalier version: 0.2.15
on chrom:chr1
on chrom:chr2
on chrom:chr3
on chrom:chr4
on chrom:chr5
on chrom:chr6
on chrom:chr7
on chrom:chr8
on chrom:chr9
on chrom:chr10
on chrom:chr11
on chrom:chr12
on chrom:chr13
on chrom:chr14
on chrom:chr15
on chrom:chr16
on chrom:chr17
on chrom:chr18
on chrom:chr19
on chrom:chr20
on chrom:chr21
on chrom:chr22
on chrom:chrX
on chrom:chrY
on chrom:chrM
319120 candidate variants
sorted and filtered to 152805 variants. now dropping INFOs and writing
[somalier] wrote 65535 variants to:sites.vcf.gz


As a more general approach, as an alternative, I can also e.g., filter my vcf file for AC values I choose and then use "plink --indep" to remove linked sites, right?

I will also try to just use a vcf with only autosomes, not sure if the sex chromosomes are interfering with the overall calculations.

Hi, sorry for your trouble.
Don't use plink to remove sites, somalier should do that for you. Also, only the autosomes are used in the relatedness calculations, but the X and Y are used to validate sex.
I see that I had some uncommitted changes in addition to the fix. Those previous changes are causing the problem you see. I think I see the easy fix, but want to try to verify so you're not running my debug cycle for me. If you give me a few hours, I think this will be resolved.

Thanks Brent, no worries and sounds good ;)

OK. I think this will work correctly for you.
I would lower the snp-dist and the min-af if you need to get more sites.
somalier_dbg3.gz

Thanks, we are getting close but probably not fully there yet!
Please see the output below, in short I think we still have all the unfiltered sex chromosomes. I.e., the output should only have ca. 26000 sites, but there are ca. 176000 sites in the sites.vcf.gz. In total we have ca. 150000 sex chromosomes and they are unfiltered.


chrX 2782161 . A G 100 . AC=1;AF=0.167
chrX 2782261 . T A 100 . AC=1;AF=0.167



$ ./somalier_dbg3 find-sites --min-AN=1 --min-AF=0.4 --snp-dist=50000 batch1.Sentieon.biSNPs.vcf.gz
somalier version: 0.2.16
on chrom:chr1
on chrom:chr2
on chrom:chr3
on chrom:chr4
on chrom:chr5
on chrom:chr6
on chrom:chr7
on chrom:chr8
on chrom:chr9
on chrom:chr10
on chrom:chr11
on chrom:chr12
on chrom:chr13
on chrom:chr14
on chrom:chr15
on chrom:chr16
on chrom:chr17
on chrom:chr18
on chrom:chr19
on chrom:chr20
on chrom:chr21
on chrom:chr22
on chrom:chrX
on chrom:chrY
on chrom:chrM
319120 candidate variants
sorted and filtered to 27062 variants. now dropping INFOs and writing
[somalier] wrote 27062 variants to:sites.vcf.gz
[ashmar@i-07-l0003 results.batch1]$ less sites.vcf.gz |grep -v "^#"|wc
176682 1413456 7003099
[ashmar@i-07-l0003 results.batch1]$ less sites.vcf.gz |grep "^chrX"|wc
134008 1072064 5350439
[ashmar@i-07-l0003 results.batch1]$ less sites.vcf.gz |grep "^chrY"|wc
15612 124896 606859
[ashmar@i-07-l0003 results.batch1]$ less sites.vcf.gz |grep "^chr1"|wc
12048 96384 469129


Hi, it's intentionally getting more values on X to evaluate sex. This is too many, in your case, and I'll make a change for that, but it will not affect your relatedness calculations.

I'll update this message:

sorted and filtered to 27062 variants. now dropping INFOs and writing

to indicate autosomal.

Would these changes address your concerns?

The other thing to clarify is that the sex chromosome have different filtering because it's only for quality and linkage doesn't matter.

And here's another binary with those changes in case you want to try it. It only updates the message and caps the number of X and y variants it will use.
somalier_dbg4.gz

Super, thanks a lot Brent for the prompt response, I think it looks great! Yes, the X/Y chromosomes are now also filtered for quality.

Hi Brent, I've been using the binary of the Somalier (somalier_dbg4.gz) that you shared with me (in this thread above), and it's the Somalier's version: 0.2.16. However, I realised that the latest version in the releases and the docker file is actually the v0.2.15. I know that there are only minor changes in the versions, but I was wondering whether you will also push that latest v0.2.16 to docker and github? Thanks, Ashot