luntergroup / octopus

Bayesian haplotype-based mutation calling

somatic calling: inconsistent ploidy

jbedo opened this issue · comments

Describe the bug

Calls in the same phase set have different ploidies.

Version

$ octopus --version
octopus version 0.7.4
Target: x86_64 Linux 3.10.0-957.5.1.el7.x86_64
SIMD extension: AVX2
Compiler: GNU 10.3.0
Boost: 1_77

NB: this is dev version compiled from 4ab47e2 after patching for #228.

Command
Command line to install octopus:

$ 

Command line to run octopus:

$ octopus -R ref.fa -I d9fjf5p5m6wr5wzn0jbin35kkbj5m7bb-bionix-samtools-sort.bam f99xifv51srranjbdwc7xn22rsik7g3c-bionix-samtools-sort.bam q0zzgw47s3f68919yx8vgxf63x3f2i5y-bionix-samtools-sort.bam -o /nix/store/dhihczfa8yyl38l9g47zplzmh5dpdckm-bionix-octopus-callSomatic --bamout /nix/store/l7iy3kvmj8qnd5sblcvl4yd4lz1805di-bionix-octopus-callSomatic-evidence --threads=1 --fast --max-genotypes 1000 -t /nix/store/93ib5nh7x30bsvi31j7q9ss69n2my8dw-regions.txt -N blood --debug=debug.log --annotations AF

Additional context

Debug log attached.

debug.log

Thanks for looking into this (sorry again for slow response). Ok so what's happening is that some germline variants get called in haplotype window chr10:133069531-133069536 (2 haplotypes):

[2022-01-20 05:00:55] <DEBG> Called germline variants:
[2022-01-20 05:00:55] <DEBG> chr10:133069531-133069532 C ->  470.515
    chr10:133069533-133069534 G -> C 649.762
    chr10:133069534-133069535 T -> G 470.515
    chr10:133069535-133069536 A -> G 649.762

but then in the next (adjacent) window chr10:133069536-133069539 a somatic gets called (3 haplotypes):

[2022-01-20 05:00:55] <DEBG> Called somatic variants:
[2022-01-20 05:00:55] <DEBG> chr10:133069538-133069539 G -> C 70.6407
[2022-01-20 05:00:55] <DEBG> Called germline variants:
[2022-01-20 05:00:55] <DEBG> chr10:133069536-133069538 GG ->  565.485

The problem is that the germline deletion requires a pad base when converting to VCF, but that's within the previous diploid region. Phase sets cannot overlap so we have a conflict.
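To make the conflict concrete, here is a minimal Python sketch of the padding step, not octopus's actual code. `Region` and `pad_indel` are hypothetical names made up for illustration. It assumes the debug log prints half-open intervals (the one-base SNV is shown as 133069533-133069534), and it takes the pad base `A` from the `A -> G` call at 133069535-133069536 in the log above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Half-open interval [begin, end) on a contig (assumed log convention)."""
    contig: str
    begin: int
    end: int

    def overlaps(self, other: "Region") -> bool:
        return (self.contig == other.contig
                and self.begin < other.end
                and other.begin < self.end)


def pad_indel(region: Region, ref: str, alt: str, pad_base: str):
    """Prepend the preceding reference base when either allele is empty,
    since VCF does not allow empty REF or ALT alleles."""
    if ref and alt:
        return region, ref, alt
    padded = Region(region.contig, region.begin - 1, region.end)
    return padded, pad_base + ref, pad_base + alt


# Windows and calls taken from the debug log excerpts above.
diploid_window = Region("chr10", 133069531, 133069536)
deletion = Region("chr10", 133069536, 133069538)

padded, ref, alt = pad_indel(deletion, "GG", "", pad_base="A")

print(padded, ref, alt)
print("unpadded overlaps diploid window:", deletion.overlaps(diploid_window))
print("padded overlaps diploid window:  ", padded.overlaps(diploid_window))
```

Before padding, the deletion sits entirely in the second window. After padding, it starts one base earlier, inside the window that was called with two haplotypes.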

These windowing artefacts are a real nuisance. The ideal solution is to call the two regions together, as that would ensure consistent 'ploidy'. There is some logic in place to try to avoid this happening, but it's not watertight, particularly when running in --fast mode, as that disables 'lagging' (i.e. there are no overlapping haplotype windows).

Thanks for the explanation, I've verified that removal of --fast fixes this particular region at the cost of a 7x slowdown.

Hmm, curious. I thought I'd try to confirm that lagging is the cause, so I dropped --fast and set --lagging-level none, but it successfully called the region. Setting the other parameters to mimic --fast (i.e., --lagging-level NONE --max-haplotypes 50 --max-genotype-combinations 1000 --max-genotypes 1000) also failed to reproduce the error.

Probably because --fast also turns off local reassembly (i.e. --disable-assembly-candidate-generator), which removes the candidates causing the issue.

I've managed to reproduce this bug on a different sample without --fast, the only non-default flag being --max-genotypes 500. Would an additional debug log from this sample help, or would you need access to the reads?
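For finding other affected regions or samples, a rough Python sketch like the following could scan an uncompressed VCF for phase sets whose calls have different ploidies. It is not part of octopus. `inconsistent_phase_sets` is a hypothetical helper, it assumes each record carries `GT` and `PS` in its FORMAT column, and the records in `toy_vcf` are made up purely to exercise the function.

```python
import re
from collections import defaultdict


def inconsistent_phase_sets(vcf_lines):
    """Return {(sample, contig, phase_set): [ploidies]} for every phase set
    whose calls do not all have the same number of GT alleles."""
    samples = []
    ploidies = defaultdict(set)
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        contig, keys = fields[0], fields[8].split(":")
        for sample, column in zip(samples, fields[9:]):
            values = dict(zip(keys, column.split(":")))
            gt, ps = values.get("GT", "."), values.get("PS", ".")
            if gt == "." or ps == ".":
                continue  # unphased or missing call
            ploidy = len(re.split(r"[|/]", gt))
            ploidies[(sample, contig, ps)].add(ploidy)
    return {key: sorted(found) for key, found in ploidies.items() if len(found) > 1}


# Made-up toy records (not taken from the reported data).
header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "tumour"]
toy_vcf = [
    "##fileformat=VCFv4.3",
    "\t".join(header),
    "\t".join(["chrT", "100", ".", "A", "G", "50", "PASS", ".", "GT:PS", "0|1:100"]),
    "\t".join(["chrT", "101", ".", "AGG", "A", "50", "PASS", ".", "GT:PS", "1|0|0:100"]),
    "\t".join(["chrT", "200", ".", "C", "T", "50", "PASS", ".", "GT:PS", "0|1:200"]),
]

for key, found in inconsistent_phase_sets(toy_vcf).items():
    print(key, found)
```

On a real call set, the same function could be fed the lines of the output VCF, for example via `open(path)` after decompressing.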