luntergroup / octopus

Bayesian haplotype-based mutation calling

elongated runtimes for deep-sequencing data in cancer calling mode

iserf opened this issue · comments

commented

Hi, I am trying to use octopus for liquid biopsy samples. In this context I have deep (10,000x), targeted (~1.3 Mb panel) sequencing data, preprocessed with deduplication and BQSR (GATK). The coverage of my input BAM files is ~1,800x (tumor) and ~1,000x (normal).

I am using octopus v0.7.4 via the docker image on WSL2.

Command line to run octopus:
octopus --threads \
-R $REFERENCE/Homo_sapiens_assembly38.fasta \
--regions-file $BED \
-I $BAM_DIR/$BAM_T \
-I $BAM_N \
--normal-sample $ID_N \
--allow-octopus-duplicates \
--downsample-above 5000 \
--downsample-target 1000 \
--min-candidate-credible-vaf-probability 0.5 \
--min-somatic-posterior 1.0 \
--min-variant-posterior 1.0 \
--min-expected-somatic-frequency 0.001 \
--min-credible-somatic-frequency 0.001 \
--min-supporting-reads 2 \
--normal-contamination-risk LOW \
--output $OUT_DIR/${ID_T}_octopus_calls.vcf \
--sequence-error-model PCR.NOVASEQ \
--bamout $OUT_DIR/${ID_T}_octopus_calls.realigned.bam \
--somatic-forest /opt/octopus/resources/forests/somatic.v0.7.4.forest \
--keep-unfiltered-calls \
-w $OUT_DIR \
--somatics-only

With these settings, the estimated runtime is ~5 days.

I tried to adjust several settings:
--max-genotypes 10
-X 200GB -B 12GB
--ignore-unmapped-contigs

However, the only change that reduced the runtime (to ~4 h) was limiting the maximum number of haplotypes:
--max-haplotypes 20

This makes sense, since checking the debug log reveals that previously ~150-200 haplotypes were generated for each active region.

Does anyone use Octopus for a similar application and know what a good value for --max-haplotypes is? It would also be good to know the consequences of reducing --max-haplotypes for the sensitivity and specificity of variant calling.

commented

Hi Dan,

Since I am still dealing with long runtimes for the samples and settings described above, I am wondering if you have any recommendations for speeding up the runs?

In the meantime I tried the following things (all without substantial runtime improvements):

  • Skip repeat regions as described in issue #72
  • --bad-region-tolerance LOW
  • Perform chromosome-wise variant calling and merge vcfs afterwards
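
The chromosome-wise approach can be sketched as a dry run. This is only an illustration: the chromosome names assume hg38, the pool of octopus options is abbreviated, and the variables ($REFERENCE, $BAM_DIR, $BAM_T, $BAM_N, $ID_N) are the placeholders used in the command above; bcftools concat then merges the per-chromosome VCFs.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print one octopus command per chromosome, then the merge
# step. Remove the "echo" quoting to actually execute the commands.
gen_cmds() {
  for c in $(seq 1 22) X; do
    echo "octopus -R \$REFERENCE/Homo_sapiens_assembly38.fasta" \
         "-I \$BAM_DIR/\$BAM_T -I \$BAM_N --normal-sample \$ID_N" \
         "--regions chr$c -o calls.chr${c}.vcf.gz"
  done
  # bcftools concat assumes all inputs share the same samples and header
  echo "bcftools concat -Oz -o merged.vcf.gz calls.chr*.vcf.gz"
}
gen_cmds
```

Each per-chromosome job can then be dispatched to its own core (e.g. via a scheduler or xargs -P), which is effectively the same parallelisation idea as splitting by region.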

I am using the Agilent SureSelect Library Prep kit and deduplication with the AGeNT CReaK tool. Are you aware of any problems caused by this deduplication tool?

Also, I am running octopus v0.7.4 from the docker container on WSL2 for Windows. Do you think this could impact variant calling speed, e.g. via thread usage?

thanks a lot in advance for your support!

Hi @iserf, did you solve this problem?

I encountered similar problems. My installed octopus is:

octopus version 0.7.4
Target: x86_64 Linux 5.11.0-1028-azure
SIMD extension: AVX512
Compiler: GNU 10.3.0
Boost: 1_74

I have 20 disease cases and 10 normal cases; each sample was sequenced with WGS (100X). After mapping and calling variants individually, I tried jointly calling germline variants according to the official tutorial (https://luntergroup.github.io/octopus/docs/guides/models/population). The last step is to jointly call the two subsets using only the variants called previously. However, the program has been running for more than 37 days, and the output has not been updated for 25 days.
Here is my running commands and log:

octopus -R $ref_idx --disable-denovo-variant-discovery -I ${bams_arr[@]} -c ${vcf_arr[@]}  -o $pvcfout --threads $nthreads -p X=1 --temp-directory-prefix "octopus_tmp_${group}"  > $logx 2>&1
[2023-07-05 07:38:39] <INFO> ------------------------------------------------------------------------
[2023-07-05 07:38:39] <INFO> octopus v0.7.4
[2023-07-05 07:38:39] <INFO> Copyright (c) 2015-2021 University of Oxford
[2023-07-05 07:38:39] <INFO> ------------------------------------------------------------------------
[2023-07-05 08:12:30] <WARN> The population calling model is still in development. Do not use for production work!
[2023-07-05 08:12:30] <INFO> Done initialising calling components in 33m 50s
[2023-07-05 08:12:30] <INFO> Detected 32 samples: "1A" "2-11A" "2-12A" "2-13A" "2-14A" "2-15A" "2-16A" "2-1A" "2-2A" "2-3A" "2-4A" "2-6A" "2-8A" "2A" "3A" "4A" "5A" "6A" "7A" "8A" "9A" "L2104" "L2109" "L2110" "L262" "L263" "L267" "L276" "L318" "L521" "L524" "L548"
[2023-07-05 08:12:30] <INFO> Invoked calling model: population
[2023-07-05 08:12:30] <INFO> Processing 3,137,454,505bp with 96 threads (96 cores detected)
[2023-07-05 08:12:30] <INFO> Writing filtered calls to "/lustre/XXX/Octopus_outputs/Results.joint.vcf.gz"
[2023-07-05 08:18:06] <INFO> ------------------------------------------------------------------------
[2023-07-05 08:18:06] <INFO>      current      |                   |     time      |     estimated   
[2023-07-05 08:18:06] <INFO>      position     |     completed     |     taken     |     ttc         
[2023-07-05 08:18:06] <INFO> ------------------------------------------------------------------------
[2023-07-05 08:24:31] <WARN> Skipping region 1:1223465-1223949 as there are too many haplotypes
[2023-07-05 08:37:34] <INFO>       1:2891328             0.1%          19m 27s             1w 6d
.....

[2023-07-16 03:59:50] <INFO>     22:34933571            91.3%            1w 3d           24h 45m
[2023-07-16 04:18:59] <INFO>     22:38081163            91.4%            1w 3d           24h 28m
[2023-07-16 04:36:59] <INFO>     22:40872654            91.5%            1w 3d           24h 11m
[2023-07-16 04:54:57] <INFO>     22:44354398            91.6%            1w 3d           23h 54m
[2023-07-16 05:14:18] <INFO>     22:47098969            91.7%            1w 3d           23h 37m
[2023-07-16 05:33:29] <INFO>     22:50682395            91.8%            1w 3d           23h 19m

Could you please help me? @dancooke @chapmanb @bredelings Thanks!!

Hello @iserf and @Zjianglin,

I experienced a similar problem to what you report. What type of panel sequencing do you have?

We have a lot of amplicon sequencing data, and this problem can be partially mitigated by splitting the amplicons into non-overlapping sets using https://github.com/crukci-bioinformatics/ampliconseq. You can then parallelise the computation across these non-overlapping amplicon sets for improved runtime (assuming enough cores are available).

Whether this helps depends on your precise data type: it will work for amplicon sequencing, but not for hybrid-capture libraries.
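
A minimal dry-run sketch of that fan-out, assuming you have already produced per-pool BED files of non-overlapping amplicons (the file names pool1.bed, pool2.bed, pool3.bed are hypothetical, as are the octopus placeholders $REFERENCE, $BAM_T, $BAM_N, $ID_N):

```shell
#!/usr/bin/env bash
# Dry-run sketch: one octopus job per non-overlapping amplicon pool.
# Remove the "echo" quoting to execute; all file names are assumptions.
gen_pool_cmds() {
  for bed in pool1.bed pool2.bed pool3.bed; do
    echo "octopus -R \$REFERENCE/Homo_sapiens_assembly38.fasta" \
         "-I \$BAM_T -I \$BAM_N --normal-sample \$ID_N" \
         "--regions-file $bed -o calls.${bed%.bed}.vcf.gz"
  done
}
# run up to 4 pools concurrently, e.g.:
#   gen_pool_cmds | xargs -P 4 -I {} sh -c '{}'
gen_pool_cmds
```

Because the pools do not overlap, the resulting per-pool VCFs can simply be concatenated afterwards.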