oicr-gsi / strelkaSomatic

Strelka2 somatic variant calling workflow

strelkaSomatic

Strelka variant caller in somatic mode

Overview

Dependencies

Usage

Cromwell

	java -jar cromwell.jar run strelkaSomatic.wdl --inputs inputs.json

Inputs

Required workflow parameters:

Parameter | Type | Description
--- | --- | ---
`tumorBam` | File | Input BAM file with tumor data
`tumorBai` | File | BAM index file for tumor data
`normalBam` | File | Input BAM file with normal data
`normalBai` | File | BAM index file for normal data
`reference` | String | Reference assembly id
`snvsVcfGather.refIndex` | String | .fai index of the reference assembly; determines the ordering by chromosome
`indelsVcfGather.refIndex` | String | .fai index of the reference assembly; determines the ordering by chromosome
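
The required parameters above go into the `inputs.json` file passed to Cromwell. The following is a minimal sketch of generating that file, not part of the workflow itself. It assumes the workflow is named `strelkaSomatic`, since Cromwell expects input keys prefixed with the workflow name. All file paths and the reference id `hg38` are hypothetical placeholders.

```python
import json

WORKFLOW = "strelkaSomatic"  # assumed workflow name, used as the key prefix


def build_inputs(tumor_bam, tumor_bai, normal_bam, normal_bai, reference, ref_index):
    """Return the required workflow inputs keyed the way Cromwell expects."""
    values = {
        "tumorBam": tumor_bam,
        "tumorBai": tumor_bai,
        "normalBam": normal_bam,
        "normalBai": normal_bai,
        "reference": reference,
        # Both gather tasks take the same .fai index of the reference.
        "snvsVcfGather.refIndex": ref_index,
        "indelsVcfGather.refIndex": ref_index,
    }
    return {f"{WORKFLOW}.{name}": value for name, value in values.items()}


# Hypothetical paths and reference id, for illustration only.
inputs = build_inputs(
    tumor_bam="/data/tumor.bam",
    tumor_bai="/data/tumor.bam.bai",
    normal_bam="/data/normal.bam",
    normal_bai="/data/normal.bam.bai",
    reference="hg38",
    ref_index="/refs/hg38.fa.fai",
)
inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```

Saving the printed JSON as `inputs.json` gives a file in the shape the Cromwell command expects. Optional parameters can be added the same way, with the same prefix.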

Optional workflow parameters:

Parameter | Type | Default | Description
--- | --- | --- | ---
`bedFile` | String? | None | BED file designating regions to process
`numChunk` | Int? | None | If a BED file is given, number of chunks into which each chromosome is split
`outputFileNamePrefix` | String | "strelkaSomatic" | Prefix for output files

Optional task parameters:

Parameter | Type | Default | Description
--- | --- | --- | ---
`splitIntervals.gatk` | String | "$GATK_ROOT/bin/gatk" | GATK executable path
`splitIntervals.splitIntervalsExtraArgs` | String? | None | Additional arguments for the 'gatk SplitIntervals' command
`splitIntervals.nonRefModules` | String | "gatk/4.1.2.0" | Environment modules other than the genome reference
`splitIntervals.memory` | Int | 32 | Memory allocated for job
`splitIntervals.overhead` | Int | 8 | Memory overhead for running on a node
`splitIntervals.timeout` | Int | 72 | Hours before task timeout
`convertIntervalsToBed.modules` | String | "python/3.7" | Environment modules
`convertIntervalsToBed.memory` | Int | 16 | Memory allocated for job
`convertIntervalsToBed.timeout` | Int | 4 | Hours before task timeout
`configureAndRunParallel.nonRefModules` | String | "python/2.7 samtools/1.9 strelka/2.9.10" | Environment module names other than genome reference
`configureAndRunParallel.jobMemory` | Int | 16 | Memory allocated for job
`configureAndRunParallel.threads` | Int | 4 | Number of threads for processing
`configureAndRunParallel.timeout` | Int | 4 | Hours before task timeout
`snvsVcfGather.modules` | String | "gatk/4.1.2.0" | Environment module names and versions to load (space separated) before command execution
`snvsVcfGather.gatk` | String | "$GATK_ROOT/bin/gatk" | GATK to use
`snvsVcfGather.memory` | Int | 16 | Memory allocated for job
`snvsVcfGather.timeout` | Int | 12 | Hours before task timeout
`indelsVcfGather.modules` | String | "gatk/4.1.2.0" | Environment module names and versions to load (space separated) before command execution
`indelsVcfGather.gatk` | String | "$GATK_ROOT/bin/gatk" | GATK to use
`indelsVcfGather.memory` | Int | 16 | Memory allocated for job
`indelsVcfGather.timeout` | Int | 12 | Hours before task timeout
`configureAndRunSingle.regionsBed` | String? | None | BED file designating regions to process
`configureAndRunSingle.nonRefModules` | String | "python/2.7 samtools/1.9 strelka/2.9.10" | Environment module names other than genome reference
`configureAndRunSingle.jobMemory` | Int | 16 | Memory allocated for job
`configureAndRunSingle.threads` | Int | 4 | Number of threads for processing
`configureAndRunSingle.timeout` | Int | 4 | Hours before task timeout

Outputs

Output | Type | Description
--- | --- | ---
`snvsVcf` | File | VCF file with SNVs, .gz compressed
`indelsVcf` | File | VCF file with indels, .gz compressed

Commands

This section lists the commands run by the strelkaSomatic workflow.

  • Running strelkaSomatic

Main task, configuring and running Strelka2

	set -eo pipefail

	~{writeBed}
	~{indexFeatureFile}

	configureStrelkaSomaticWorkflow.py \
	--normalBam ~{normalBam} \
	--tumorBam ~{tumorBam} \
	--referenceFasta ~{refFasta} \
	~{regionsBedArg} \
	--runDir .

	./runWorkflow.py -m local -j ~{threads}
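
The command above leaves the expansion of `~{regionsBedArg}` implicit. The sketch below shows one way the configure step could be assembled, assuming that the placeholder expands to Strelka2's `--callRegions` option (which takes a bgzip-compressed, tabix-indexed BED file) when regions are supplied and to nothing otherwise. The function name and the file paths are hypothetical and not taken from the workflow.

```python
def build_configure_command(normal_bam, tumor_bam, ref_fasta, call_regions=None):
    """Assemble the argument list for configureStrelkaSomaticWorkflow.py."""
    command = [
        "configureStrelkaSomaticWorkflow.py",
        "--normalBam", normal_bam,
        "--tumorBam", tumor_bam,
        "--referenceFasta", ref_fasta,
    ]
    if call_regions is not None:
        # Assumption: ~{regionsBedArg} corresponds to --callRegions <bed.gz>.
        command += ["--callRegions", call_regions]
    command += ["--runDir", "."]
    return command


# Hypothetical paths, for illustration only.
whole_genome = build_configure_command("normal.bam", "tumor.bam", "ref.fa")
restricted = build_configure_command(
    "normal.bam", "tumor.bam", "ref.fa", call_regions="regions.bed.gz"
)
print(" ".join(restricted))
```

Building the command as a list keeps the optional argument out entirely when no regions are given, which mirrors an empty `~{regionsBedArg}`.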

Converting interval files to .bed

.bed files use 0-based start coordinates, whereas GATK interval files use 1-based start coordinates, so the start of each interval has to be decremented by one during conversion.

        python3 <<CODE
        import os, re
        intervalFiles = re.split(",", "~{sep=',' intervalFiles}")
        for intervalFile in intervalFiles:
            items = re.split(r"\.", os.path.basename(intervalFile)) # raw string avoids an invalid escape sequence
            items.pop() # remove .interval_list suffix
            bedName = ".".join(items)+".bed"
            with open(intervalFile, 'r') as inFile, open(bedName, 'w') as outFile:
                for line in inFile:
                    if not re.match("@", line): # omit the GATK header
                        fields = re.split("\t", line.strip())
                        fields[1] = str(int(fields[1]) - 1)
                        outFile.write("\t".join(fields)+"\n")
        CODE
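
As a worked illustration of the conversion above, the following standalone sketch applies the same start-coordinate shift to a single interval line and derives the .bed file name. The helper names, the sample line, and the file name are made up for the example.

```python
import os


def interval_to_bed(line):
    """Convert one interval_list record (1-based start) to BED (0-based start)."""
    fields = line.strip().split("\t")
    fields[1] = str(int(fields[1]) - 1)  # only the start shifts; the end is unchanged
    return "\t".join(fields)


def bed_name(interval_file):
    """Replace the final suffix of the interval file name with .bed."""
    stem, _suffix = os.path.splitext(os.path.basename(interval_file))
    return stem + ".bed"


# Hypothetical record covering bases 100..200 of chr1 (1-based, inclusive).
record = "chr1\t100\t200\t+\t."
print(interval_to_bed(record))              # chr1	99	200	+	.
print(bed_name("/tmp/0000.interval_list"))  # 0000.bed
```

The end coordinate stays the same because a 1-based inclusive end and a 0-based exclusive end refer to the same position.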

Splitting intervals

SplitIntervals produces the requested number of interval files so that the analysis can run in parallel, which reduces the resources needed per chunk and improves speed.

	set -eo pipefail

	mkdir interval-files
	ln -s ~{refFai}
	ln -s ~{refDict}
	~{gatk} --java-options "-Xmx~{memory-8}g" SplitIntervals \
	-R ~{refFasta} \
	~{intervalsArg} \
	~{scatterArg} \
	--subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION \
	-O interval-files \
	~{splitIntervalsExtraArgs}

	cp interval-files/*.interval_list .
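
The `-Xmx` value in the command above is the task memory minus 8, which matches the documented default of `splitIntervals.overhead`. The arithmetic can be sketched as follows. Treating the subtracted amount as a parameter is an assumption of this example, as are the function name and the interpretation of the values as gigabytes (suggested by the `g` suffix).

```python
def java_heap_option(memory_gb, overhead_gb=8):
    """Return the JVM heap flag left after reserving overhead for the node."""
    heap_gb = memory_gb - overhead_gb
    if heap_gb <= 0:
        raise ValueError("memory must be larger than the overhead")
    return f"-Xmx{heap_gb}g"


# With the documented defaults (memory 32, overhead 8):
print(java_heap_option(32))  # -Xmx24g
```

Lowering `splitIntervals.memory` to 8 or below would leave no heap for the JVM, so the sketch rejects that case.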

Combining vcf files after scatter

This task uses the GatherVcfs tool from the GATK suite, which requires the input files to be supplied in genomic order. The regions covered by the input files must not overlap.

   set -eo pipefail

   ~{gatk} GatherVcfs \
   -I ~{sep=" -I " vcfs} \
   -R ~{refFasta} \
   -O ~{outputName}

   gzip ~{outputName}
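
The ordering requirement can be illustrated with a small sketch that sorts chunk regions by the contig order of a .fai index and checks that they do not overlap. This is an illustration of the idea only, not the workflow's own implementation. The function names, the .fai contents, and the regions are invented for the example, and regions are treated as 0-based, half-open intervals.

```python
def contig_order(fai_text):
    """Map each contig name to its rank in a .fai index (name is column 1)."""
    order = {}
    for line in fai_text.splitlines():
        if line.strip():
            order[line.split("\t")[0]] = len(order)
    return order


def sort_regions(regions, order):
    """Sort (contig, start, end) tuples by .fai contig rank, then by start."""
    return sorted(regions, key=lambda region: (order[region[0]], region[1]))


def has_overlap(sorted_regions):
    """Return True if any two regions on the same contig overlap."""
    for previous, current in zip(sorted_regions, sorted_regions[1:]):
        if previous[0] == current[0] and current[1] < previous[2]:
            return True
    return False


# Made-up index: contig name, length, offset, line bases, line width.
fai = "chr1\t5000\t6\t60\t61\nchr2\t4000\t5096\t60\t61\nchr10\t3000\t9170\t60\t61\n"
regions = [("chr10", 0, 1000), ("chr2", 500, 900), ("chr1", 0, 500), ("chr2", 0, 500)]

ordered = sort_regions(regions, contig_order(fai))
print(ordered)
print(has_overlap(ordered))
```

Sorting by the index rather than by name matters because a plain alphabetical sort would place `chr10` before `chr2`.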

Support

For support, please file an issue on the GitHub project or send an email to gsi@oicr.on.ca.

Generated with generate-markdown-readme (https://github.com/oicr-gsi/gsi-wdl-tools/)


License: GNU General Public License v3.0

