oicr-gsi / mutect2ConsensusTumorOnly

The Mutect2Consensus workflow with only tumor inputs

mutect2ConsensusTumorOnly

The Mutect2Consensus workflow processes umiConsensus outputs for the tumour sample through mutect2 in tumour-only mode to call variants, then annotates the calls.

Overview

Dependencies

Usage

Cromwell

java -jar cromwell.jar run mutect2ConsensusTumorOnly.wdl --inputs inputs.json

Inputs

Required workflow parameters:

| Parameter | Value | Description |
|---|---|---|
| tumorInputGroup | InputGroup | partitioned bam files from umiConsensus outputs for tumor sample |
| outputFileNamePrefix | String | Prefix to use for output file |
| intervalFile | String | interval file to subset variant calls |
| inputIntervalsToParalellizeBy | String | intervals for parallelization |
| tumorName | String | Name of the tumor sample |
| reference | String | reference version |
| combineVariants.workflows | Array[String] | array of ids of producer workflows |
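
Before launching Cromwell, it can help to confirm that inputs.json carries every required parameter. The following is a minimal sketch, not part of the workflow: it assumes that Cromwell input keys are prefixed with the workflow name and that this name matches the WDL file name (mutect2ConsensusTumorOnly). The helper names are hypothetical.

```python
import json

# Required workflow parameters, copied from the table above.
REQUIRED_PARAMETERS = [
    "tumorInputGroup",
    "outputFileNamePrefix",
    "intervalFile",
    "inputIntervalsToParalellizeBy",
    "tumorName",
    "reference",
    "combineVariants.workflows",
]

# Assumption: input keys are prefixed with the workflow name,
# taken here to be the same as the WDL file name.
WORKFLOW_NAME = "mutect2ConsensusTumorOnly"


def missing_required_inputs(inputs, workflow_name=WORKFLOW_NAME):
    """Return the required keys that are absent from an inputs dict."""
    expected = [f"{workflow_name}.{name}" for name in REQUIRED_PARAMETERS]
    return [key for key in expected if key not in inputs]


def check_inputs_file(path):
    """Load an inputs JSON file and report its missing required keys."""
    with open(path) as handle:
        return missing_required_inputs(json.load(handle))
```

Calling check_inputs_file("inputs.json") returns an empty list when nothing is missing. The check only looks at key names; it does not validate the values or the structure of tumorInputGroup.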

Optional workflow parameters:

Parameter Value Default Description

Optional task parameters:

| Parameter | Value | Default | Description |
|---|---|---|---|
| mutect2.filter_timeout | Int | 12 | Hours before task timeout |
| mutect2.filter_memory | Int | 16 | Memory allocated for job |
| mutect2.filter_filterExtraArgs | String? | None | Extra arguments |
| mutect2.mergeStats_timeout | Int | 5 | Hours before task timeout |
| mutect2.mergeStats_memory | Int | 4 | Memory allocated for job |
| mutect2.mergeStats_modules | String | "gatk/4.1.6.0" | Names and versions of modules to load |
| mutect2.mergeVCFs_timeout | Int | 12 | Hours before task timeout |
| mutect2.mergeVCFs_memory | Int | 4 | Memory allocated for job |
| mutect2.runMutect2_timeout | Int | 24 | Hours before task timeout |
| mutect2.runMutect2_memory | Int | 32 | Memory allocated for job |
| mutect2.runMutect2_threads | Int | 4 | Number of threads to request |
| mutect2.runMutect2_mutect2ExtraArgs | String? | None | Extra arguments |
| mutect2.runMutect2_mutectTag | String | "mutect2" | Tag |
| mutect2.splitStringToArray_modules | String | "" | Names and versions of modules to load |
| mutect2.splitStringToArray_timeout | Int | 1 | Hours before task timeout |
| mutect2.splitStringToArray_memory | Int | 1 | Memory allocated for job |
| mutect2.splitStringToArray_lineSeparator | String | "," | line separator |
| mutect2.normalBam | File? | None | Input normal file (bam or sam) |
| mutect2.normalBai | File? | None | Index file for normal bam |
| mutect2.pon | File? | None | pon |
| mutect2.ponIdx | File? | None | pon index |
| mutect2.gnomad | File? | None | gnomad |
| mutect2.gnomadIdx | File? | None | gnomad index |
| getFileName.jobMemory | Int | 4 | memory allocated to preprocessing, in GB |
| getFileName.timeout | Int | 1 | timeout in hours |
| getFileName.threads | Int | 1 | number of cpu threads to be used |
| combineVariants.jobMemory | Int | 24 | memory allocated to preprocessing, in GB |
| combineVariants.timeout | Int | 20 | timeout in hours |
| combineVariants.threads | Int | 8 | number of cpu threads to be used |
| annotation.modules | String | "samtools/1.9 bcftools/1.9 htslib/1.9 tabix/1.9" | module for running preprocessing |
| annotation.jobMemory | Int | 24 | memory allocated to preprocessing, in GB |
| annotation.timeout | Int | 20 | timeout in hours |
| annotation.threads | Int | 8 | number of cpu threads to be used |
| variantEffectPredictor.mergeVcfs_timeout | Int | 24 | Maximum amount of time (in hours) the task can run for. |
| variantEffectPredictor.mergeVcfs_threads | Int | 4 | Requested CPU threads. |
| variantEffectPredictor.mergeVcfs_overhead | Int | 6 | Java overhead memory (in GB). jobMemory - overhead == java Xmx/heap memory. |
| variantEffectPredictor.mergeVcfs_jobMemory | Int | 24 | Memory allocated to job (in GB). |
| variantEffectPredictor.mergeVcfs_extraArgs | String? | None | Additional arguments to be passed directly to the command. |
| variantEffectPredictor.mergeVcfs_modules | String | "gatk/4.1.7.0" | Required environment modules. |
| variantEffectPredictor.mergeMafs_timeout | Int | 24 | Maximum amount of time (in hours) the task can run for. |
| variantEffectPredictor.mergeMafs_threads | Int | 4 | Requested CPU threads. |
| variantEffectPredictor.mergeMafs_jobMemory | Int | 24 | Memory allocated to job (in GB). |
| variantEffectPredictor.mergeMafs_modules | String | "tabix/0.2.6" | Required environment modules |
| variantEffectPredictor.vcf2maf_timeout | Int | 48 | Hours before task timeout |
| variantEffectPredictor.vcf2maf_threads | Int | 4 | Requested CPU threads |
| variantEffectPredictor.vcf2maf_jobMemory | Int | 32 | Memory allocated for this job (GB) |
| variantEffectPredictor.vcf2maf_bufferSize | Int | 200 | The buffer size |
| variantEffectPredictor.vcf2maf_minHomVaf | Float | 0.7 | The minimum vaf for homozygous calls |
| variantEffectPredictor.vcf2maf_vepStats | Boolean | true | If true, run vep without the '--no_stats' flag; if false, run vep with '--no_stats' |
| variantEffectPredictor.vcf2maf_species | String | "homo_sapiens" | Species name |
| variantEffectPredictor.vcf2maf_basename | String | basename("~{vcfFile}",".vcf.gz") | Base name |
| variantEffectPredictor.tumorOnlyAlign_timeout | Int | 6 | Hours before task timeout |
| variantEffectPredictor.tumorOnlyAlign_threads | Int | 4 | Requested CPU threads |
| variantEffectPredictor.tumorOnlyAlign_jobMemory | Int | 32 | Memory allocated for this job (GB) |
| variantEffectPredictor.tumorOnlyAlign_modules | String | "bcftools/1.9 tabix/0.2.6" | Required environment modules |
| variantEffectPredictor.tumorOnlyAlign_basename | String | basename("~{vcfFile}",".vcf.gz") | Base name |
| variantEffectPredictor.vep_timeout | Int | 16 | Hours before task timeout |
| variantEffectPredictor.vep_threads | Int | 4 | Requested CPU threads |
| variantEffectPredictor.vep_jobMemory | Int | 32 | Memory allocated for this job (GB) |
| variantEffectPredictor.vep_vepStats | Boolean | true | If true, run vep without the '--no_stats' flag; if false, run vep with '--no_stats' |
| variantEffectPredictor.vep_species | String | "homo_sapiens" | Species name |
| variantEffectPredictor.vep_addParam | String? | None | Additional vep parameters |
| variantEffectPredictor.vep_basename | String | basename("~{vcfFile}",".vcf.gz") | Base name |
| variantEffectPredictor.subsetVcf_timeout | Int | 6 | Maximum amount of time (in hours) the task can run for. |
| variantEffectPredictor.subsetVcf_threads | Int | 4 | Requested CPU threads. |
| variantEffectPredictor.subsetVcf_jobMemory | Int | 32 | Memory allocated to job (in GB). |
| variantEffectPredictor.subsetVcf_modules | String | "bcftools/1.9" | Required environment modules |
| variantEffectPredictor.subsetVcf_basename | String | basename("~{vcfFile}",".vcf.gz") | Base name |
| variantEffectPredictor.chromosomeArray_timeout | Int | 1 | Maximum amount of time (in hours) the task can run for. |
| variantEffectPredictor.chromosomeArray_threads | Int | 4 | Requested CPU threads. |
| variantEffectPredictor.chromosomeArray_jobMemory | Int | 1 | Memory allocated to job (in GB). |
| variantEffectPredictor.getSampleNames_timeout | Int | 1 | Hours before task timeout |
| variantEffectPredictor.getSampleNames_threads | Int | 4 | Requested CPU threads |
| variantEffectPredictor.getSampleNames_jobMemory | Int | 1 | Memory allocated for this job (GB) |
| variantEffectPredictor.targetBedTask_timeout | Int | 6 | Hours before task timeout |
| variantEffectPredictor.targetBedTask_threads | Int | 4 | Requested CPU threads |
| variantEffectPredictor.targetBedTask_jobMemory | Int | 32 | Memory allocated for this job (GB) |
| variantEffectPredictor.targetBedTask_modules | String | "bedtools/2.27 tabix/0.2.6" | Required environment modules |
| variantEffectPredictor.targetBedTask_basename | String | basename("~{vcfFile}",".vcf.gz") | Base name |
| variantEffectPredictor.normalName | String? | None | Name of the normal sample |
| filterMaf.mafNormalFile | File? | None | input file for normal sample |
| filterMaf.freqList | String | "$MAF_FILTERING_ROOT/TGL.frequency.20210609.annot.txt" | frequency list used in maf annotation |
| filterMaf.genesToKeep | String | "$MAF_FILTERING_ROOT/genes_to_keep.txt" | gene list in maf filtering |
| filterMaf.modules | String | "python/3.9 pandas/1.4.2 maf-filtering/2023-10-06" | module for running preprocessing |
| filterMaf.jobMemory | Int | 8 | memory allocated to preprocessing, in GB |
| filterMaf.timeout | Int | 1 | timeout in hours |
| filterMaf.threads | Int | 1 | number of cpu threads to be used |
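
The mergeVcfs_overhead description states that the Java heap is jobMemory minus overhead. As a small worked example of that arithmetic, the sketch below computes the heap for the default values. The function name is hypothetical, and the "-Xmx...G" formatting is an assumption borrowed from the CombineVariants command documented under Commands.

```python
def java_heap_gb(job_memory_gb, overhead_gb):
    """Heap size (GB) left for the JVM after reserving the overhead."""
    heap = job_memory_gb - overhead_gb
    if heap <= 0:
        raise ValueError("overhead must be smaller than jobMemory")
    return heap


# Defaults from the table: mergeVcfs_jobMemory = 24, mergeVcfs_overhead = 6
print(f"-Xmx{java_heap_gb(24, 6)}G")  # prints -Xmx18G
```

Raising overhead without raising jobMemory shrinks the heap, so the two values are best adjusted together.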

Outputs

| Output | Type | Description | Labels |
|---|---|---|---|
| tumorDcsScVcf | File | DCS vcf for tumor sample | vidarr_label: tumorDcsScVcf |
| tumorDcsScVcfIndex | File | DCS vcf index for tumor sample | vidarr_label: tumorDcsScVcfIndex |
| tumorSscsScVcf | File | SSCS vcf for tumor sample | vidarr_label: tumorSscsScVcf |
| tumorSscsScVcfIndex | File | SSCS vcf index for tumor sample | vidarr_label: tumorSscsScVcfIndex |
| tumorAllUniqueVcf | File | vcf of DCS + singletons for tumor sample | vidarr_label: tumorAllUniqueVcf |
| tumorAllUniqueVcfIndex | File | vcf index for DCS + singletons for tumor sample | vidarr_label: tumorAllUniqueVcfIndex |
| tumorVepVcf | File | vep vcf for tumor sample | vidarr_label: tumorVepVcf |
| tumorVepVcfIndex | File | vep vcf index for tumor sample | vidarr_label: tumorVepVcfIndex |
| tumorMafOutput | File? | maf output for tumor sample | vidarr_label: tumorMafOutput |
| filterredMaf | File? | maf file after filtering | vidarr_label: filterredMaf |

Commands

This section lists the command(s) run by the mutect2ConsensusTumorOnly workflow.

  • Running mutect2ConsensusTumorOnly
     basename ~{fileName} | cut -d. -f1 
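
The command above strips the directory from a path and keeps everything before the first dot. A rough Python equivalent, shown only to illustrate the result, could look like this. The function name and the example path are hypothetical.

```python
import os


def file_name_prefix(path):
    """Mimic `basename PATH | cut -d. -f1`: file name up to the first dot."""
    return os.path.basename(path).split(".")[0]


# Hypothetical input path, for illustration only.
print(file_name_prefix("/data/sample1.dcs.sc.bam"))  # prints sample1
```

Because the split is on the first dot, a file name that itself contains dots before the extension is truncated at the first one.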

Preprocess and CombineVariants

   python3<<CODE
   import subprocess
   import sys
   inputStrings = []
   v = "~{sep=' ' inputVcfs}"
   vcfFiles = v.split()
   w = "~{sep=' ' workflows}"
   workflowIds = w.split()
   priority = "~{priority}"
   
   if len(vcfFiles) != len(workflowIds):
       print("The arrays with input files and their respective workflow names are not of equal size!")
   else:
       for f in range(0, len(vcfFiles)):
           inputStrings.append("--variant:" + workflowIds[f] + " " + vcfFiles[f])
 
   javaMemory = ~{jobMemory} - 6 
   gatkCommand  = "$JAVA_ROOT/bin/java -Xmx" + str(javaMemory) + "G -jar $GATK_ROOT/GenomeAnalysisTK.jar "
   gatkCommand += "-T CombineVariants "
   gatkCommand += " ".join(inputStrings)
   gatkCommand += " -R ~{referenceFasta} "
   gatkCommand += "-o ~{outputPrefix}_combined.vcf.gz "
   gatkCommand += "-genotypeMergeOptions PRIORITIZE "
   gatkCommand += "-priority " + priority
   gatkCommand += " 2>&1"
 
   result_output = subprocess.run(gatkCommand, shell=True)
   sys.exit(result_output.returncode)
   CODE
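
In the script above, a length mismatch between the VCF list and the workflow id list only prints a message; the GATK command is still assembled and run, without any --variant arguments. A minimal sketch of a stricter pairing step, which is not the workflow's own code, would fail early instead. The function name, file names and workflow ids below are hypothetical.

```python
def build_variant_args(vcf_files, workflow_ids):
    """Pair each VCF with its workflow id as CombineVariants arguments.

    Raises ValueError on a length mismatch instead of continuing.
    """
    if len(vcf_files) != len(workflow_ids):
        raise ValueError(
            "The arrays with input files and their respective "
            "workflow names are not of equal size!"
        )
    args = []
    for workflow_id, vcf in zip(workflow_ids, vcf_files):
        # Same "--variant:<id> <file>" form as in the script above.
        args.extend([f"--variant:{workflow_id}", vcf])
    return args


# Hypothetical inputs, for illustration only.
print(" ".join(build_variant_args(
    ["dcs.vcf.gz", "sscs.vcf.gz"],
    ["mutect2-dcs", "mutect2-sscs"],
)))
```

Returning a list of arguments rather than one concatenated string also makes it possible to call subprocess.run without shell=True, at the cost of expanding $JAVA_ROOT and $GATK_ROOT in Python.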

Annotate with bcftools

   bcftools annotate -a ~{uniqueVcf} \
     -c FMT/AD,FMT/DP ~{mergedVcf} -Oz \
     -o "~{outputPrefix}.merged.vcf.gz"

   tabix -p vcf "~{outputPrefix}.merged.vcf.gz"

Generate consensus calls

     python3<<CODE
     ## Adapted from https://github.com/oicr-gsi/djerba/blob/GCGI-806_v1.0.0-dev/src/lib/djerba/plugins/tar/snv_indel/plugin.py
     ## this code will filter a maf file, generated from tumor-only mutect2 calls 
     import pandas as pd
     maf_file_path = "~{mafFile}"
     maf_normal_path = "~{mafNormalFile}"
     freq_list_path = "~{freqList}"
     output_path_prefix = "~{outputPrefix}"
     genes_to_keep_path = "~{genesToKeep}"
 
     if maf_normal_path:
       df_bc = pd.read_csv(maf_normal_path,
                       sep = "\t",
                       on_bad_lines="error",
                       compression='gzip',
                       skiprows=[0])
 
     df_pl = pd.read_csv(maf_file_path,
                     sep = "\t",
                     on_bad_lines="error",
                     compression='gzip',
                     skiprows=[0])
     df_freq = pd.read_csv(freq_list_path,
                   sep = "\t")
     with open(genes_to_keep_path) as f:
       GENES_TO_KEEP = f.read()
 
 
     for row in df_pl.iterrows():
       hugo_symbol = row[1]['Hugo_Symbol']
       chromosome = row[1]['Chromosome']
       start_position = row[1]['Start_Position']
       reference_allele = row[1]['Reference_Allele']
       allele = row[1]['Allele']
 
       # If there is normal input, annotate rows with information from the matched normal and from the frequency table
       if maf_normal_path:
         # Lookup the entry in the BC and annotate the tumour maf with
         #   n_depth, n_ref_count, n_alt_count
 
         row_lookup = df_bc[(df_bc['Hugo_Symbol'] == hugo_symbol) & 
                     (df_bc['Chromosome'] == chromosome) & 
                     (df_bc['Start_Position'] == start_position) &
                     (df_bc['Reference_Allele'] == reference_allele) &
                     (df_bc['Allele'] == allele)]
 
 
         # If there's only one entry, take its normal values
         if len(row_lookup) == 1:
             df_pl.at[row[0], "n_depth"] = row_lookup['n_depth'].item()
             df_pl.at[row[0], "n_ref_count"] = row_lookup['n_ref_count'].item()
             df_pl.at[row[0], "n_alt_count"] = row_lookup['n_alt_count'].item()
       
         # If the entry isn't in the table, 
         # or if there is more than one value and so you can't choose which normal values to take, 
         # set them as 0
         else:
             df_pl.at[row[0], "n_depth"] = 0
             df_pl.at[row[0], "n_ref_count"] = 0
             df_pl.at[row[0], "n_alt_count"] = 0
             
       # Lookup the entry in the frequency table and annotate the tumour maf with Freq
     
       row_lookup = df_freq[(df_freq['Start_Position'] == row[1]['Start_Position']) &
                           (df_freq['Reference_Allele'] == row[1]['Reference_Allele']) &
                           ((df_freq['Tumor_Seq_Allele'] == row[1]['Tumor_Seq_Allele1']) |
                           (df_freq['Tumor_Seq_Allele'] == row[1]['Tumor_Seq_Allele2']))]
 
       if len(row_lookup) > 0:
           df_pl.at[row[0], 'Freq'] = row_lookup['Freq'].item()
       else:
           df_pl.at[row[0], 'Freq'] = 0
 
     # Filter the maf: drop a row if its gene is not in GENES_TO_KEEP, or if its Freq, n_alt_count or gnomAD_AF exceeds the threshold
     for row in df_pl.iterrows():
         hugo_symbol = row[1]['Hugo_Symbol']
         frequency = row[1]['Freq']
         gnomAD_AF = row[1]['gnomAD_AF']
         n_alt_count = row[1]['n_alt_count']
         if hugo_symbol not in GENES_TO_KEEP or frequency > 0.1 or n_alt_count > 4 or gnomAD_AF > 0.001:
             df_pl = df_pl.drop(row[0])   
 
     df_pl.to_csv(output_path_prefix + '_filtered_maf_for_tar.maf.gz', sep = "\t", compression='gzip', index=False)
     CODE
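
The filtering rule in the final loop of the script can be written as a single predicate, which makes the thresholds easier to see. The sketch below is an illustration, not the workflow's code. One point to note is that the script reads the gene list as one raw string, so its `in` test is a substring match; the sketch instead assumes the file holds one gene symbol per line and uses exact matching, which is an assumption about the intended behaviour. The function names and gene symbols are hypothetical examples.

```python
def parse_gene_list(text):
    """Split a gene list (assumed one symbol per line) into a set."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def keep_row(row, genes_to_keep,
             max_freq=0.1, max_n_alt=4, max_gnomad_af=0.001):
    """Return True if a MAF row passes all filters used in the script."""
    if row["Hugo_Symbol"] not in genes_to_keep:
        return False
    if row["Freq"] > max_freq:
        return False
    if row["n_alt_count"] > max_n_alt:
        return False
    if row["gnomAD_AF"] > max_gnomad_af:
        return False
    return True


# Hypothetical gene list and rows, for illustration only.
genes = parse_gene_list("TP53\nKRAS\n")
rows = [
    {"Hugo_Symbol": "TP53", "Freq": 0.0, "n_alt_count": 0, "gnomAD_AF": 0.0},
    {"Hugo_Symbol": "KRAS", "Freq": 0.3, "n_alt_count": 0, "gnomAD_AF": 0.0},
    {"Hugo_Symbol": "RAS", "Freq": 0.0, "n_alt_count": 0, "gnomAD_AF": 0.0},
]
kept = [r["Hugo_Symbol"] for r in rows if keep_row(r, genes)]
print(kept)  # prints ['TP53']
```

With the raw-string test used in the script, the hypothetical symbol "RAS" would pass the gene check because it is a substring of "KRAS"; with the set it does not. A missing gnomAD_AF (NaN) compares as not greater than the threshold, so such rows are kept in both versions.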

Support

For support, please file an issue on the GitHub project or send an email to gsi@oicr.on.ca.

Generated with generate-markdown-readme (https://github.com/oicr-gsi/gsi-wdl-tools/)
