qiaseq-dna

This repository contains some example code for processing reads from QIAGEN QIAseq DNA enrichment kits.

python run examples

The top-level script run_qiaseq_dna.py processes reads in FASTQ format to produce an annotated VCF along with relevant UMI-level and read-level metrics. Variants are called using smCounter. The smCounter-v1 variant calling procedure is described here:

Detecting very low allele fraction variants using targeted DNA sequencing and a novel molecular barcode-aware variant caller, BMC Genomics, 2017 18:5

The smCounter variant caller uses a statistical model that requires both raw sequencer base calls and identification of unique input molecules using the UMI tag and the genome position of the random fragmentation site.
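The idea of identifying unique input molecules can be illustrated with a minimal Python sketch. It is not the code used in this repository: the `Read` record, its field names, and the exact-match grouping rule are assumptions made for illustration. A production implementation would likely also need to tolerate sequencing errors in the UMI and small shifts in the fragment start position.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical, simplified view of one aligned read."""
    name: str
    umi: str         # UMI tag sequence parsed from the read
    chrom: str
    strand: str      # "+" or "-"
    frag_start: int  # genome position of the random fragmentation site


def group_by_molecule(reads):
    """Group reads that share UMI, chromosome, strand and fragmentation site.

    Each distinct key is treated as one putative original input molecule.
    """
    molecules = defaultdict(list)
    for read in reads:
        key = (read.umi, read.chrom, read.strand, read.frag_start)
        molecules[key].append(read.name)
    return dict(molecules)


reads = [
    Read("r1", "ACGTACGTACGT", "chr1", "+", 1000),
    Read("r2", "ACGTACGTACGT", "chr1", "+", 1000),  # same UMI and site as r1
    Read("r3", "ACGTACGTACGT", "chr1", "+", 1042),  # same UMI, different site
    Read("r4", "TTGCAAGGCTTA", "chr1", "+", 1000),  # same site, different UMI
]
molecules = group_by_molecule(reads)
print(len(molecules))  # 3
```

Here r1 and r2 collapse into one molecule, while r3 and r4 each count as a separate molecule because either the fragmentation site or the UMI differs.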

smCounter-v2 is described in this publication:

smCounter2: an accurate low-frequency variant caller for targeted sequencing data with unique molecular identifiers, Bioinformatics, 06 September 2018

We are actively developing smCounter-v2 to support duplex variant calling as well. The method is described in the publication below:

Targeted Single Primer Enrichment Sequencing with Single End Duplex-UMI, Scientific Reports 9, Article number: 4810, 2019

There are additional scripts under misc_workflow/:

run_dedup - Use the Picard MarkDuplicates utility from the Broad Institute to remove PCR duplicate reads using the UMI and both paired-end read start locations on the reference genome. This might be useful for subsequent germline variant calling, or structural variant detection applications. Note that this script does not filter ligation chimera reads, nor does it remove PCR duplicate reads caused by internal re-priming by a downstream SPE primer.
run_consensus - Use the "fgbio" package from Fulcrum Genomics to create a single consensus read pair for each input molecule. This might be useful to prepare consensus read alignments for a subsequent SNP/indel variant calling procedure that does not use UMI-tagged reads (such as VarDict, MuTect, etc.). Users might need to tune the fgbio parameters (in addition to variant calling parameters) for their application. We have not benchmarked variant calling performance using the consensus-read BAM generated by this pipeline.
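To give a feel for what "a single consensus read for each input molecule" means, here is a minimal Python sketch of a per-position majority vote over the reads of one molecule. This is not fgbio's algorithm and does not attempt to reproduce it. The function name, the `min_fraction` threshold, and the equal-length requirement are assumptions made for illustration only.

```python
from collections import Counter


def majority_consensus(sequences, min_fraction=0.6):
    """Per-position majority vote over equal-length reads from one molecule.

    Positions where the most common base falls below min_fraction of the
    reads are masked with 'N'.
    """
    if not sequences:
        raise ValueError("need at least one read")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("reads must have equal length in this sketch")
    consensus = []
    for column in zip(*sequences):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) >= min_fraction else "N")
    return "".join(consensus)


# Three reads from the same molecule; the third carries one mismatch.
family = ["ACGTAC", "ACGTAC", "ACCTAC"]
print(majority_consensus(family))  # ACGTAC
```

The mismatch in the third read is outvoted by the other two, so the consensus carries the majority base at that position.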

python packages

core

This package contains modules that trim common regions of the reads, align reads to the reference genome, identify putative original input molecules, and trim SPE primer regions from the genome alignments. These steps generate a BAM file suitable for subsequent variant calling.

metrics

This package contains auxiliary modules that provide read accounting and enrichment summary metrics such as uniformity and fragment length distribution.

qiaseq-smCounter-v1

This package contains the smCounter-v1 variant caller.

qiaseq-smCounter-v2

This package contains the smCounter-v2 variant caller.

annotate

This package contains downstream VCF annotation using snpEff.

Docker image for third-party dependencies

The python modules in this repository have many dependencies on third-party NGS software (e.g. BWA, samtools, etc.) and GNU/Linux utilities (sort, zcat, etc.). Please DO NOT ATTEMPT to use the python modules in this git repository without first running the code on the example read set using our Docker image:

### Pull the docker image
sudo docker pull qiaseq/qiaseq-dna

### Install gsutil. Running pip install gsutil would likely be enough. See here for details: https://cloud.google.com/storage/docs/gsutil_install#deb

### Get data dependencies, this will create a directory named data in your current folder
gsutil -m cp -r gs://qiaseq-dna/data ./

### cd to your_fav_dir and get example fastqs, roi and primer files
wget https://storage.googleapis.com/qiaseq-dna/example/NEB_S2_L001_R1_001.fastq.gz \
https://storage.googleapis.com/qiaseq-dna/example/NEB_S2_L001_R2_001.fastq.gz \
https://storage.googleapis.com/qiaseq-dna/example/DHS-101Z.primers.txt \
https://storage.googleapis.com/qiaseq-dna/example/DHS-101Z.roi.bed

### Run a container from the image above interactively, mounting the data directory and your run directory. The output files will also be created in the run directory.
sudo docker run -it -v /home/your_data_dir/data:/srv/qgen/data/ -v /home/your_fav_dir/:/srv/qgen/example/ qiaseq/qiaseq-dna

### Change directory and get the latest code from github
cd /srv/qgen/code/
git clone --recursive https://github.com/qiaseq/qiaseq-dna.git

### Change to run directory and copy over parameters file
cd /srv/qgen/example
cp /srv/qgen/code/qiaseq-dna/run_sm_counter_v2.params.txt ./

### Edit the bottom of run_sm_counter_v2.params.txt if you need to change the read set and primer file

### Run the pipeline
python /srv/qgen/code/qiaseq-dna/run_qiaseq_dna.py run_sm_counter_v2.params.txt v2 single NEB_S2 > run.log 2>&1 &

The parameters are explained below :

run_sm_counter_v2.params.txt : The config file with prepopulated parameters.

v2 : smCounter variant caller version to use. You can specify v1 or v2. Please use run_sm_counter_v2.params.txt if specifying v2.

single : Whether this is a single read set analysis or tumor-normal.

NEB_S2 : Corresponds to the name of the readSet, should match the section in the params file. For tumor-normal analysis please specify the two read set names delimited by a space.
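Because the read set name must match a section in the params file, it can help to check this before launching a long run. The Python sketch below is not part of the pipeline. It assumes the params file uses an INI-style layout readable by `configparser`; the `[general]` section and the helper name `missing_read_sets` are hypothetical.

```python
import configparser

# Illustrative params text. The INI-style layout and the [general] section
# are assumptions; only readFile1 and readFile2 are named in this README.
EXAMPLE_PARAMS = """
[general]

[NEB_S2]
readFile1 = /srv/qgen/example/NEB_S2_L001_R1_001.fastq.gz
readFile2 = /srv/qgen/example/NEB_S2_L001_R2_001.fastq.gz
"""

# Number of read set names expected on the command line per analysis type.
EXPECTED_COUNT = {"single": 1, "tumor-normal": 2}


def missing_read_sets(params_text, analysis, read_sets):
    """Return the read set names that have no matching section."""
    if analysis not in EXPECTED_COUNT:
        raise ValueError(f"unknown analysis type: {analysis}")
    if len(read_sets) != EXPECTED_COUNT[analysis]:
        raise ValueError(
            f"{analysis} expects {EXPECTED_COUNT[analysis]} read set name(s)"
        )
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep parameter names case-sensitive
    parser.read_string(params_text)
    return [name for name in read_sets if not parser.has_section(name)]


print(missing_read_sets(EXAMPLE_PARAMS, "single", ["NEB_S2"]))  # []
print(missing_read_sets(EXAMPLE_PARAMS, "tumor-normal",
                        ["NEB_S2", "normal_readset"]))  # ['normal_readset']
```

An empty list means every read set named on the command line has a section in the params text.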

For Ion Torrent reads

### convert unmapped bam to fastq, this can be done inside the container
bedtools bamtofastq -i {ubam} -fq {fastq1}

Where :

ubam : your unmapped bam file

fastq1 : R1 fastq file name

Update the parameters in the config file :

uBam : Add this new parameter to the read set section at the bottom of the file, set to the path of the unmapped bam file

readFile1 : Should be the R1 fastq obtained from the above bedtools command

readFile2 : Replace 'R1' with 'R2' in readFile1

platform : Change this to 'IonTorrent'
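The four parameter updates above can be collected in a small Python helper. This is a sketch, not pipeline code: the function name and the example paths are hypothetical, and replacing only the last occurrence of 'R1' is an assumption made to avoid touching an 'R1' that happens to appear earlier in the path.

```python
def ion_torrent_params(ubam_path, fastq1_path):
    """Build the read set parameters described above for Ion Torrent reads."""
    # Split around the last 'R1' in the path (assumption: that is the read tag).
    head, sep, tail = fastq1_path.rpartition("R1")
    if not sep:
        raise ValueError("readFile1 must contain 'R1'")
    return {
        "uBam": ubam_path,
        "readFile1": fastq1_path,
        "readFile2": head + "R2" + tail,
        "platform": "IonTorrent",
    }


# Hypothetical example paths inside the container.
params = ion_torrent_params(
    "/srv/qgen/example/sample.ubam.bam",
    "/srv/qgen/example/sample_R1.fastq",
)
for key, value in params.items():
    print(f"{key} = {value}")
```

The printed lines can then be copied into the read set section of the params file.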

Run the python script run_qiaseq_dna.py as before

For tumor-normal analysis

Create two read set sections in the params file, one for the tumor and one for the normal.

In the tumor read set section :

Set runCNV = True to obtain copy number estimates.

Also add a new parameter, refUmiFiles = /your/run_dir/{normal_readset}.sum.primer.umis.txt, where {normal_readset} is the name of your normal read set.

You may also give a comma-delimited string with the paths to multiple sum.primer.umis.txt files for the CNV normalization.
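As an illustration, the refUmiFiles value for one or more normal read sets could be assembled as in the Python sketch below. The helper name and the read set names `normalA` and `normalB` are hypothetical; only the `{normal_readset}.sum.primer.umis.txt` naming pattern comes from the description above.

```python
import posixpath


def ref_umi_files(run_dir, normal_read_sets):
    """Join the sum.primer.umis.txt paths of the normals into one string."""
    paths = [
        posixpath.join(run_dir, f"{name}.sum.primer.umis.txt")
        for name in normal_read_sets
    ]
    return ",".join(paths)


# Hypothetical normal read set names.
value = ref_umi_files("/your/run_dir", ["normalA", "normalB"])
print(f"refUmiFiles = {value}")
```

With a single normal read set the result contains one path and no comma.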

Run the pipeline as :

python /srv/qgen/code/qiaseq-dna/run_qiaseq_dna.py run_sm_counter_v2.params.txt v2 tumor-normal tumor_readset normal_readset > run.log 2>&1 &

The dependencies are fully documented in the Dockerfile in this repository.

Please address questions to raghavendra.padmanabhan@qiagen.com, with CC to john.dicarlo@qiagen.com.
