Nanopore variant calling pipeline

Nextflow pipeline to call variants from Nanopore FASTQ files from bacterial clones relative to a wildtype control.

The pipeline broadly recapitulates, where possible, the GATK best practices for germline short variant calling.

Processing steps

For each sample:

Quality-trim reads using cutadapt.
Map to genome FASTA using minimap2.
Mark duplicates with picard MarkDuplicates.
Re-align in "active" regions and calculate variant likelihood with GATK HaplotypeCaller.

Then merge resulting GVCFs using GATK CombineGVCFs. With the combined variant calls:

Jointly genotype with GATK GenotypeGVCFs.
Filter variants using GATK VariantFiltration.
Annotate variant effects using snpEff.
Filter out variants where all samples are identical to the wildtype control, which is assumed to be the sample whose sample_id is alphabetically last.
Write to output TSV.
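
The wildtype filtering step can be illustrated with a minimal sketch. The table layout here (chrom, pos, then one genotype column per sample) and the function name drop_wildtype_like_variants are hypothetical, chosen for illustration; they are not the pipeline's actual intermediate format or code.

```shell
# Sketch only: the table layout (chrom, pos, then one genotype column per
# sample) is a hypothetical example, not the pipeline's actual output format.
drop_wildtype_like_variants() {
    # $1: sample_id of the wildtype column. Reads a tab-separated table on stdin.
    awk -F'\t' -v wt="$1" '
        NR == 1 {
            # Locate the wildtype column in the header.
            for (i = 3; i <= NF; i++) if ($i == wt) wt_col = i
            if (!wt_col) {
                print "wildtype column not found: " wt > "/dev/stderr"
                exit 1
            }
            print
            next
        }
        {
            # Keep the variant if any sample differs from the wildtype genotype.
            for (i = 3; i <= NF; i++) {
                if (i != wt_col && $i != $wt_col) { print; next }
            }
        }
    '
}
```

For example, piping a table with columns chrom, pos, mut1 and wt through drop_wildtype_like_variants wt keeps the header and only those rows where mut1 differs from wt.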

Other steps

Get FASTQ quality metrics with fastqc.
Generate alignment statistics with samtools stats.
Also map to the genome FASTA using bowtie2, because minimap2 logs are not compatible with multiqc. This way, at least some alignment metrics can be included in the report.
Compile the logs of processing steps into an HTML report with multiqc.

Requirements

Software

Nextflow
- At Crick, activate using module load Nextflow
- Otherwise, see below
conda or mamba
- If possible, use mamba because it will be faster.
- At Crick, activate using module load Anaconda3
GATK
- Download here, and provide the path as --gatk_path or in the nextflow.config file (see below)
Picard
- Download here, and provide the path as --picard_path or in the nextflow.config file (see below)
snpEff
- Download here, and provide the path as --snpeff_path or in the nextflow.config file (see below)

Reference genome

You also need the genome FASTA and GFF annotations for the bacteria you are sequencing. These can be obtained from NCBI Nucleotide:

Search for your strain of interest, and open its main page
On the right-hand side, click Customize view, then Customize and check Show sequence. Finally, click Update view. You may have to wait a few minutes while the sequence downloads.
Click Send to: > Complete record > File > FASTA > Create file
Save the files to a path which you provide as --genome_fasta below.

First time using Nextflow?

If it's your first time using Nextflow on your system, you can install it using conda:

conda install -c bioconda nextflow

You may need to set the NXF_HOME environment variable. For example,

mkdir -p ~/.nextflow
export NXF_HOME=~/.nextflow

To make this a permanent change, you can do something like the following:

mkdir -p ~/.nextflow
echo "export NXF_HOME=~/.nextflow" >> ~/.bash_profile
source ~/.bash_profile

Quick start

Make sure you have GATK, Picard, and snpEff on your system, and provide their paths as parameters on the command line or in your nextflow.config file.

Make a sample sheet (see below) and, optionally, a nextflow.config file in the directory where you want the pipeline to run. Then run Nextflow.

nextflow run scbirlab/nf-ont-call-variants

Each time you run the pipeline after the first time, Nextflow will use a locally-cached version which will not be automatically updated. If you want to ensure that you're using the very latest version of the pipeline, use the -latest flag.

nextflow run scbirlab/nf-ont-call-variants -latest

If you want to run a particular tagged version of the pipeline, such as v0.0.1, you can do so using

nextflow run scbirlab/nf-ont-call-variants -r v0.0.1

For help, use nextflow run scbirlab/nf-ont-call-variants --help.

The first time you run the pipeline on your system, the software dependencies in environment.yml will be installed. This may take several minutes.

Inputs

The following parameters are required:

sample_sheet: path to a CSV with information about the samples and FASTQ files to be processed
gatk_path: path to GATK executable
picard_path: path to Picard executable
snpeff_path: path to snpEff executable
genome_fasta: path to reference genome FASTA
snpeff_database: name of the snpEff database to use for annotation. This should be derived from the same assembly as genome_fasta. You can get a list of databases using java -jar snpEff.jar databases. Database names often end in the assembly name, such as gca_000015005, which you can check against your genome_fasta.
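
Checking that a database name ends in the expected assembly name can be scripted. The sketch below assumes you already know the assembly accession of your genome; the function name database_matches_assembly is hypothetical and not part of the pipeline.

```shell
database_matches_assembly() {
    # $1: snpEff database name
    # $2: assembly accession expected at the end of the name, e.g. gca_000015005
    # Compare in lowercase so that GCA_000015005 and gca_000015005 both match.
    db_name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    assembly=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    case "$db_name" in
        *"$assembly") return 0 ;;
        *) return 1 ;;
    esac
}
```

For instance, database_matches_assembly Mycolicibacterium_smegmatis_mc2_155_gca_000015005 GCA_000015005 succeeds, while a database name ending in a different accession fails.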

The following parameters have default values which can be overridden if necessary.

trim_qual = 10 : For cutadapt, the minimum Phred score. Base calls below this score are trimmed from the 3' end of reads
min_length = 10 : For cutadapt, the minimum trimmed length of a read. Shorter reads will be discarded

The parameters can be provided either in the nextflow.config file or on the nextflow run command.

Here is an example of the nextflow.config file:

params {
   
    gatk_path = "/path/to/gatk"
    picard_path = "/path/to/picard.jar"
    snpeff_path = "/path/to/snpEff.jar"

    sample_sheet = "/path/to/sample-sheet.csv"
    
    genome_fasta = "/path/to/MsmMC2155-CP000480.1.fasta"
    snpeff_database = "Mycolicibacterium_smegmatis_mc2_155_gca_000015005"

}

Alternatively, you can provide the parameters on the command line:

nextflow run scbirlab/nf-ont-call-variants \
    --sample_sheet /path/to/sample-sheet.csv \
    --gatk_path /path/to/gatk \
    --picard_path /path/to/picard.jar \
    --snpeff_path /path/to/snpEff.jar \
    --genome_fasta /path/to/MsmMC2155-CP000480.1.fasta \
    --snpeff_database Mycolicibacterium_smegmatis_mc2_155_gca_000015005
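
Before launching, it can help to confirm that the file paths you pass actually exist. This is a minimal sketch; the function name report_missing_files is hypothetical and not part of the pipeline.

```shell
report_missing_files() {
    # Prints one line for each argument that is not an existing file.
    # Returns 1 if any file is missing, 0 otherwise.
    rc=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "Missing: $f"
            rc=1
        fi
    done
    return "$rc"
}
```

You could call it with the same paths you give to nextflow run, for example report_missing_files /path/to/sample-sheet.csv /path/to/gatk /path/to/picard.jar /path/to/snpEff.jar, and only start the pipeline if it succeeds.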

Sample sheet

The sample sheet is a CSV file providing information about which FASTQ files belong to which sample.

The file must have a header with the column names below, and one line per sample to be processed.

sample_id: the unique name of the sample. The wildtype must be named so that it is alphabetically last
reads: path to compressed FASTQ files derived from Nanopore sequencing

Here is an example of the sample sheet:

sample_id,reads
wt,/path/to/reads/WT/raw_reads.fastq.gz
mut1,/path/to/reads/mut1/raw_reads.fastq.gz
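
To see which sample would be treated as the wildtype, you can check which sample_id sorts last. The sketch below assumes a comma-separated sheet whose header is exactly sample_id,reads, and it uses byte-wise ordering, in which uppercase letters sort before lowercase ones. The pipeline's own ordering rule may differ, and the function name wildtype_sample_id is hypothetical.

```shell
wildtype_sample_id() {
    # $1: path to the sample sheet CSV.
    # Assumption: the header line is exactly "sample_id,reads".
    sheet_header=$(head -n 1 "$1" | tr -d '\r')
    if [ "$sheet_header" != "sample_id,reads" ]; then
        echo "Unexpected header: $sheet_header" >&2
        return 1
    fi
    # Take the first column of each data row and pick the last ID in
    # byte-wise order (uppercase letters sort before lowercase ones).
    awk -F, 'NR > 1 && $1 != "" { print $1 }' "$1" | LC_ALL=C sort | tail -n 1
}
```

For a sheet containing the samples wt and mut1, this prints wt.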

Outputs

Outputs are saved in the same directory as sample_sheet. They are organised under three directories:

processed: FASTQ files and logs resulting from alignments
tables: tables and VCF files corresponding to variant calls
multiqc: HTML report on processing steps
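
Because outputs are written next to the sample sheet, the expected output locations can be derived from its path. A minimal sketch, with the hypothetical function name expected_output_dirs:

```shell
expected_output_dirs() {
    # $1: path to the sample sheet.
    # Prints the three output directories described above, one per line.
    sheet_dir=$(dirname "$1")
    for d in processed tables multiqc; do
        printf '%s/%s\n' "$sheet_dir" "$d"
    done
}
```

For a sample sheet at /path/to/sample-sheet.csv, this lists /path/to/processed, /path/to/tables and /path/to/multiqc.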

Issues, problems, suggestions

Add to the issue tracker.

Further help

Here are the help pages of the software used by this pipeline.
