Marker alignments in Nextflow
Introduction
CORRAL is a Nextflow pipeline wrapping a Python module, marker_alignments, and combining it with fetch and align steps, to provide a workflow for estimating which taxa are present in a sample.
Our article about CORRAL, "Improved eukaryotic detection compatible with large-scale automated analysis of metagenomes", has now been published in Microbiome: https://doi.org/10.1186/s40168-023-01505-1.
Installation
This workflow is not containerised, but the dependencies are quite minimal:
- bowtie2
- Marker alignments package and its tool, marker_alignments.
Additionally, samtools stats is the default and recommended tool for alignment stats.
By default, bowtie2, marker_alignments and samtools are assumed to be on $PATH, but you can provide a path to an executable in the pipeline config.
If you want to use --downloadMethod wget you also need wget. If you want to use --downloadMethod sra you need the SRA Toolkit, with prefetch and fastq-dump on $PATH.
--unpackMethod bz2 requires bzip2 on $PATH.
You also need a bowtie2 reference database of taxonomic markers, like ChocoPhlAn or EukDetect.
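A quick way to confirm the requirements above before a run is to check each tool with command -v. The following is a minimal sketch and not part of the pipeline: the helper name check_tools is made up for illustration, and it only looks at $PATH, so it will not see executables that you configured by path in the pipeline config.

```shell
#!/usr/bin/env bash
# Sketch: report which of the given tools are available on $PATH.
# check_tools is a hypothetical helper, not something CORRAL provides.

check_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=$((missing + 1))
    fi
  done
  # Succeed only when every tool was found.
  [ "$missing" -eq 0 ]
}

# Core tools that the pipeline assumes to be on $PATH by default.
check_tools bowtie2 marker_alignments samtools \
  || echo "Install the missing tools, or set their paths in the pipeline config."
```

Depending on your options you could pass more names: wget for --downloadMethod wget, prefetch and fastq-dump for --downloadMethod sra, and bzip2 for --unpackMethod bz2.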
Docker support
This version of the pipeline accompanies the original CORRAL paper, contains all code needed to reproduce the results, and will remain supported by the authors. Since June 2022, MicrobiomeDB data production has been operated by the whole VEuPathDB group, which undertook further development of CORRAL under a fork that is also freely available here under the same license.
See their Dockerfile and their nextflow.config for how they modified this Nextflow pipeline to enable Docker support.
Usage
Summary of input params
Main parameters:
param | value type | description |
---|---|---|
inputPath | path to file | TSV: sample ID, fastq URL or run ID, [second URL for paired reads] |
downloadMethod | "wget" / "sra" / "local" | |
libraryLayout | "single" / "paired" | |
resultDir | path to dir | publish directory |
refdb | path pattern | bowtie2 -x parameter |
bowtie2Command | shell | command used to run bowtie2 |
alignmentStatsCommand | shell | samtools stats by default. Set to 'none' to switch off |
summarizeAlignmentsCommand | shell | path to marker_alignments optionally with filter arguments to use |
Optional parameters:
param | value type | description |
---|---|---|
markerToTaxonPath | path to file | summarize_marker_alignments --refdb-marker-to-taxon-path parameter |
unpackMethod | "bz2" | for FTP .tar.bz2 content |
How to use this software
This is research software. You can use it as is to check that you get results similar to ours, and you can also build upon it.
The Python module marker_alignments does almost all the tricks, but it requires alignments as input. Meanwhile, this pipeline helps you orchestrate the process of downloading fastqs and running the alignments, so you can do eukaryotic detection at scale.
Reference databases
You will need to provide --refdb and --markerToTaxonPath so that they correspond to your chosen reference. For the publication, we used EukDetect's databases: see their documentation for how to download them.
Execution environment
We ran this pipeline locally on an Ubuntu laptop, and on our LSF cluster, adding a cluster.conf file that made sense for our run. You might need to adjust the Nextflow commands to make them suitable for your execution environment.
Experimenting with the method
If you want to experiment with the method, for example to see what happens if you do not filter at all, you can override the default summarizeAlignmentsCommand parameter and provide a different --resultDir. Nextflow is able to reuse previously done steps, so changing summarizeAlignmentsCommand will not re-run the download or alignment steps. A full example follows.
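As a rough illustration of such an experiment (our own sketch, not the authors' example), the helper below prints the command for a second run that changes only summarizeAlignmentsCommand and resultDir. The name build_rerun_command is hypothetical, and we assume that passing the bare marker_alignments tool with no filter arguments corresponds to "no filtering"; check the pipeline code before relying on that. The -resume flag is what lets Nextflow reuse cached steps.

```shell
#!/usr/bin/env bash
# Sketch: print (not execute) the nextflow command for a re-run that only
# changes the summarize step. build_rerun_command is a hypothetical helper.

build_rerun_command() {
  summarize_cmd="$1"   # assumption: the bare tool name means no filter arguments
  result_dir="$2"      # a new directory, so the earlier results are kept
  shift 2
  # The remaining arguments should be the same ones used in the first run.
  # Quoting here is naive: it is fine for simple values without quotes in them.
  echo "nextflow run wbazant/CORRAL -r main $* --summarizeAlignmentsCommand '$summarize_cmd' --resultDir '$result_dir' -resume"
}

build_rerun_command marker_alignments results_unfiltered \
  --inputPath in.tsv --downloadMethod wget --libraryLayout paired
```

Inspect the printed command, then paste it into your shell, adding the same --refdb, --markerToTaxonPath and config options that you used for the first run.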
Example - cluster run
run.sh
#!/usr/bin/env bash
# Directory containing this script, so the paths below work from any working directory.
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
# Use $HOME rather than "~": a tilde inside double quotes is not expanded.
REF_PATH="$HOME/eukprot"
nextflow pull wbazant/CORRAL -r main
nextflow run wbazant/CORRAL -r main \
  --inputPath "$DIR/in.tsv" \
  --resultDir "$DIR/results" \
  --downloadMethod wget \
  --unpackMethod bz2 \
  --libraryLayout paired \
  --refdb "${REF_PATH}/ncbi_eukprot_met_arch_markers.fna" \
  --markerToTaxonPath "${REF_PATH}/busco_taxid_link.txt" \
  -c "$DIR/cluster.conf" \
  -with-trace -resume | tee "$DIR/tee.out"
cluster.conf
process {
    executor = 'lsf'
    maxForks = 60
    withLabel: 'download' {
        maxForks = 5
        maxRetries = 3
    }
    withLabel: 'align' {
        errorStrategy = 'finish'
    }
}
If you want to run the pipeline locally, remove executor = 'lsf', and reduce the number of forks. Raising maxForks above what you can download in parallel, or to the number of cores of your CPU, will not speed things up, so for example maxForks = 3 will be a good value. Monitor the temperature of your machine and make sure it does not overheat: bowtie2 uses the CPU very intensively.
in.tsv
SRS011061 https://downloads.hmpdacc.org/dacc/hhs/genome/microbiome/wgs/analysis/hmwgsqc/v1/SRS011061.tar.bz2
SRS011086 https://downloads.hmpdacc.org/dacc/hhs/genome/microbiome/wgs/analysis/hmwgsqc/v1/SRS011086.tar.bz2
This is the correct content for --downloadMethod wget --unpackMethod bz2. For other input combinations, check the pipeline code: the input usually has two or three columns. The first column is a sample ID and the second is a path or URL; for --libraryLayout paired and no --unpackMethod, the second and third columns are pathForward then pathReverse.
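Since a wrong column count is an easy mistake to make in the input file, it may be worth checking it before a run. The following is a small sketch under our own assumptions: check_input_tsv is a hypothetical helper, not part of the pipeline, and it checks only that every row has the expected number of tab-separated columns, not that the IDs or URLs are valid.

```shell
#!/usr/bin/env bash
# Sketch: verify that every row of an input TSV has the expected number of
# tab-separated columns. check_input_tsv is a hypothetical helper.

check_input_tsv() {
  local tsv="$1"
  local expected="$2"   # 2: sample ID, path or URL; 3: sample ID, forward, reverse
  awk -F '\t' -v n="$expected" '
    NF != n { printf "line %d: expected %d columns, found %d\n", NR, n, NF; bad = 1 }
    END { exit bad }
  ' "$tsv"
}

# Usage: write a one-row example file and check it.
example="$(mktemp)"
printf 'SRS011061\thttps://downloads.hmpdacc.org/dacc/hhs/genome/microbiome/wgs/analysis/hmwgsqc/v1/SRS011061.tar.bz2\n' > "$example"
check_input_tsv "$example" 2 && echo "input file has the expected layout"
rm -f "$example"
```

The function exits with a non-zero status and prints the offending line numbers when a row does not match, so it can gate a run script with `check_input_tsv in.tsv 2 || exit 1`.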