nf-core/demultiplex is a bioinformatics pipeline used to demultiplex the raw data produced by next-generation sequencing machines. At present, only Illumina sequencing data is supported.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!
On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the nf-core website.
- Reformatting the input sample sheet
  - Searches for the `[Data]` tag.
  - Splits 10X single cell samples into 10X, 10X-ATAC and 10X-DNA `.csv` files by searching the sample sheet column `DataAnalysisType` for `10X-3prime`, `10X-ATAC` and `10X-CNV`.
  - Outputs which conditional processes in the pipeline need to run (only 10X single cell samples, a mix of 10X single cell and non-single cell samples, or all non-single cell samples).
- Checking the sample sheet for samples that would cause downstream errors, such as:
  - a mix of short and long indexes on the same lane
  - a mix of single and dual indexes on the same lane
- Processes that only run if the sample sheet check process finds issues within the sample sheet (CONDITIONAL):
  - Creates a new sample sheet with any samples that would cause an error removed, and creates a `.txt` file listing the removed problem samples.
  - Runs `bcl2fastq` on the newly created sample sheet and outputs the `Stats.json` file.
  - Parses the `Stats.json` file for the indexes that were in the problem samples list.
  - Rechecks the newly made sample sheet for any errors or problem samples that did not match any indexes in the `Stats.json` file. If there is still an issue, the pipeline will exit at this stage.
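The per-lane index checks above can be sketched in a few lines of shell. This is an illustration of the idea only, not the pipeline's actual check process; the sample sheet layout assumed here (`Lane` in column 1, `index` in column 3) is an assumption for the example.

```shell
# Flag lanes whose samples mix different index lengths
# (illustrative only; not the pipeline's real check).
cat > samplesheet.csv <<'EOF'
Lane,Sample_ID,index,index2
1,S1,TCGATGTG,CTCGATGA
1,S2,ACGTACGTAC,CTCGATGA
2,S3,TCGATGTG,CTCGATGA
EOF

awk -F',' 'NR > 1 {
  len = length($3)
  if (seen[$1] && seen[$1] != len)
    print "Lane " $1 ": mixed index lengths (" seen[$1] " vs " len ")"
  seen[$1] = len
}' samplesheet.csv
```

A real check would additionally detect mixed single and dual indexes by testing whether `index2` is empty for some samples on a lane but not others.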
- Single cell 10X sample processes (CONDITIONAL):
  - **NOTE**: A config must be created to point to the Cell Ranger genome references.
  - Cell Ranger mkfastq runs only when 10X samples exist. This will run the process with `Cell Ranger`, `Cell Ranger ATAC` or `Cell Ranger DNA`, depending on which sample sheet has been created.
  - Cell Ranger count runs only when 10X samples exist. This will run the process with `Cell Ranger Count`, `Cell Ranger ATAC Count` or `Cell Ranger DNA CNV`, depending on the output from Cell Ranger mkfastq. 10X reference genomes can be downloaded from the 10X site; a new config would have to be created to point to their location. A config pointing Cell Ranger to the genome references must be added if the pipeline is used outside the Crick profile.
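At the Crick this is handled by the institutional profile; elsewhere, a custom config passed with `-c` can supply the reference locations. The parameter names below are illustrative assumptions, not the pipeline's actual options, and the paths are placeholders:

```nextflow
// Hypothetical custom.config: parameter names and paths here are
// illustrative placeholders, not the pipeline's real options.
params {
    cellranger_reference      = '/refs/refdata-gex-GRCh38-2020-A'
    cellranger_atac_reference = '/refs/refdata-cellranger-atac-GRCh38-1.2.0'
}
```

Such a file would then be supplied on the command line with `nextflow run nf-core/demultiplex -c custom.config ...`.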
- `bcl2fastq` (CONDITIONAL):
  - Runs on either the original sample sheet that had no error-prone samples, or on the newly created sample sheet produced by the extra steps above.
  - This is only run when there are samples left on the sample sheet after removing the single cell samples.
  - The arguments passed to `bcl2fastq` are changeable parameters that can be set on the command line when initiating the pipeline, including whether index reads will be made into FastQ files as well.
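For reference, a standalone `bcl2fastq` invocation using its standard flags looks like the sketch below. The paths are placeholders, and the exact arguments the pipeline passes depend on the parameters you set.

```shell
# Build an illustrative bcl2fastq command (placeholder paths).
# --create-fastq-for-index-reads is the standard flag that also
# turns index reads into FastQ files.
RUN_DIR=/path/to/run_folder
OUT_DIR=/path/to/output

CMD="bcl2fastq --runfolder-dir $RUN_DIR --output-dir $OUT_DIR --sample-sheet $RUN_DIR/SampleSheet.csv --create-fastq-for-index-reads"
echo "$CMD"
```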
- `FastQC` runs on the pooled FastQ files from all the conditional processes.
- `FastQ Screen` runs on the pooled results from all the conditional processes. You must supply your own FastQ Screen config to point it to.
- `MultiQC` runs on each project's FastQC results.
- `MultiQC_all` runs on all FastQC results produced.
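A minimal `fastq_screen.conf` uses the standard FastQ Screen format of tab-separated `DATABASE` lines; the paths below are placeholders for your own aligner and index locations.

```
# Path to the aligner (placeholder)
BOWTIE2 /usr/local/bin/bowtie2

# DATABASE <name> <Bowtie2 index prefix> (placeholder paths)
DATABASE	Human	/data/FastQ_Screen_Genomes/Human/Homo_sapiens.GRCh38
DATABASE	Mouse	/data/FastQ_Screen_Genomes/Mouse/Mus_musculus.GRCm38
```

FastQ Screen is pointed at a config file like this with its `--conf` option.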
The input sample sheet must adhere to Illumina standards as outlined in the table below. Additional columns for `DataAnalysisType` and `ReferenceGenome` are required for the correct processing of 10X samples. The order of the columns does not matter, but the case of the column names does.
| Lane | Sample_ID | index    | index2   | Sample_Project | ReferenceGenome | DataAnalysisType |
| ---- | --------- | -------- | -------- | -------------- | --------------- | ---------------- |
| 1    | ABC11A2   | TCGATGTG | CTCGATGA | PM10000        | Homo sapiens    | Whole Exome      |
| 2    | SAG100A10 | SI-GA-C1 |          | SC18100        | Mus musculus    | 10X-3prime       |
| 3    | CAP200A11 | CTCGATGA |          | PM18200        | Homo sapiens    | Other            |
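Because the column-name case matters, a quick case-sensitive header check like the sketch below can catch a malformed sample sheet before launching the pipeline (an illustrative check, not part of the pipeline):

```shell
# Write an example sample sheet, then verify the required 10X
# columns are present with the exact expected case.
cat > samplesheet.csv <<'EOF'
Lane,Sample_ID,index,index2,Sample_Project,ReferenceGenome,DataAnalysisType
1,ABC11A2,TCGATGTG,CTCGATGA,PM10000,Homo sapiens,Whole Exome
EOF

header=$(head -n 1 samplesheet.csv)
for col in DataAnalysisType ReferenceGenome; do
  case ",$header," in
    *",$col,"*) echo "OK: $col" ;;
    *)          echo "MISSING: $col" ;;
  esac
done
```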
1. Install `Nextflow` (`>=21.10.3`).

2. Install any of `Docker`, `Singularity` (you can follow this tutorial), `Podman`, `Shifter` or `Charliecloud` for full pipeline reproducibility (you can use `Conda` both to install Nextflow itself and also to manage software within pipelines; please only use it within pipelines as a last resort — see docs).

3. Download the pipeline and test it on a minimal dataset with a single command:

   ```bash
   nextflow run nf-core/demultiplex -profile test,YOURPROFILE --outdir <OUTDIR>
   ```
   Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile (`YOURPROFILE` in the example command above). You can chain multiple config profiles in a comma-separated string.

   - The pipeline comes with config profiles called `docker`, `singularity`, `podman`, `shifter`, `charliecloud` and `conda` which instruct the pipeline to use the named tool for software management. For example, `-profile test,docker`.
   - Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use `-profile <institute>` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
   - If you are using `singularity`, please use the `nf-core download` command to download images first, before running the pipeline. Setting the `NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir` Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
   - If you are using `conda`, it is highly recommended to use the `NXF_CONDA_CACHEDIR` or `conda.cacheDir` settings to store the environments in a central location for future pipeline runs.
4. Start running your own analysis!

   ```bash
   nextflow run nf-core/demultiplex --input samplesheet.csv --outdir <OUTDIR> --genome GRCh37 -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
   ```
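If you use the Singularity or Conda cache settings mentioned above, the environment variables only need to be exported once per environment (the paths below are placeholders):

```shell
# Placeholder paths; point these at shared storage on your system
# so container images and conda envs are re-used across runs.
export NXF_SINGULARITY_CACHEDIR=/shared/cache/singularity
export NXF_CONDA_CACHEDIR=/shared/cache/conda
```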
The nf-core/demultiplex pipeline comes with documentation about the pipeline usage, parameters and output.
The nf-core/demultiplex pipeline was written by Chelsea Sawyer from The Bioinformatics & Biostatistics Group for use at The Francis Crick Institute, London.
Many thanks to others who have helped out along the way too, including (but not limited to): @ChristopherBarrington, @drpatelh, @danielecook, @escudem, @crickbabs.
If you would like to contribute to this pipeline, please see the contributing guidelines.
For further information or help, don't hesitate to get in touch on the Slack `#demultiplex` channel (you can join with this invite).
An extensive list of references for the tools used by the pipeline can be found in the `CITATIONS.md` file.
You can cite the `nf-core` publication as follows:
> The nf-core framework for community-curated bioinformatics pipelines.
>
> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
>
> Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.