Pipeline for RNA sequencing best-practice analysis at the NGI at SciLifeLab Stockholm, Sweden
Written by Phil Ewels (@ewels) and Rickard Hammarén (@Hammarn)
To use this pipeline, you need to have a working version of Nextflow installed. You can find more information about this workflow tool at nextflow.io. A typical Nextflow installation looks like this:
curl -fsSL get.nextflow.io | bash
mv ./nextflow ~/bin
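The mv step above only helps if ~/bin exists and is on your PATH. A minimal sketch for checking this, assuming a POSIX-compatible shell; path_contains is a hypothetical helper name used for illustration and is not part of Nextflow:

```shell
# Hypothetical helper: succeed if the given directory is listed in $PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

bin_dir="$HOME/bin"
mkdir -p "$bin_dir"   # make sure the target of the mv step exists

if path_contains "$bin_dir"; then
    echo "$bin_dir is on your PATH"
else
    echo "$bin_dir is not on your PATH; consider adding it in your shell profile"
fi
```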
Note that if you're running on the Swedish UPPMAX cluster (Milou) you can load Nextflow as an environment module:
module load nextflow
Next, you need to set up a config file so that Nextflow knows how to run and where to find reference indexes. You can find an example configuration file for UPPMAX (Milou) with this repository: example_uppmax_config.
Copy this file to ~/.nextflow/config and edit the line '-A b2013064' to contain your own UPPMAX project identifier instead.
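The copy-and-edit step can also be scripted. The following is only a sketch: it assumes the example file contains the literal text -A b2013064, and both the helper name install_uppmax_config and the project identifier b2099999 are placeholders invented for illustration.

```shell
# Hypothetical helper: copy the example config and swap in your own project ID.
# Arguments: <example config> <your project ID> [destination]
install_uppmax_config() {
    src="$1"
    project="$2"
    dest="${3:-$HOME/.nextflow/config}"
    mkdir -p "$(dirname "$dest")"
    # Replace the example project identifier on every line where it appears
    sed "s/-A b2013064/-A ${project}/" "$src" > "$dest"
}

# Example call, run from the repository root (b2099999 is a placeholder):
# install_uppmax_config example_uppmax_config b2099999
```

Note that this overwrites any existing file at the destination, so back up an existing ~/.nextflow/config first if you have one.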
It is entirely possible to run this pipeline on other clusters - just note that you may need to customise the process environment (e.g. if you're using a cluster system other than SLURM) and the paths to reference files.
This pipeline itself needs no installation - Nextflow will automatically fetch it from GitHub when run if SciLifeLab/NGI-RNAseq is specified as the pipeline name.
If you prefer, you can download the files yourself from GitHub and run them directly:
git clone https://github.com/SciLifeLab/NGI-RNAseq.git
nextflow run NGI-RNAseq/main.nf
The typical command for running the pipeline is as follows:
nextflow run SciLifeLab/NGI-RNAseq --reads '*_R{1,2}.fastq.gz' --genome 'GRCm38'
or, using a more manual approach (this requires you to clone the git repository):
nextflow run path_to_NGI-RNAseq/main.nf -c path_to_your_nextflow_config --reads '*_R{1,2}.fastq.gz' --genome 'GRCm38'
Note that the pipeline will create files in your working directory:
work # Directory containing the nextflow working files
results # Finished results for each sample, one directory per pipeline step
.nextflow.log # Log file from Nextflow
# Other nextflow hidden files, eg. history of pipeline runs and old logs.
Location of the input FastQ files:
--reads 'path/to/data/sample_*_{1,2}.fastq'
NB: Must be enclosed in quotes!
Note that the {1,2} braces are required to specify paired end data. Running --reads '*.fastq' will treat all files as single end. The file path should be in quotation marks to prevent shell glob expansion.
If left unspecified, the pipeline will assume that the data is in a directory called data in the working directory.
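To see why the quotes matter, the sketch below counts how many arguments a command receives with and without them. It uses two empty placeholder files in a temporary directory; count_args is a hypothetical helper standing in for the pipeline command.

```shell
# Hypothetical stand-in for a command: print how many arguments it received.
count_args() { echo "$#"; }

# Create two empty placeholder files in a scratch directory
demo_dir=$(mktemp -d)
touch "$demo_dir/sample_1.fastq" "$demo_dir/sample_2.fastq"

# Unquoted: the shell expands the pattern into one argument per matching file
unquoted=$(cd "$demo_dir" && count_args --reads *.fastq)
# Quoted: the pattern is passed through untouched as a single argument
quoted=$(cd "$demo_dir" && count_args --reads '*.fastq')

echo "unquoted: $unquoted arguments, quoted: $quoted arguments"
# prints "unquoted: 3 arguments, quoted: 2 arguments"
```

With the quotes in place, the pattern reaches the pipeline as one string, so the pipeline rather than the shell decides how to match and pair the files.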
The reference genome to use for the analysis. This needs to be one of the genomes specified in the config file. The human GRCh37 genome is set as default.
--genome 'GRCm38'
The example_uppmax_config file currently has the location of references for GRCh37 (Human), GRCm38 (Mouse) and sacCer2 (Yeast).
Used to turn off the edgeR MDS and heatmap, which require at least three samples to work. I.e. use this when running on only one or two samples.
Some RSeQC jobs need to know the stranded nature of the library. By default, the pipeline will use ++,-- for single end libraries and 1+-,1-+,2++,2-- for paired end libraries. These codes are for strand specific libraries (antisense). 1+-,1-+,2++,2-- decodes as:
- Reads 1 mapped to + => parental gene on -
- Reads 1 mapped to - => parental gene on +
- Reads 2 mapped to + => parental gene on +
- Reads 2 mapped to - => parental gene on -
Use this parameter to override these defaults. For example, if your data is paired end and strand specific, but same-sense to the reference, you could run:
nextflow run NGI-RNAseq/main.nf --strandRule '1++,1--,2+-,2-+'
Use --strandRule 'none' if your data is not strand specific.
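In RSeQC's notation, each comma-separated token in a strand rule reads as: read number (paired end only), the strand the read maps to, then the strand of the parental gene. The sketch below spells a rule out in words; decode_strand_rule is a hypothetical helper written for illustration and is not part of the pipeline or of RSeQC.

```shell
# Hypothetical helper: print a strand rule such as '1++,1--,2+-,2-+' in words.
decode_strand_rule() {
    old_ifs=$IFS
    IFS=','                      # split the rule on commas
    for token in $1; do
        case "$token" in
            [12][+-][+-])        # paired end token: read number, read strand, gene strand
                read_no=${token%??}
                strands=${token#?}
                echo "Read $read_no mapped to ${strands%?} => parental gene on ${strands#?}" ;;
            [+-][+-])            # single end token: read strand, gene strand
                echo "Read mapped to ${token%?} => parental gene on ${token#?}" ;;
            *)
                echo "Unrecognised token: $token" >&2
                IFS=$old_ifs
                return 1 ;;
        esac
    done
    IFS=$old_ifs
}

decode_strand_rule '1++,1--,2+-,2-+'
# Read 1 mapped to + => parental gene on +
# Read 1 mapped to - => parental gene on -
# Read 2 mapped to + => parental gene on -
# Read 2 mapped to - => parental gene on +
```

The helper only handles the two-sign and three-character token forms shown in this document; anything else, including 'none', is reported as unrecognised.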
Specify the path to a specific config file (this is a core Nextflow option). Useful if using different UPPMAX projects or different sets of reference genomes.