This guide attempts to walk the user through running this pipeline from start to finish. If there are any questions please contact John Vivian (jtvivian@gmail.com). If you find errors or have corrections, please feel free to make a pull request. Feedback of any kind is appreciated.
RNA-seq fastqs are combined, aligned, and quantified with 2 different methods (RSEM and Kallisto).
This pipeline produces a tarball (tar.gz) file for a given sample with 3 main subdirectories: Kallisto, RSEM, and QC.
If the pipeline is run with all possible options (`fastqc`, `bamqc`, etc), the output tar will have the following structure (once uncompressed), where SAMPLE is the unique name of the sample:
SAMPLE
├── Kallisto
│   ├── abundance.h5
│   ├── abundance.tsv
│   └── run_info.json
├── QC
│   ├── bamQC
│   │   ├── readDist.txt
│   │   ├── readDist.txt_PASS_qc.txt
│   │   ├── rnaAligned.out.md.sorted.geneBodyCoverage.curves.pdf
│   │   └── rnaAligned.out.md.sorted.geneBodyCoverage.txt
│   ├── fastQC
│   │   ├── R1_fastqc.html
│   │   ├── R1_fastqc.zip
│   │   ├── R2_fastqc.html
│   │   └── R2_fastqc.zip
│   └── STAR
│       ├── Log.final.out
│       └── SJ.out.tab
└── RSEM
    ├── Hugo
    │   ├── rsem_genes.hugo.results
    │   └── rsem_isoforms.hugo.results
    ├── rsem_genes.results
    └── rsem_isoforms.results
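After a run, it can be handy to confirm that a tarball contains the files listed in the tree. The following is a minimal sketch using only Python's standard library. The helper name `missing_outputs` and the choice of which files to check are illustrative assumptions, as is the assumption that archive members are rooted at `SAMPLE/`.

```python
import tarfile

# A subset of the files shown in the tree above. The bamQC and fastQC
# entries are left out because those steps are optional.
EXPECTED = [
    "Kallisto/abundance.h5",
    "Kallisto/abundance.tsv",
    "Kallisto/run_info.json",
    "QC/STAR/Log.final.out",
    "QC/STAR/SJ.out.tab",
    "RSEM/rsem_genes.results",
    "RSEM/rsem_isoforms.results",
    "RSEM/Hugo/rsem_genes.hugo.results",
    "RSEM/Hugo/rsem_isoforms.hugo.results",
]


def missing_outputs(tarball_path, sample):
    """Return the expected members that are absent from SAMPLE.tar.gz.

    Assumes (hypothetically) that members are stored as SAMPLE/<path>.
    """
    with tarfile.open(tarball_path, "r:gz") as tar:
        # Normalise names such as './SAMPLE/...' to 'SAMPLE/...'.
        names = set(
            name[2:] if name.startswith("./") else name
            for name in tar.getnames()
        )
    return [p for p in EXPECTED if "{}/{}".format(sample, p) not in names]
```

Calling `missing_outputs("SAMPLE.tar.gz", "SAMPLE")` returns an empty list when every listed file is present.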
If the user selects options such as `save-bam` or `wiggle`, additional files will appear in the output directory:
- SAMPLE.sorted.bam, or SAMPLE.sortedByCoord.md.bam if the `bamQC` step is enabled.
- SAMPLE.wiggle.bg
The output tarball is named after the unique name of the sample (e.g. SAMPLE.tar.gz).
This pipeline has been tested on Ubuntu 14.04, Ubuntu 16.04, and Mac OS X, but should also run on other Unix-based systems.
`apt-get` and `pip` often require `sudo` privilege, so if the below commands fail, try prepending `sudo`.
If you do not have `sudo` privileges you will need to build these tools from source, or bug a sysadmin about how to get them (they don't mind).
1. Python 2.7
2. Curl: `apt-get install curl`
3. Docker: http://docs.docker.com/engine/installation/

1. Toil: `pip install toil`
2. S3AM: `pip install --pre s3am` (optional, needed for uploading output to S3)
This pipeline needs approximately 50G of RAM in order to run STAR alignment.
The CGL RNA-seq pipeline is now pip installable!
If there is an existing, system-wide installation of Toil, as is the case when using CGCloud, the `pip install toil` step should be skipped and virtualenv should be invoked with `--system-site-packages`.
This way the existing Toil installation will be available inside the virtualenv.
To decrease the chance of versioning conflicts, install toil-rnaseq into a virtualenv:
virtualenv ~/toil-rnaseq
source ~/toil-rnaseq/bin/activate
pip install toil
pip install toil-rnaseq
After installation, the pipeline can be executed by typing `toil-rnaseq` into the terminal.
The CGL RNA-seq pipeline requires input files in order to run. These files are hosted on Synapse and can be downloaded after creating an account which takes about 1 minute and is free.
- Register for a Synapse account
- Either download the samples from the website GUI or use the Python API
pip install synapseclient
python
import synapseclient
syn = synapseclient.Synapse()
syn.login('foo@bar.com', 'password')
- Get the RSEM reference (1 G)
syn.get('syn5889216', downloadLocation='.')
- Get the Kallisto index (2 G)
syn.get('syn5886142', downloadLocation='.')
- Get the STAR index (25 G)
syn.get('syn5886182', downloadLocation='.')
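The three downloads can also be scripted in one go. The sketch below is only an illustration: `fetch_references` is a hypothetical helper, it assumes `syn` is an already logged-in `synapseclient.Synapse` object, and it relies on nothing beyond the `syn.get(..., downloadLocation=...)` call shown above. Keep in mind the STAR index is about 25 G.

```python
# Synapse IDs as listed above, keyed by the matching config entry.
REFERENCES = {
    "rsem-ref": "syn5889216",        # about 1 G
    "kallisto-index": "syn5886142",  # about 2 G
    "star-index": "syn5886182",      # about 25 G
}


def fetch_references(syn, dest="."):
    """Download every reference into `dest` and return {name: entity}.

    `syn` is assumed to be a logged-in synapseclient.Synapse instance.
    """
    fetched = {}
    for name, syn_id in sorted(REFERENCES.items()):
        fetched[name] = syn.get(syn_id, downloadLocation=dest)
    return fetched
```

For example, after `syn.login(...)`, `fetch_references(syn, dest='.')` downloads all three files to the current directory.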
Sample tarballs containing fastq pairs can be passed via the command line option `--samples`.
Alternatively, many samples can be placed in a manifest file created by using the `toil-rnaseq --generate-manifest` option.
All samples and inputs must be submitted as URLs. The following schemes are supported: `http://`, `file://`, `s3://`, `ftp://`.
Samples consisting of tarballs with fastq files inside must follow the file name convention of ending in R1/R2 or _1/_2 followed by `.fastq.gz`, `.fastq`, `.fq.gz`, or `.fq`.
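Before submitting a large batch, it may help to check sample URLs and fastq names against these rules. This is a rough sketch rather than the pipeline's own validation: the function names are hypothetical, the regular expression is one reading of the naming convention, and the scheme set contains only the schemes listed above.

```python
import re

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7

# Only the schemes named in this guide; extend if your setup accepts more.
SUPPORTED_SCHEMES = {"http", "file", "s3", "ftp"}

# R1/R2 or _1/_2 immediately before .fastq.gz, .fastq, .fq.gz or .fq
FASTQ_NAME = re.compile(r"(R[12]|_[12])\.(fastq|fq)(\.gz)?$")


def is_supported_url(url):
    """True if the URL uses one of the schemes listed above."""
    return urlparse(url).scheme in SUPPORTED_SCHEMES


def is_valid_fastq_name(name):
    """True if the file name ends with the expected read suffix."""
    return FASTQ_NAME.search(name) is not None
```

For instance, `is_valid_fastq_name("sample_R1.fastq.gz")` is true, while a name without a read suffix such as `sample.fastq.gz` is rejected.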
Type `toil-rnaseq` to get the basic help menu and instructions.
- Type `toil-rnaseq generate` to create an editable manifest and config in the current working directory.
- Parameterize the pipeline by editing the config.
- Fill in the manifest with information pertaining to your samples.
- Type `toil-rnaseq run [jobStore]` to execute the pipeline.
Run sample(s) locally using the manifest
toil-rnaseq generate
- Fill in config and manifest
toil-rnaseq run ./example-jobstore
Toil options can be appended to `toil-rnaseq run`, for example:
toil-rnaseq run ./example-jobstore --retryCount=1 --workDir=/data
For a complete list of Toil options, just type `toil-rnaseq run -h`.
Run a variety of samples locally
toil-rnaseq generate-config
- Fill in config
toil-rnaseq run ./example-jobstore --retryCount=1 --workDir=/data --samples s3://example-bucket/sample_1.tar file:///full/path/to/sample_2.tar https://sample-depot.com/sample_3.tar
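When the pipeline is launched from a script, the same command can be assembled as an argument list instead of one long shell string. The sketch below is an assumption-laden illustration: `build_run_command` is a hypothetical helper, and it uses only the flags shown in the command above.

```python
def build_run_command(job_store, samples, retry_count=1, work_dir="/data"):
    """Return the argv list for `toil-rnaseq run` with inline samples."""
    if not samples:
        raise ValueError("at least one sample URL is required")
    cmd = [
        "toil-rnaseq", "run", job_store,
        "--retryCount={}".format(retry_count),
        "--workDir={}".format(work_dir),
        "--samples",
    ]
    # Each sample URL becomes its own argument, so no shell quoting is needed.
    cmd.extend(samples)
    return cmd
```

The resulting list can be handed to `subprocess.check_call`, which avoids line-continuation and quoting problems in long commands.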
star-index: s3://cgl-pipeline-inputs/rnaseq_cgl/ci/starIndex_chr6.tar.gz
kallisto-index: s3://cgl-pipeline-inputs/rnaseq_cgl/kallisto_hg38.idx
rsem-ref: s3://cgl-pipeline-inputs/rnaseq_cgl/ci/rsem_ref_chr6.tar.gz
output-dir: /data/my-toil-run
s3-dir: s3://my-bucket/test/rnaseq
ssec:
gt-key:
wiggle: true
save-bam: true
ci-test:
fwd-3pr-adapter: AGATCGGAAGAG
rev-3pr-adapter: AGATCGGAAGAG
Example with local input files
star-index: file:///data/starIndex_chr6.tar.gz
kallisto-index: file:///data/kallisto_hg38.idx
rsem-ref: file:///data/rsem_ref_chr6.tar.gz
output-dir: /data/my-toil-run
s3-dir: s3://my-bucket/test/rnaseq
ssec:
gt-key:
wiggle: true
save-bam: true
ci-test:
fwd-3pr-adapter: AGATCGGAAGAG
rev-3pr-adapter: AGATCGGAAGAG
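One detail worth checking for local inputs is the number of slashes in a `file://` URL: an absolute path needs three, because anything between the second and third slash is parsed as a host name. The sketch below shows how Python's standard `urlparse` treats such URLs. The helper `local_path_from_file_url` is hypothetical, and how the pipeline itself resolves these URLs is not covered here.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7


def local_path_from_file_url(url):
    """Return the absolute path encoded in a file:// URL, or raise."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        raise ValueError("not a file:// URL: {}".format(url))
    if parsed.netloc not in ("", "localhost"):
        # e.g. file://data/x parses 'data' as the host and '/x' as the path
        raise ValueError(
            "'{}' was parsed as a host name; use file:///absolute/path"
            .format(parsed.netloc)
        )
    return parsed.path
```

With three slashes, `local_path_from_file_url("file:///data/kallisto_hg38.idx")` returns `/data/kallisto_hg38.idx`. With two, the leading directory is read as a host and the function raises.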
To run on a distributed AWS cluster, see CGCloud for instance provisioning, then run `toil-rnaseq run aws:us-west-2:example-jobstore-bucket --batchSystem=mesos --mesosMaster mesos-master:5050` to use the AWS job store and Mesos batch system.
I have written an SOP for UCSC's Core Operations group that is available here.
Tool | Version | Description |
---|---|---|
FastQC | 0.11.5 | Obtains quality metrics on each FASTQ input file. |
CutAdapt | 1.9 | Adapter trimming and quality checking by enforcing that fastq samples are properly paired. |
STAR | 2.4.2a | Aligns fastq samples to the genome. Produces a transcriptome bam for RSEM, and can optionally generate a genome-aligned bam and BigWig files. |
RSEM | 1.2.25 | Performs quantification of RNA-seq data to produce count values for genes and isoforms. |
Kallisto | 0.42.4 | Performs quantification of RNA-seq data to produce counts for isoforms directly from fastq data. |
All tool containers can be found on our quay.io account.
HG38 (no alternative sequences) was downloaded from NCBI.
The PAR loci on the Y chromosome (chrY:10,000-2,781,479 and chrY:56,887,902-57,217,415), which duplicate sequences on the X chromosome, were removed. This was a requirement in order to run Kallisto.
These loci are not removed by the pipeline; they were removed manually. To get this manually modified reference genome, use the `s3cmd` tool with the `requester-pays` option and download: `s3://cgl-pipeline-inputs/rnaseq_cgl/hg38_no_alt.fa`.
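For readers who want to see what removing the two regions involves, here is a toy sketch. It is not the procedure used to build the hosted reference, which this guide does not describe. It assumes the coordinates are 1-based and inclusive, assumes the regions do not overlap, and reads "removed" as cutting the bases out (replacing them with N would be another possible reading).

```python
# Coordinates on chrY as quoted above; 1-based inclusive is an assumption.
PAR_REGIONS = [(10000, 2781479), (56887902, 57217415)]


def region_length(start, end):
    """Number of bases in a 1-based, inclusive interval."""
    return end - start + 1


def excise(sequence, regions):
    """Return `sequence` with the given non-overlapping regions cut out."""
    kept = []
    cursor = 0  # 0-based index of the next base to keep
    for start, end in sorted(regions):
        kept.append(sequence[cursor:start - 1])
        cursor = end
    kept.append(sequence[cursor:])
    return "".join(kept)
```

On a toy sequence, `excise("ABCDEFGHIJ", [(2, 3), (8, 9)])` gives `"ADEFGJ"`. Under the stated coordinate assumption, the two PAR regions together span 3,100,994 bases.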
Gencode v23 annotations were downloaded from Gencode. Comprehensive gene annotation (Regions=CHR) GTF was used to generate all additional reference input data.
STAR index was created using the reference genome and annotation file with the following Docker command:
sudo docker run -v $(pwd):/data quay.io/ucsc_cgl/star --runThreadN 32 --runMode genomeGenerate --genomeDir /data/genomeDir --genomeFastaFiles hg38.fa --sjdbGTFfile gencode.v23.annotation.gtf
RSEM reference was created using the reference genome and annotation file with the following Docker command:
sudo docker run -v $(pwd):/data --entrypoint=rsem-prepare-reference quay.io/ucsc_cgl/rsem -p 4 --gtf gencode.v23.annotation.gtf hg38.fa hg38
Kallisto index was created using the transcriptome and annotation file with the following Docker command:
sudo docker run -v $(pwd):/data quay.io/ucsc_cgl/kallisto index -i hg38.gencodeV23.transcripts.idx transcriptome_hg38_gencodev23.fasta
- FastQC is run with default options
- CutAdapt is run with default options
- Kallisto is run with `bootstraps` set to 100
STAR is run with the following parameters:
'--outFileNamePrefix', 'rna',
'--outSAMtype', 'BAM', 'SortedByCoordinate',
'--outSAMunmapped', 'Within',
'--quantMode', 'TranscriptomeSAM',
'--outSAMattributes', 'NH', 'HI', 'AS', 'NM', 'MD',
'--outFilterType', 'BySJout',
'--outFilterMultimapNmax', '20',
'--outFilterMismatchNmax', '999',
'--outFilterMismatchNoverReadLmax', '0.04',
'--alignIntronMin', '20',
'--alignIntronMax', '1000000',
'--alignMatesGapMax', '1000000',
'--alignSJoverhangMin', '8',
'--alignSJDBoverhangMin', '1',
'--sjdbScore', '1'
RSEM is run with the following parameters:
'--quiet',
'--no-qualities',
'-p', str(cores),
'--forward-prob', '0.5',
'--seed-length', '25',
'--fragment-length-mean', '-1.0',
'--bam', '/data/transcriptome.bam',