- Description
- Pipeline usage
- Pipeline commands
- Pipeline usage
-
- Fill in Excell file
-
- Convert Excell file to Yaml file
-
- pipeline part I : quality analysis
-
- pipeline part II : aligment and counting features
-
- Pipeline commands
- Quality control of the reads and statistics
-
- FastQC tool
-
- FastX tool
-
- Organism length and coverage
-
- Mapping the reads with Bowtie2 and TopHat
-
- Prepare index file (if needed)
-
- Download annotated reference genome from iGenomes
-
- Mapping with TopHat
-
- Counting mapped reads to features (genes, exons, ...)
-
- Counting with featureCounts
-
- Counting with HTSeq
-
- Statistical analysis
- Differentially expressed genes
- Gene Ontology analysis
- Quality control of the reads and statistics
Installations:
-
FastQC as standalone tool or from http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.3_devel.zip (reason see: http://seqanswers.com/forums/showthread.php?t=4846&page=16) to use in terminal pipeline
-
FastX to trim bad bases
-
parallel: on MAC OS use
$ brew install parallel
-
bowtie2
-
subread for featureCounts. Also as R package : Rsubread
Add to your path if needed
An easy to use general pipeline for RNAseq analysis that can quickly be adapted and run semi-independently. A user first has to specify all the parameters in an excell file for the scripts to run. The excell file also contains some general explanation of the different parameters. Next, this excell file can be converted to a yaml configuration (config) file via an R script. This yaml config file can subsequently be used for the quality analysis of the raw reads. TO BE DONE: if necessary trim raw fastq files.
The same yaml config file can then be used for Tophat2 aligment and creation of feature count tables based on featureCount or HTSeq.
TO BE DONE: the resulting count tables can be imported for Differential expression analysis with DESeq2 package in R.
Explanation of the individual commands / tools that can be - or already are - incorporated in the pipeline.
- Quality control: FastQC and FastX
- Mapping and alignment of reads: Bowtie(2) and Tophat2
- Counting mapped reads to features
Put script in your PATH $ echo $PATH
and make them executable $ chmod +x script
- Fill in Excell file
An excell file, that can be used as an example is given. Please do not delete or add additional columns or rows. Only the values of the second column (header = values) can be changed. More information about possible parameters can be found in the 3th column (header = description).
- Convert Excell file to Yaml file
# creates RNAseqConfig.yaml in working directory
$ makeYamlConfigFromXlsx.r /path-to-excell-file/preconfigfile.xls
# creates RNAseqConfig.yaml in designated directory
$ makeYamlConfigFromXlsx.r /path-to-excell-file/preconfigfile.xls /path-for-outputfile/
- pipeline part I : quality analysis
$ fastQC_pipeline.r /path-to-dir/RNAseqConfig.yaml
- pipeline part II : aligment and counting features
$ tophat2_count_pipeline.r /path-to-dir/RNAseqConfig.yaml
Analyze all the .fastq files with the FASTQC standalone tool
In terminal as part of pipeline
$ cd /folder/with/files.fastq
$ mkdir OutputFastQC # create folder to store the fastQC results
$ fastqc -h # help page with all the available parameters
$ fastqc *.fastq --threads 4 --outdir ./OutputFastQC # * for all fastq files, --threads = how many files to analyze at the same time
If FastQC analysis shows that bases at the end have a systematic low score, we can trim them
# example for just the first lines and removal of last 5 bases
$ fastx_timmer -h # help page
$ head file.fastq | fastx_trimmer -t 5 -Q33 # Q33 = use of the sanger/Phred+33 score
$ fastx_trimmer -Q33 -t 5 -i file.fastq -o trimmed_file.fastq # input and output files
$ find *.fastq | parallel "fastx_trimmer -Q33 -t 5 -i {} -o trimmed_{}" # use parallel to do all files at once (for loop would be alternative)
Go to NCBI Genome website and identify your genome / organism of interest
Your number of reads for each sample can be found in the FastQC quality report
Rule of thumb: for human genome (3 Gb) aim for at least 10 million reads
-
open terminal and go to data directory
-
try out bowtie2
$ bowtie2
-
The reference genome must be in specific format (index) for bowtie to be able to use it.
-
Download genome (fasta file) from NCBI Genome website, Ensembl website or UCSC website
-
Build the index for bowtie
$ bowtie2-build index_organism_genome.fasta index_organism_genome
Several pre-built indexes are available to download on the bowtie2 webpage.
iGenomes provides ready-to-use reference sequences and annotations from Ensembl, NCBI and UCSC. Example for mm9.
- download annotations and reference genome to current folder
$ wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Mus_musculus/UCSC/mm9/Mus_musculus_UCSC_mm9.tar.gz
- use 'nohup ... &' , executes commands after you exit from the terminal and/or 'nice', which determines the priority of the command -20 to 19, respctl highest to lowest
$ nohup nice -n 19 wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Mus_musculus/UCSC/mm9/Mus_musculus_UCSC_mm9.tar.gz &
- example to download mm10 to destination folder; takes very long time. Only necessary once on local or server
$ wget -P /Users/ruben/Documents/genomedata/ ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Mus_musculus/UCSC/mm10/Mus_musculus_UCSC_mm10.tar.gz
$ tar -zxvf /Users/ruben/Documents/genomedata/Mus_musculus_UCSC_mm10.tar.gz
TopHat is a splice junction mapper and can map across exon/exon boundaries.
Best to perform mapping on server with adequate power.
$tophat2 -p 20 --no-coverage-search --segment-mismatches 1 --segment-length 18 --GTF /mnt/nfs/data/ruben/iGenomes/Mus_musculus/UCSC/mm9/Annotation/Genes/genes.gtf --library-type fr-unstranded --output-dir s_4_1_I12_1093_01_KO13_b1_d0_tophat2 /mnt/nfs/data/ruben/iGenomes/Mus_musculus/UCSC/mm9/Sequence/Bowtie2Index/genome s_4_1_I12_1093_01_KO13_b1_d0.fastq.gz
Parameters for TopHat(2)
-
Can also be combined with the use of
nohup
and/ornice
-
tophat2 can recognize .fastq.gz files, so no need to convert them first
-
p or --num-threads : number of threads allocated to this command, the more the faster
-
no-coverage search : coverage search makes process really slow, can be switched off if you don't want to discover new splicing or new genes (see biostars)
-
--segment-mismatches : defaults to 2, set to 0 or 1 for short reads (36bp) (see seqanswers)
-
--segment-length : defaults to 25, set to half of read length for short reads (36/2 = 18) (see seqanswers)
-
--GTF : gene annotation file
-
--library-type : depends on the library preparation kit (see tophat FAQ)
-
--output-dir : defaults to tophat_out/ , better and easier to make output folders for file.fastq
-
build genome data
-
fastq or fast.gz file
For Agata's RNAseq data:
$ cd /folder_with_fastq.gz_files/
$ nohup find *fastq.gz | parallel -j 10 "tophat2 -p 20 --no-coverage-search --segment-mismatches 1 --segment-length 18 --GTF /mnt/nfs/data/ruben/iGenomes/Mus_musculus/UCSC/mm9/Annotation/Genes/genes.gtf --library-type fr-unstranded --output-dir {}_tophat2 /mnt/nfs/data/ruben/iGenomes/Mus_musculus/UCSC/mm9/Sequence/Bowtie2Index/genome {}" &
Create Index, required for some downstream analyses (like IGV):
$ find */accepted_hits.bam | parallel "samtools index {}"
Several tools/algorithms/software can be used to count the mapped features:
Here you can find a speed/memory comparison.
On server:
$ featureCounts -a /mnt/nfs/data/ruben/iGenomes/Mus_musculus/UCSC/mm9/Annotation/Genes/genes.gtf -o featC_counts.txt -t exon -g gene_name -T 10 */accepted_hits.bam
Most default settings are ok.
On server:
$ samtools view accepted_hits.bam | htseq-count - /mnt/nfs/data/ruben/iGenomes/Mus_musculus/UCSC/mm9/Annotation/Genes/genes.gtf > sample_accepted_hits.htcounts
# for all accepted_hits.bam, make sure you are in the right folder
$ nohup find */accepted_hits.bam | parallel -j 10 "samtools view {} | htseq-count - /mnt/nfs/data/ruben/iGenomes/Mus_musculus/UCSC/mm9/Annotation/Genes/genes.gtf > {}.HTseqcounts" &
Or one by one with for loop: TO BE TESTED:
$ cd /folder_with_tophatOutput_folders/
# create script
$ nano HTseqForloop.sh
--------------------------
#!/usr/bin/bash
date
for i in `find */accepted_hits.bam`;
do
bam=$i
count="${bam/.bam}/_HTseqcount"
echo -e "Counting"$bam
samtools view $bam | htseq-count - /mnt/nfs/data/ruben/iGenomes/Mus_musculus/UCSC/mm9/Annotation/Genes/genes.gtf > $count
done
date
--------------------------
$ chmod +x HTseqForloop.sh
$ ./HTseqForloop.sh
-
DESeq2
-
EdgeR
-
topGO
-
Gage & Pathview
RESOURCES USED
-
complete and well explained tutorial: http://bioconnector.github.io/workshops/lessons/rnaseq-1day/
-
iGenomes reference genomes: http://support.illumina.com/sequencing/sequencing_software/igenome.html
-
RNAseq with iGenomes: http://www.illumina.com/documents/products/technotes/RNASeqAnalysisTopHat.pdf
-
https://biobeat.wordpress.com/2013/05/16/bowtie2-htseq-count-reference-downloaded-from-ncbi/
-
complete pipeline in R: http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html
-
super blogspot: http://genomespot.blogspot.be/
-
interesting blogspot: http://crazyhottommy.blogspot.be/
-
complete tutorial: http://web.science.uu.nl/pmi/publications/PDF/2013/TiPS-VanVerk-Hickman-2013-Hands_on_Tutorial.pdf