SpliceLauncher

SpliceLauncher is a pipeline tool to study the alternative splicing. It works in three steps:

Get a read count matrix from fastq files, by a dedicated RNAseq pipeline (A step in diagram below).
Generate data files used hereafter (B step in diagram below)
Run SpliceLauncher from a read count matrix (C step and furthermore in diagram below).

Table of contents

Repository contents
Prerequisites to install SpliceLauncher
Installing SpliceLauncher
Running the SpliceLauncher tests
- Output directory tree
SpliceLauncher options
Authors
License

Repository contents

dataTest: example of input files
scripts: complementary scripts to run SpliceLauncher

Prerequisites to install SpliceLauncher

The SpliceLauncher pipeline needs to install the following tools and R librairies:

STAR (v2.7 or later)
samtools (v1.3 or later)
BEDtools (v2.17 or later)
R with WriteXLS and Cairo packages
Perl

STAR

Following instruction were from the STAR manual

Get the g++ compiler for linux

sudo apt-get update
sudo apt-get install g++
sudo apt-get install make

Download the latest release and uncompress it

# Get latest STAR source
version="2.7.0c"
wget https://github.com/alexdobin/STAR/archive/${version}.tar.gz
tar -xzf ${version}.tar.gz
cd STAR-${version}

# Alternatively, get STAR source using git
git clone https://github.com/alexdobin/STAR.git

Compile under Linux

# Compile
cd STAR/source
make STAR

Samtools

Download the samtools package at: https://github.com/samtools/samtools/releases/latest

Configure samtools for linux:

cd samtools-1.x
./configure --prefix=/where/to/install
make
make install

For more information, please see the samtools manual

BEDtools

Installation of BEDtools for linux:

wget https://github.com/arq5x/bedtools2/releases/download/v2.25.0/bedtools-2.25.0.tar.gz
tar -zxvf bedtools-2.25.0.tar.gz
cd bedtools2
make

For more information, please see the BEDtools tutorial

Install R libraries

The library WriteXLS allows to save result in xlsx format if you do not want to install it, use the --txtOut option. The library Cairo allows to print result in pdf format if you do not want to install it, do not add --Graphics option. Open the R console:

install.packages("WriteXLS")
install.packages("Cairo")

Installing SpliceLauncher

Download the latest release from of SpliceLauncher source using git

git clone https://github.com/LBGC-CFB/SpliceLauncher
cd ./SpliceLauncher

Singularity image

As the Singularity image config file is not writable, you need to use a local version of the config file SpliceLauncher and all its dependencies are also integrated in a Singularity image:

To build it:

sudo singularity build /path/to/SpliceLauncher.simg /path/to/splicelauncher.recipe

To use it

sudo singularity run /path/to/SpliceLauncher.simg --config /path/to/my_config.cfg --help

Download the reference files

The reference files are the genome (Fasta) and the corresponding annotation file (GFF3):

Reference genome in fasta format
The annotation file in GFF v3 format

Steps:

Download Fasta genome: from RefSeq FTP server or from Gencode.

For example, human hg19 genome file from RefSeq:

#the ftp URL depends on your assembly genome choice
wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.fna.gz
gunzip ./GRCh37_latest_genomic.fna.gz

Download the GFF annotation file, either from RefSeq FTP server or from Gencode.

For example, human hg19 annotation file from RefSeq:

wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gff.gz
gunzip ./GRCh37_latest_genomic.gff.gz
head ./GRCh37_latest_genomic.gff
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build GRCh37.p13
#!genome-build-accession NCBI_Assembly:GCF_000001405.25
#!annotation-date
#!annotation-source
##sequence-region NC_000001.10 1 249250621
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606
NC_000001.10	RefSeq	region	1	249250621	.	+	.	ID=id0;Dbxref=taxon:9606;Name=1;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_000001.10	BestRefSeq	gene	11874	14409	.	+	.	ID=gene0;Dbxref=GeneID:100287102,HGNC:HGNC:37102;Name=DDX11L1;description=DEAD/H-box helicase 11 like 1;gbkey=Gene;gene=DDX11L1;gene_biotype=misc_RNA;pseudo=true
NC_000001.10	BestRefSeq	transcript	11874	14409	.	+	.	ID=rna0;Parent=gene0;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;Name=NR_046018.2;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1;transcript_id=NR_046018.2
NC_000001.10	BestRefSeq	exon	11874	12227	.	+	.	ID=id1;Parent=rna0;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1;transcript_id=NR_046018.2

[Optional] Convert contig of RefSeq to UCSC chromosome names, you will need to download the assembly report, an example of this report is provide in dataTest folder but it is a truncating example so do not use for your own genome. For url example of an assembly report of GRCh37 opf RefSeq can be dowload from https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_assembly_report.txt Use the --assembly_report of INSTALL mode to launch the convertion. This step usually takes one hour.
[Optionnal] To reduce needed memory, we can also restrict the analysis to the primary assembly, without unplaced contigs:

grep ">" GRCh37_latest_genomic.fna | grep -v "unplaced genomic contig"| grep -v "unlocalized genomic contig" | grep -v "genomic patch"| grep -v "alternate locus" | sed 's/^>//' > chr_names
seqtk subseq GRCh37_latest_genomic.fna chr_names > GRCh37_latest_genomic.sub.fna
cut -f 1 -d ' ' chr_names > chr_names_id
head -n 9 GRCh37_latest_genomic.gff > GRCh37_latest_genomic.sub.gff
grep -f chr_names_id GRCh37_latest_genomic.gff >> GRCh37_latest_genomic.sub.gff

Configure SpliceLauncher with INSTALL mode

SpliceLauncher comes with a ready to use config.cfg file. It contains the paths of software and files used by SpliceLauncher. The INSTALL mode of SpliceLauncher updates this config cfg file. If you define the path to GFF (v3) file and path to the FASTA genome, the INSTALL mode will extract all necessary information from this GFF and indexing the STAR genome. This informations are stored in a BED file that contains the exon coordinates, in a sjdb file that contains the intron coordinates and a text file that contains the details of transcript structures. You need to define where these files will saving by the -O, --output argument

Use INSTALL mode of SpliceLauncher:

cd /path/to/SpliceLauncher/
mkdir ./refSpliceLauncher # Here this folder will contain the reference files used by SpliceLauncher
bash ./SpliceLauncher.sh --runMode INSTALL \
    -O ./refSpliceLauncher \
    --STAR /path/to/STAR \
    --samtools /path/to/samtools \
    --bedtools /path/to/bedtools \
    --gff /path/to/gff \
    --threads < number of thread > \
    --fasta /path/to/fasta

Running the SpliceLauncher tests

The example files are provided in dataTest, with the example data provided in single end RNAseq (1x75pb) on BRCA1 and BRCA2 transcripts:

cd /path/to/SpliceLauncher
bash ./SpliceLauncher.sh --runMode Align,Count,SpliceLauncher -F ./dataTest/fastq/ -O ./testSpliceLauncher/ \
    # Optional \
    -t <number of thread> \
    -m <allowed memory in bits> \
    --Graphics \
    --tmpDir /path/to/tmpDir # path to save tmp file during alignment

Output directory tree

SpliceLauncher/outdir
├── Bam
│   ├── {sample}.Aligned.sortedByCoord.out.bam
|   ├── {sample}.Aligned.sortedByCoord.out.bam.csi
|   ├── {sample}.Aligned.sortedByCoord.out_juncs.bed
|   ├── {sample}.SJ.out.tab
│   └── ...
├── getClosestExons
│   ├── {sample}.Aligned.sortedByCoord.out.count
│   └── ...
├── {run name}_results
│   ├── {run name}_figures_output
│       ├── {run name}_{sample}.pdf
│       └── ...
│   ├── {run name}_outputSpliceLauncher.xlsx
│   └── {run name}.bed
├── {run name}.txt
├── {run name}_report_{run date}.txt

The results of SpliceLauncher analysis are in {run name}_results.

The final results are displayed in the file {run name}_outputSpliceLauncher.xlsx. The scheme of this file is:

Column names	Example	Description
Conca	chr13_32915333_32920963	The junction id (chr_start_end)
chr	chr13	Chromosome number
start	32915333	Genomic coordinate of start junction End if on reverse strand
end	32920963	Genomic coordinate of end junction Start if on reverse strand
strand	+	Strand of the junction ('+': forward; '-':reverse)
Strand_transcript	forward	Strand of transcript
NM	NM_000059	The transcript id according RefSeq nomenclature
Gene	BRCA2	Gene symbol
Sample	2250	Read count
P_Sample	15.25659623	% of relative expression
event_type	SkipEx	The nature of junction: Physio: Natural junction SkipEx: Exon skipping 5AS: Donor splice site shift 3AS: Acceptor splice site shift NoData: Unannotated juntion
AnnotJuncs	∆12	The junction names
cStart	c.6841	Transcriptomic start coordinate of the junction
cEnd	c.6938	Transcriptomic end coordinate of the junction
mean_percent	12.60242	Average in % of relative expression across samples
read_mean	2683.769231	Average of read count across samples
nbSamp	11	Number of time that the junction has been seen in samples
DistribAjust	-	The Distribution of junction expression (Gamma/N.binom)
Significative	NO	If a sample shown an abnormal expression of the junction
filterInterpretation	Unique junction	If a sample had an abnormal expression: Aberrant junction For unmodelized junction, if max expression >1%: Unique junction

SpliceLauncher options

--runMode INSTALL,Align,Count,SpliceLauncher

The runMode defines the steps of analysis with:
- INSTALL: Updates the config.cfg file for SpliceLauncher pipeline
- Align: Generates BAM files from the FASTQ files
- Count: Generates the matrix read count from the BAM files
- SpliceLauncher: Generates final output from the matrix read count

Option for INSTALL mode

-C, --config /path/to/configuration file/

Path to the config.cfg file, only if you want to use your own config file

-O, --output /path/to/output/

Directory to save the reference files (BED, sjdb, txt) and the indexed genome

--STAR /path/to/STAR

Path to the STAR executable

--samtools /path/to/samtools

Path to the samtools executable

--bedtools /path/to/bedtools

Path to the bedtools executable

--gff /path/to/gff file

Path to the GFF file (v3)

--fasta /path/to/fasta

Path to the genome fasta file

--mane /path/to/MANElistFile.txt

List of MANE transcripts, current version donwloaded from https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.1/MANE.GRCh38.v1.1.summary.txt.gz

--assembly_report /path/to/GRChXX_latest_assembly_report.txt

Path to assembly report to convert contig in UCSC chromosome names

-t, --threads N

Nb threads used to index the STAR genome

Option for Align mode

-F, --fastq /path/to/fastq/

Repository of the FASTQ files

-O, --output /path/to/output/

Repository of the output files

-p

Processes to paired-end analysis

-t, --threads N

Nb threads used for the alignment

--tmpDir /path/to/tmp directory of files from alignment

path to directory where tempory files will saved during alignment, please point toward empty directory

-g, --genome /path/to/genome

Path to the genome directory, only if you to use a genome directory different of the genome defined in config.cfg file

--STAR /path/to/STAR

Path to the STAR executable, only if you to use a STAR software different of the STAR defined in config.cfg file

--samtools /path/to/samtools

Path to the samtools executable, only if you to use a samtools software different of the samtools defined in config.cfg file

Option for Count mode

-B, --bam /path/to/BAM files

Repository of the BAM folder

-O, --output /path/to/output/

Repository of the output files

--bedtools\t/path/to/bedtools

Path to the bedtools executable, only if you to use a bedtools software different of the bedtools defined in config.cfg file

-b, --BEDannot /path/to/your_annotation_file.bed

Path to exon coordinates file (in BED format), only if you to use exon coordinates different of the coordinates defined in config.cfg file

--fastCount

improve junction count runtime, in addtition increase the specificty but reduce the sensitivity

-p

Processes to paired-end analysis, deprecated if use --fastCount option

--samtools /path/to/samtools

Path to the samtools executable, only if you to use a samtools software different of the samtools defined in config.cfg file, deprecated if use --fastCount option

Option for SpliceLauncher mode

-I, --input/path/to/inputFile

Read count matrix (.txt)

-O, --output /path/to/output/

Directory to save the results

--transcriptList /path/to/transcriptList.txt

Set the list of transcripts to use as reference

--txtOut

Print main output in text instead of xls

--bedOut

Get the output in BED format

--Graphics

Display graphics of alternative junctions (Warnings: increase the runtime)

-n, --NbIntervals 10

Nb interval of Neg Binom (Integer)

--min_cov 5

Minimal number of read supporting a junction (Integer)

--SampleNames name1|name2|name3

Sample names, '|'-separated, by default use the sample file names

If list of transcripts (--TranscriptList): --removeOther

Remove the genes with unselected transcripts to improve runtime

If graphics (-g, --Graphics): --threshold 1

Threshold to shown junctions (%)

-R, --RefSeqAnnot /path/to/RefSpliceLauncher.txt

Transcript information file, only if you to use a transcript information file different of file defined in config.cfg file

Authors

Raphael Leman - raphaelleman
- You can contact me at: r.leman@baclesse.unicancer.fr or raphael.leman@orange.fr

Cite as: SpliceLauncher: a tool for detection, annotation and relative quantification of alternative junctions from RNAseq data Raphaël Leman, Valentin Harter, Alexandre Atkinson, Grégoire Davy, Antoine Rousselin, Etienne Muller, Laurent Castéra, Fréderic Lemoine, Pierre de la Grange, Marine Guillaud-Bataille, Dominique Vaur, Sophie Krieger, Bioinformatics

License

This project is licensed under the MIT License - see the LICENSE file for details

LBGC-CFB / SpliceLauncher

SpliceLauncher

Repository contents

Prerequisites to install SpliceLauncher

STAR

Samtools

BEDtools

Install R libraries

Installing SpliceLauncher

Singularity image

Download the reference files

Configure SpliceLauncher with INSTALL mode

Running the SpliceLauncher tests

Output directory tree

SpliceLauncher options

Option for INSTALL mode

Option for Align mode

Option for Count mode

Option for SpliceLauncher mode

Authors

License

About

Languages