ylab-hi / ScanExitron

A computational workflow for exitron splicing identification

ScanExitron

A computational workflow for exitron splicing identification

Prerequisites

You need Python 3.12 to run ScanExitron.

Install the necessary packages via Anaconda

First install Anaconda (Python 3.12), then install the dependencies with conda from the bioconda channel.

conda install -c bioconda samtools
conda install -c bioconda bedtools
conda install -c bioconda pyfaidx

Install RegTools v0.4.2. Currently, ScanExitron does not support RegTools >= v0.5.

Prepare the human genome FASTA sequences and annotation GTF file.

# hg38 genome
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz

# hg19 genome
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz
gunzip hg19.fa.gz

# hg38 annotation
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/gencode.v37.annotation.gtf.gz
gunzip gencode.v37.annotation.gtf.gz

# hg19 annotation
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
gunzip gencode.v19.annotation.gtf.gz

# hg38 CDS
cat gencode.v37.annotation.gtf | awk 'OFS="\t" {if ($3=="CDS") {print $1,$4-1,$5,$10,$16,$7}}' | tr -d '";' > gencode.hg38.CDS.bed
# hg19 CDS
cat gencode.v19.annotation.gtf | awk 'BEGIN{OFS="\t"} { if ($3=="CDS") {if ($13=="ccdsid"){print $1,$4-1,$5,$20,$22,$7} else{ print $1,$4-1,$5,$18,$20,$7}}}' | tr -d '";' > gencode.hg19.CDS.bed
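
The coordinate shift in the commands above (`$4-1`) is the step most easily gotten wrong, so here is a minimal Python sketch of the same conversion for a single GTF record. It is not part of ScanExitron; the function name `gtf_cds_to_bed` is hypothetical, and it looks up `gene_id` and `gene_name` by key instead of by field position, which gives the same columns as the hg38 command.

```python
def gtf_cds_to_bed(line):
    """Convert one GTF line to a BED6-style tuple, or None if it is not a CDS record.

    Hypothetical helper for illustration only (not part of ScanExitron).
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] != "CDS":
        return None

    chrom, strand = fields[0], fields[6]
    start, end = int(fields[3]), int(fields[4])

    # The attribute column looks like: key "value"; key "value"; ...
    attrs = {}
    for item in fields[8].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        # Keep the first occurrence of a key (some keys, such as tag, repeat).
        attrs.setdefault(key, value.strip('"'))

    # GTF is 1-based and inclusive; BED is 0-based and half-open,
    # so only the start coordinate is shifted by one.
    return (chrom, start - 1, end,
            attrs.get("gene_id", "."), attrs.get("gene_name", "."), strand)
```

Applied to every line of the GTF file and written out tab-separated, this should yield the same six columns as the awk pipeline: chromosome, start, end, gene ID, gene symbol, strand.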

Configure the config.ini file

[fasta]

# reference genome file in FASTA format (absolute path)

hg38=/abs/path/to/hg38.fa
hg19=/abs/path/to/hg19.fa

[annotation]

# gene annotation file in GTF format (absolute path)

hg38=/abs/path/to/gencode.v37.annotation.gtf
hg19=/abs/path/to/gencode.v19.annotation.gtf

[cds]

# CDS annotation in BED format (absolute path)

hg38=/abs/path/to/gencode.hg38.CDS.bed
hg19=/abs/path/to/gencode.hg19.CDS.bed
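
Before a long run, it can help to confirm that every path in config.ini resolves to a real file. The sketch below reads the three sections shown above with Python's standard `configparser` module. The helper names `load_reference_paths` and `missing_files` are hypothetical, and this is not necessarily how ScanExitron itself reads the file.

```python
import configparser
import os

SECTIONS = ("fasta", "annotation", "cds")


def load_reference_paths(config_path, ref="hg38"):
    """Return {section: path} for one reference build (hg19 or hg38)."""
    parser = configparser.ConfigParser()
    with open(config_path) as handle:
        parser.read_file(handle)
    # parser.get raises NoSectionError / NoOptionError if an entry is absent.
    return {section: parser.get(section, ref) for section in SECTIONS}


def missing_files(paths):
    """List the sections whose configured file does not exist on disk."""
    return [section for section, path in paths.items()
            if not os.path.isfile(path)]
```

For example, `missing_files(load_reference_paths("config.ini", "hg38"))` returns an empty list when all three hg38 files are in place.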

Usage

Exitron calling using RNA-seq data

ScanExitron.py -i input_rna_seq_bam_file -r [hg38/hg19] -m mapping_quality

Options:

-h, --help            show this help message and exit
-i INPUT, --input INPUT
                        RNA-seq alignment file (BAM/CRAM)
-a AO, --ao AO        AO cutoff (default: 3)
-p PSO, --pso PSO     PSO cutoff (default: 0.05)
-m MAPQ, --mapq MAPQ  consider reads with MAPQ >= cutoff (default: 50)
-r {hg19,hg38}, --ref {hg19,hg38}
                        reference genome (default: hg38)

Input:

input_bam_file      :input RNA-seq BAM/CRAM file. (e.g., rna-seq.bam)
reference_genome    :specify reference genome (hg19 or hg38)

Output:

exitron_file        :reported exitrons in a TAB-delimited file. (e.g., rna-seq.exitron)
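
To process several samples, the call can be assembled from a script. The following is a minimal sketch that uses only the long options listed above. The helper names and the `script` argument are hypothetical, and it assumes ScanExitron.py is executable and on the PATH.

```python
import subprocess


def build_scanexitron_command(bam, ref="hg38", ao=3, pso=0.05, mapq=50,
                              script="ScanExitron.py"):
    """Build the argument list for one ScanExitron run (hypothetical helper)."""
    if ref not in ("hg19", "hg38"):
        raise ValueError("ref must be 'hg19' or 'hg38'")
    return [
        script,
        "--input", bam,
        "--ref", ref,
        "--ao", str(ao),
        "--pso", str(pso),
        "--mapq", str(mapq),
    ]


def run_scanexitron(bam, **options):
    """Run ScanExitron on one alignment file; raises if the exit status is non-zero."""
    subprocess.run(build_scanexitron_command(bam, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.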

Report Columns

Column Name       Description
chrom             The chromosome of this exitron
start             The start position of this exitron in the zero-based, half-open coordinate system
end               The end position of this exitron in the zero-based, half-open coordinate system
name              Identifier for the junction
ao                Observed supporting reads for the exitron
strand            The strand on which the exitron is identified
gene_symbol       The gene symbol of the affected gene
length            Length of the exitron
splice_site       The two base pairs at the donor and acceptor sites, separated by a hyphen
gene_id           The Ensembl ID of the affected gene
pso               The percent spliced out (PSO) index
psi               The percent spliced in (PSI) index
dp                The average depth of the exitron
total_junctions   The total number of junctions in the sample
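
For downstream filtering, the report can be read with Python's standard `csv` module. The sketch below assumes the columns appear in the order listed above and that `pso` is written as a decimal fraction. It skips a header line if one is present, because the exact file layout is not specified here. The function name `read_exitrons` is hypothetical.

```python
import csv

# Column order as listed in the report description (assumed, not verified).
COLUMNS = ["chrom", "start", "end", "name", "ao", "strand", "gene_symbol",
           "length", "splice_site", "gene_id", "pso", "psi", "dp",
           "total_junctions"]


def read_exitrons(lines, min_ao=3, min_pso=0.05):
    """Yield report rows as dicts, keeping those that pass the AO and PSO cutoffs."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0] == "chrom":  # blank line or header line
            continue
        record = dict(zip(COLUMNS, row))
        record["start"] = int(record["start"])
        record["end"] = int(record["end"])
        record["ao"] = int(record["ao"])
        record["pso"] = float(record["pso"])
        if record["ao"] >= min_ao and record["pso"] >= min_pso:
            yield record
```

An open file handle can be passed directly as `lines`, for example `with open("rna-seq.exitron") as handle: records = list(read_exitrons(handle, min_ao=5))`.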

We also keep RegTools interim results (rna-seq.janno) for developers.

For a detailed explanation, please refer to the RegTools documentation.

License

The project is licensed under the MIT license.

Contact

Bug reports or feature requests can be submitted on the ScanExitron GitHub page.

Citation

Please see and cite our papers in Molecular Cell and STAR Protocols.
