- Minimum python version -
python v3.6
regtools
- not required to run the script per se, but it is recommended to use regtools extracted junction files. For instance, we used this command to extract our junction files from bamsregtools junctions extract -a 8 -i 50 -I 500000 bamfile.bam -o outfile.junc
. See detailed regtools documentations here.
- Make sure intron junction annotation files are BED formatted (0 based left close right open). This is different from the standard leafcutter.
- leafcutter2 outputs BED formatted coordinates.
- key differences from leafcutter:
- leafcutter2 use the same set of filters to construct intron clusters.
- However, when counting junction reads towards predefined or on-demand-run intron clusters, no read filter is applied, essentially all junction reads are counted towards introns.
- The filtering options (--MINCLUREADS, --MINREADS, --MINCLURATIO) are only used if you are not providing a pre-defined set of intron clusters. They are not used against counting junction reads towards introns.
This script takes junction files produced by regtools as input and constructs intron clusters from them. Then, it processes the intron clusters to identify rarely spliced introns based on certain filtering cut-offs. The output is a text file that follows the same format as the standard leafcutter tool. The first column of each row shows the genome coordinates of introns and labels them as N[: noisy] or F[: functional/productive].
Main input files:
- junction files, processed using regtools extract junctions
.
Main output files:
-
{out_prefix}_perind.counts.noise.gz
: output functional introns (intact), and noisy introns. Note the start and end coordinates of noisy introns are merged to the min(starts) and max(ends) of all functional introns within cluster. -
{out_prefix}_perind_numers.counts.noise.gz
: same as above, except write numerators. -
{out_prefix}_perind.counts.noise_by_intron.gz
: same as the first output, except here noisy introns' coordinates are kept as their original coordinates. This is useful for diagnosis purposes.
Recommended parameters for running the script:
python scripts/leafcutter2_regtools.py \
-j junction_file_list_as_text_file \
-N intron_junction_annotation_file \
-o leafcutter2 \
-r dir_to_house_leafcutter2_output
This mode first generate intron clusters based on the junction files. Then it counts junction reads towards each classified introns.
junction_file_list_as_text_file
should be a text file listing path to each junction file, one path per lineintron_junction_annotation_file
is an annotation file for introns. A gencode based annotation file is provided atdata/gencode_v43_plus_v37_productive.intron_by_transcript_BEDlike.txt.gz
. Users can provide their own annotation file, but the format need to follow exactly, with last column being 'functional' or 'productive', and coordinates follow BED format. Users may also provide multiple annotation files, with file paths delimited by ",".-r
specify the directory of output, while-o
specify the prefix of output file names (not including directory name)
In some cases, perhaps you want to first generate intron clusters, then run multiple rounds of noisy counts towards the
same set of intron clusters, then you can first generate intron clusters. The script scripts/leafcutter_make_clusters.py
is a helpful script to make intron clusters separately. For instance, you may first run the clustering script to generate
intron clusters (filename: leafcutter_refined_noisy):
python scripts/leafcutter_make_clusters.py \
-j junction_file_list_as_text_file \
-o leafcutter2 \
-r dir_to_house_leafcutter2_output
Then run the leafcutter2 script to generate leafcutter2 outputs:
python scripts/leafcutter2_regtools.py \
-j junction_file_list_as_text_file \
-N intron_junction_annotation_file \
-o leafcutter2 \
-r dir_to_house_leafcutter2_output \
-c leafcutter2_refined_noisy
python scripts/leafcutter2_regtools.py -h
usage: leafcutter2_regtools.py [-h] -j JUNCFILES
[-o OUTPREFIX] [-q] [-r RUNDIR]
[-l MAXINTRONLEN] [-m MINCLUREADS]
[-M MINREADS] [-p MINCLURATIO]
[-c CLUSTER] [-k] [-C]
[-N NOISECLASS] [-f OFFSET] [-T]
optional arguments:
-h, --help show this help message and exit
-j JUNCFILES, --juncfiles JUNCFILES
text file with all junction files to be processed
-o OUTPREFIX, --outprefix OUTPREFIX
output prefix (default leafcutter)
-q, --quiet don't print status messages to stdout
-r RUNDIR, --rundir RUNDIR
write to directory (default ./)
-l MAXINTRONLEN, --maxintronlen MAXINTRONLEN
maximum intron length in bp (default 100,000bp)
-m MINCLUREADS, --minclureads MINCLUREADS
minimum reads in a cluster (default 30 reads)
-M MINREADS, --minreads MINREADS
minimum reads for a junction to be considered for
clustering(default 5 reads)
-p MINCLURATIO, --mincluratio MINCLURATIO
minimum fraction of reads in a cluster that support a
junction (default 0.001)
-c CLUSTER, --cluster CLUSTER
refined cluster file when clusters are already made
-k, --checkchrom check that the chromosomes are well formated e.g. chr1,
chr2, ..., or 1, 2, ...
-C, --includeconst also include constitutive introns
-N NOISECLASS, --noise NOISECLASS
Use provided intron_class.txt.gz to help identify noisy
junction. Multiple annotation files delimited by `,`.
-f OFFSET, --offset OFFSET
Offset sometimes useful for off by 1 annotations.
(default 0)
-T, --keeptemp keep temporary files. (default false)
The example
directory includes a snakemake noisy splicing QTL calling pipeline.
To run snakemake
, change directory to examples
. Run:
snakemake -c1 results/noisy/leafcutter_perind.counts.noise.gz