This repository analyzes PacBio transcriptomic long-read datasets extracted from SRA using Scallop-LR, Iso-Seq Analysis, and StringTie. Scallop-LR is our released long-read transcript assembler, and StringTie is a leading short-read transcript assembler that can also assemble long reads. Iso-Seq Analysis is a software system developed by PacBio that takes subreads as input and outputs polished consensus isoforms (transcripts). The predicted transcripts from Iso-Seq Analysis, Scallop-LR, and StringTie are evaluated using multiple evaluation methods: Gffcompare, SQANTI, rnaQUAST, and Transrate.
In most of the PacBio datasets in SRA, one BioSample has multiple SRA Runs because the experimenters used multiple "movies" to increase the coverage so that low-abundance, long isoforms can be captured in analysis. The experimenters also used a "size selection" sequencing strategy, and thus different SRA Runs are designated for different size ranges. Therefore, we use one BioSample instead of one SRA Run to represent one dataset in our analysis, and we merge multiple SRA Runs that belong to the same BioSample into that dataset.
The following are the 26 datasets used in the analysis with their corresponding SRA Study IDs and BioSample IDs. Each dataset corresponds to one BioSample and is named by the BioSample ID (except that datasets 15-18 are four replicates for one BioSample). The data can be extracted from SRA, preprocessed, and merged into a BioSample-based dataset using the scripts in this repository.
Dataset | BioSample | SRA Study | Organism |
---|---|---|---|
1 | SAMN00001694 | ERP015321 | Homo sapiens |
2 | SAMN00001695 | ERP015321 | Homo sapiens |
3 | SAMN00001696 | ERP015321 | Homo sapiens |
4 | SAMN00006465 | ERP015321 | Homo sapiens |
5 | SAMN00006466 | ERP015321 | Homo sapiens |
6 | SAMN00006467 | ERP015321 | Homo sapiens |
7 | SAMN00006579 | ERP015321 | Homo sapiens |
8 | SAMN00006580 | ERP015321 | Homo sapiens |
9 | SAMN00006581 | ERP015321 | Homo sapiens |
10 | SAMN08182059 | SRP126849 | Homo sapiens |
11 | SAMN08182060 | SRP126849 | Homo sapiens |
12 | SAMN04563763 | SRP071928 | Homo sapiens |
13 | SAMN07611993 | SRP098984 | Homo sapiens |
14 | SAMN04169050 | SRP068953 | Homo sapiens |
15 | SAMN04251426.1 | SRP065930 | Homo sapiens |
16 | SAMN04251426.2 | SRP065930 | Homo sapiens |
17 | SAMN04251426.3 | SRP065930 | Homo sapiens |
18 | SAMN04251426.4 | SRP065930 | Homo sapiens |
19 | SAMEA3374575 | ERP010189 | Mus musculus |
20 | SAMEA3374576 | ERP010189 | Mus musculus |
21 | SAMEA3374577 | ERP010189 | Mus musculus |
22 | SAMEA3374578 | ERP010189 | Mus musculus |
23 | SAMEA3374579 | ERP010189 | Mus musculus |
24 | SAMEA3374580 | ERP010189 | Mus musculus |
25 | SAMEA3374581 | ERP010189 | Mus musculus |
26 | SAMEA3374582 | ERP010189 | Mus musculus |
We have both human and mouse datasets, grouped under the `human/` and `mouse/` directories. Under each of them, there is a directory for each SRA Study (named by the SRA Study ID). Under each SRA Study directory, there is a set of directories for all the SRA Runs that belong to this SRA Study, each named by an SRA Run ID. There is also a directory called `BioSamples/`. Under `BioSamples/`, there is a set of directories for all the BioSamples that belong to this SRA Study, each named by a BioSample ID. Each BioSample directory is dedicated to a BioSample-based dataset and should have a file called `SRA_Runs` that lists all the SRA Run IDs that belong to this BioSample, one SRA Run ID per line.
Please use this directory structure for your datasets when running the scripts in this repository.
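The layout described above can be sketched as a small shell helper. This is only an illustration, not a script from this repository: the function name `make_dataset_layout` and the run IDs in the usage comment are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch: create the directory skeleton for one BioSample-based dataset.
# Usage: make_dataset_layout <root> <organism> <study_id> <biosample_id> <run_id>...
make_dataset_layout() {
    root=$1; organism=$2; study=$3; biosample=$4
    shift 4
    study_dir="$root/$organism/$study"
    biosample_dir="$study_dir/BioSamples/$biosample"
    mkdir -p "$biosample_dir" || return 1
    # Start with an empty SRA_Runs file, then add one SRA Run ID per line.
    : > "$biosample_dir/SRA_Runs"
    for run in "$@"; do
        mkdir -p "$study_dir/$run" || return 1
        printf '%s\n' "$run" >> "$biosample_dir/SRA_Runs"
    done
}
# Example with placeholder run IDs (not real accessions of this BioSample):
# make_dataset_layout . human SRP126849 SAMN08182059 RUN_ID_1 RUN_ID_2
```

Keeping the run list in `SRA_Runs` next to the merged dataset means the BioSample directory alone records which runs were merged.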
The following are the tools/programs that are used in the analysis and their corresponding versions:
Tool/Program | Version |
---|---|
Iso-Seq Analysis | Iso-Seq2 from SMRT Link v5.1.0 |
Minimap2 | v2.2 |
StringTie | v1.3.2d |
Scallop-LR | v0.9.1 |
Gffcompare | v0.9.9c |
SQANTI | v1.2 |
rnaQUAST | v1.5.1 |
Transrate | v1.0.3 |
GMAP | version 2017-09-30 |
gffread | |
gtfcuff | |
bamkit | |
bioawk | awk version 20110810 |
You need to download and compile these tools/programs. The scripts in this repository expect all of their binaries to be in your $PATH.
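A quick way to confirm this before starting a long run is to probe $PATH for each binary. The helper below is a minimal sketch; the function name `check_in_path` is hypothetical, and the binary names in the usage comment are assumptions that may differ in your installation.

```shell
#!/bin/sh
# Sketch: print each command that is missing from $PATH.
# Returns 0 if all commands are found, 1 otherwise.
check_in_path() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf 'missing from PATH: %s\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}
# The binary names below are assumptions; adjust them to your installation:
# check_in_path minimap2 stringtie scallop gffcompare gffread gtfcuff bioawk
```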
Note: the Scallop-LR version used by the scripts here is v0.9.1. In Scallop-LR v0.9.1, the binary is named `scallop`, which is the same name as the binary of the short-read assembler Scallop. Please ensure that the `scallop` binary found in your $PATH is from Scallop-LR rather than from the short-read assembler Scallop: `scallop --version` should return `isoseq-v0.9.1`.
Starting with Scallop-LR v0.9.2, the binary is named `scallop-lr`, so take the new name into account if you later switch to the newer version.
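The version check can be scripted so that a pipeline stops early when the wrong `scallop` is picked up. This is a sketch only: the function name `is_scallop_lr_v091` is hypothetical, and it assumes that `scallop --version` prints exactly the version string and nothing else.

```shell
#!/bin/sh
# Sketch: succeed only if the given string is the version reported by Scallop-LR v0.9.1.
is_scallop_lr_v091() {
    [ "$1" = "isoseq-v0.9.1" ]
}
# Usage, assuming `scallop` is in $PATH and prints only its version string:
# is_scallop_lr_v091 "$(scallop --version 2>&1)" \
#     || echo "scallop in PATH is not Scallop-LR v0.9.1" >&2
```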
The following are the reference genomes, reference annotations, and reference transcriptomes (from Ensembl) that are used in the analysis:
Reference | Human | Mouse |
---|---|---|
reference genome | GRCh38 | GRCm38 |
reference annotation | Homo_sapiens.GRCh38.90.gtf | Mus_musculus.GRCm38.92.gtf |
reference transcriptome | Homo_sapiens.GRCh38.cdna.all.fa | Mus_musculus.GRCm38.cdna.all.fa |
Please replace the locations of the reference genomes, reference annotations, reference transcriptomes, gene database, gmap db, and gmap reference sets (for Iso-Seq) in the scripts of this repository with your actual locations. The lines of code that need to be changed are marked by `# REPLACE WITH YOUR ACTUAL PATH TO THE REFERENCE DATA` in the scripts.
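To locate every line that still needs editing, the marker comment can be searched for directly. The helper below is a minimal sketch; the function name `list_paths_to_replace` is hypothetical, and only the marker text itself comes from the scripts.

```shell
#!/bin/sh
# Marker comment used in the scripts of this repository.
marker='# REPLACE WITH YOUR ACTUAL PATH TO THE REFERENCE DATA'

# Sketch: print "file:line-number:line" for every marked line in the given files.
# /dev/null is appended so grep prints the file name even for a single file.
list_paths_to_replace() {
    grep -n -F -- "$marker" "$@" /dev/null
}
# Usage from the repository root:
# list_paths_to_replace *.sh
```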
To prepare gmap db and gmap reference sets (for Iso-Seq), use the following command:
`fasta-to-gmap-reference <reference-fasta-file> <output-dir> <name>`
where `<reference-fasta-file>` is the full path of the reference genome FASTA file, `<output-dir>` is the output directory for the gmap db and gmap reference set, and `<name>` is the sub-directory name for the output GmapReferenceSet XML file. The gmap reference set XML file `gmapreferenceset.xml` and `gmap_db` will be created under `<output-dir>/<name>/`. This command requires the `gmap_build` executable in $PATH.
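After the command finishes, a short check can confirm that both outputs exist before pointing the Iso-Seq scripts at them. This is a sketch with a hypothetical function name, `gmap_reference_ready`; it assumes both outputs sit directly under `<output-dir>/<name>/` as described above.

```shell
#!/bin/sh
# Sketch: succeed only if the gmap reference set XML file and gmap_db both exist.
# Usage: gmap_reference_ready <output-dir> <name>
gmap_reference_ready() {
    ref_dir="$1/$2"
    [ -f "$ref_dir/gmapreferenceset.xml" ] && [ -e "$ref_dir/gmap_db" ]
}
```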
- Pre-process each SRA Run that belongs to this BioSample:
  - Under the SRA Study directory, create a directory named by this SRA Run ID. In this SRA Run directory, download the hdf5 of this SRA Run from SRA and untar the hdf5.
  - Copy `run_bax2bam.sh` from this repository to the current SRA Run directory. Edit `run_bax2bam.sh` according to the current SRA Run ID and movie (bax.h5 files). Run `run_bax2bam.sh` to convert the bax.h5 files to subreads bam files:
    `run_bax2bam.sh &`
  - Create bai index files for the subreads bam files:
    `create_bai.sh &`
- Run Iso-Seq Analysis for this BioSample:
  - Under the `BioSamples/` directory of the SRA Study directory, create a directory named by this BioSample ID. In this BioSample directory, create the `SRA_Runs` file. The `SRA_Runs` file should contain all the SRA Run IDs of this BioSample, one SRA Run ID per line.
  - Create the merged dataset from all the SRA Runs of this BioSample and perform the Iso-Seq full-analysis:
    `biosample_isoseq.sh <BioSample_ID> <Top_dir> <Organism> &`
  - Post analysis, including Gffcompare and Transrate on the final isoforms (after the Iso-Seq full-analysis has completed successfully):
    `post_isoseq_analysis.sh <BioSample_ID> <Organism> <Install_dir> &`
- Run Minimap2 + Scallop-LR for this BioSample, and run Gffcompare and Transrate on the predicted transcripts:
  `minimap2_scallop_isoseq_allreads_pipeline.sh <Full_path_run_dir> <Organism> <Install_dir> &`
- Run Minimap2 + StringTie for this BioSample, and run Gffcompare and Transrate on the predicted transcripts:
  `minimap2_stringtie_1_allreads_pipeline.sh <Full_path_run_dir> <Organism> <Install_dir> &`
- Run rnaQUAST on Scallop-LR transcripts, StringTie transcripts, and Iso-Seq isoforms for this BioSample:
  `rnaQUAST_pipeline.sh <Full_path_run_dir> <Organism> <Merge_dir> &`
- Run SQANTI on Scallop-LR transcripts and Iso-Seq isoforms for this BioSample:
  `sqanti_pipeline.sh <Full_path_run_dir> <Organism> <Merge_dir> &`
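Because the merge step depends on every SRA Run of the BioSample having been pre-processed, it may help to verify the run directories before launching `biosample_isoseq.sh`. The helper below is a sketch, not part of this repository: the function name `missing_run_dirs` is hypothetical, and it only checks that each run directory exists, not that its subreads bam files are complete.

```shell
#!/bin/sh
# Sketch: print every SRA Run ID listed in SRA_Runs that has no directory
# under the SRA Study directory.
# Usage: missing_run_dirs <study_dir> <biosample_id>
# Returns 0 if all run directories exist, 1 if any is missing,
# 2 if the SRA_Runs file itself is missing.
missing_run_dirs() {
    study_dir=$1
    runs_file="$study_dir/BioSamples/$2/SRA_Runs"
    if [ ! -f "$runs_file" ]; then
        printf 'no SRA_Runs file: %s\n' "$runs_file" >&2
        return 2
    fi
    status=0
    # Read one SRA Run ID per line; tolerate a missing final newline.
    while IFS= read -r run || [ -n "$run" ]; do
        [ -n "$run" ] || continue
        if [ ! -d "$study_dir/$run" ]; then
            printf '%s\n' "$run"
            status=1
        fi
    done < "$runs_file"
    return "$status"
}
```

An empty output with exit status 0 indicates that every listed run has a directory in place.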
The descriptions of the command-line arguments of these scripts can be found at the beginning of each script, or can be displayed by running the script without any command-line argument.
The results of Scallop-LR and StringTie are located under the auto-generated `ccs_flnc_and_nfl/minimap2/` directory.
The results of Iso-Seq Analysis are inside the auto-generated `final_collapsed_isoforms/` directory.
Note: the scripts for Iso-Seq Analysis are for SMRT Link v5.1.0 and are not compatible with later SMRT Link versions.