dicerna / lrassemblyanalysis

Overview: Long-Read Transcript Assembly Analysis

This repository uses Scallop-LR, Iso-Seq Analysis, and StringTie to analyze PacBio transcriptomic long-read datasets extracted from SRA. Scallop-LR is our released long-read transcript assembler, and StringTie is a leading short-read transcript assembler that can also assemble long reads. Iso-Seq Analysis is a software system developed by PacBio that takes subreads as input and outputs polished consensus isoforms (transcripts). The transcripts predicted by Iso-Seq Analysis, Scallop-LR, and StringTie are evaluated with four tools: Gffcompare, SQANTI, rnaQUAST, and Transrate.

Datasets and the Directory Structure for Datasets

In most of the PacBio datasets in SRA, one BioSample has multiple SRA Runs because the experimenters used multiple "movies" to increase the coverage so that low-abundance, long isoforms can be captured in analysis. The experimenters also used a "size selection" sequencing strategy, and thus different SRA Runs are designated for different size ranges. Therefore, we use one BioSample instead of one SRA Run to represent one dataset in our analysis, and we merge multiple SRA Runs that belong to the same BioSample into that dataset.

The following are the 26 datasets used in the analysis, with their corresponding SRA Study IDs and BioSample IDs. Each dataset corresponds to one BioSample and is named by the BioSample ID (except that datasets 15-18 are four replicates of one BioSample). The data can be extracted from SRA, preprocessed, and merged into a BioSample-based dataset using the scripts in this repository.

Dataset BioSample SRA Study Organism
1 SAMN00001694 ERP015321 Homo sapiens
2 SAMN00001695 ERP015321 Homo sapiens
3 SAMN00001696 ERP015321 Homo sapiens
4 SAMN00006465 ERP015321 Homo sapiens
5 SAMN00006466 ERP015321 Homo sapiens
6 SAMN00006467 ERP015321 Homo sapiens
7 SAMN00006579 ERP015321 Homo sapiens
8 SAMN00006580 ERP015321 Homo sapiens
9 SAMN00006581 ERP015321 Homo sapiens
10 SAMN08182059 SRP126849 Homo sapiens
11 SAMN08182060 SRP126849 Homo sapiens
12 SAMN04563763 SRP071928 Homo sapiens
13 SAMN07611993 SRP098984 Homo sapiens
14 SAMN04169050 SRP068953 Homo sapiens
15 SAMN04251426.1 SRP065930 Homo sapiens
16 SAMN04251426.2 SRP065930 Homo sapiens
17 SAMN04251426.3 SRP065930 Homo sapiens
18 SAMN04251426.4 SRP065930 Homo sapiens
19 SAMEA3374575 ERP010189 Mus musculus
20 SAMEA3374576 ERP010189 Mus musculus
21 SAMEA3374577 ERP010189 Mus musculus
22 SAMEA3374578 ERP010189 Mus musculus
23 SAMEA3374579 ERP010189 Mus musculus
24 SAMEA3374580 ERP010189 Mus musculus
25 SAMEA3374581 ERP010189 Mus musculus
26 SAMEA3374582 ERP010189 Mus musculus

We have both human and mouse datasets, grouped under the human/ and mouse/ directories. Under each of them, there is a directory for each SRA Study, named by the SRA Study ID. Under each SRA Study directory, there is one directory per SRA Run that belongs to this SRA Study, each named by its SRA Run ID, plus a directory called BioSamples/. Under BioSamples/, there is one directory per BioSample that belongs to this SRA Study, each named by its BioSample ID and dedicated to that BioSample-based dataset. Each BioSample directory should contain a file called SRA_Runs that lists all the SRA Run IDs belonging to this BioSample, one SRA Run ID per line.

Please use the above directory structure for your datasets when running the analysis scripts in this repository.
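The layout described above can be sketched as a small shell helper. This is only an illustration: the function name make_dataset_layout, the temporary top-level directory, and the run IDs SRR0000001 and SRR0000002 are made up for the example and are not the real SRA Runs of this BioSample.

```shell
#!/usr/bin/env bash
# Sketch: create the directory layout for one BioSample-based dataset.
# Usage: make_dataset_layout <top_dir> <human|mouse> <SRA_Study_ID> <BioSample_ID> <SRA_Run_ID>...
make_dataset_layout() {
    local top_dir="$1" organism="$2" study="$3" biosample="$4"
    shift 4
    local study_dir="$top_dir/$organism/$study"
    local biosample_dir="$study_dir/BioSamples/$biosample"
    local run

    mkdir -p "$biosample_dir"
    : > "$biosample_dir/SRA_Runs"          # start with an empty SRA_Runs file
    for run in "$@"; do
        mkdir -p "$study_dir/$run"         # one directory per SRA Run
        printf '%s\n' "$run" >> "$biosample_dir/SRA_Runs"   # one SRA Run ID per line
    done
}

# Example with placeholder run IDs (hypothetical, for illustration only).
demo_top=$(mktemp -d)
make_dataset_layout "$demo_top" human SRP071928 SAMN04563763 SRR0000001 SRR0000002
find "$demo_top" | sort
```

The helper only creates empty directories and the SRA_Runs file; downloading and converting the data for each SRA Run is still done by the steps described later in this document.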

Tools/Programs and References Used in Analysis

The following are the tools/programs that are used in the analysis and their corresponding versions:

Tool/Program Version
Iso-Seq Analysis Iso-Seq2 from SMRT Link v5.1.0
Minimap2 v2.2
StringTie v1.3.2d
Scallop-LR v0.9.1
Gffcompare v0.9.9c
SQANTI v1.2
rnaQUAST v1.5.1
Transrate v1.0.3
GMAP version 2017-09-30
gffread
gtfcuff
bamkit
bioawk awk version 20110810

Download and compile these tools/programs, and make sure their binaries are in your $PATH: the scripts in this repository expect to find every binary there.

Note: the scripts here use Scallop-LR v0.9.1, whose binary is named scallop, the same name as the binary of the short-read assembler Scallop. Make sure the scallop found in your $PATH comes from Scallop-LR rather than from the short-read assembler: scallop --version should return isoseq-v0.9.1. Starting with Scallop-LR v0.9.2, the binary is renamed scallop-lr.
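A quick pre-flight check along these lines may save a failed run. It is a minimal sketch: the helper names missing_tools and is_scallop_lr are made up here, and the binary names passed to missing_tools are assumptions about how the tools are installed locally (only scallop and gmap_build are named in this document).

```shell
#!/usr/bin/env bash
# Print the name of every given binary that cannot be found in $PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Succeed only if the `scallop` found in $PATH reports the Scallop-LR v0.9.1
# version string (isoseq-v0.9.1) rather than a short-read Scallop version.
is_scallop_lr() {
    local version
    version=$(scallop --version 2>&1)
    case "$version" in
        *isoseq-v0.9.1*) return 0 ;;
        *) return 1 ;;
    esac
}

# Binary names below are assumptions; adjust them to your installation.
missing=$(missing_tools minimap2 stringtie scallop gffcompare gffread gtfcuff bioawk gmap gmap_build)
if [ -n "$missing" ]; then
    echo "Not found in \$PATH:"
    echo "$missing"
fi

if is_scallop_lr; then
    echo "scallop in \$PATH is Scallop-LR v0.9.1"
else
    echo "scallop in \$PATH is missing or is not Scallop-LR v0.9.1"
fi
```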

The following are the reference genomes, reference annotations, and reference transcriptomes (from Ensembl) that are used in the analysis:

Reference Human Mouse
reference genome GRCh38 GRCm38
reference annotation Homo_sapiens.GRCh38.90.gtf Mus_musculus.GRCm38.92.gtf
reference transcriptome Homo_sapiens.GRCh38.cdna.all.fa Mus_musculus.GRCm38.cdna.all.fa

In the scripts of this repository, please replace the locations of the reference genomes, reference annotations, reference transcriptomes, gene database, gmap db, and gmap reference sets (for Iso-Seq) with the actual locations on your system. The lines of code that need to be changed are marked with # REPLACE WITH YOUR ACTUAL PATH TO THE REFERENCE DATA.
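One way to find every line that still needs editing is to search a checkout of the repository for the marker. The following is a small sketch: the function name list_reference_placeholders is made up for this example, and the path in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# List every line (as file:line:text) that carries the placeholder marker.
# Usage: list_reference_placeholders [directory]   (defaults to the current directory)
list_reference_placeholders() {
    local marker='# REPLACE WITH YOUR ACTUAL PATH TO THE REFERENCE DATA'
    # -r: recurse, -n: show line numbers, -F: treat the marker as a fixed string
    grep -rnF -- "$marker" "${1:-.}"
}

# Example (placeholder path):
#   list_reference_placeholders /path/to/lrassemblyanalysis
```

grep exits with a non-zero status when no marked line is found, so the function can also be used in a condition to confirm that no placeholder is left.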

To prepare gmap db and gmap reference sets (for Iso-Seq), use the following command:

fasta-to-gmap-reference <reference-fasta-file> <output-dir> <name>

Here <reference-fasta-file> is the full path to the reference genome FASTA file, <output-dir> is the output directory for the gmap db and gmap reference set, and <name> is the sub-directory name for the output GmapReferenceSet XML file. The gmap reference set XML file gmapreferenceset.xml and gmap_db will be created under <output-dir>/<name>/. This command requires the gmap_build executable in $PATH.
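The command above can be wrapped with the two checks its description implies. This is a hedged sketch rather than part of the repository: the function name prepare_gmap_reference, the DRY_RUN switch, and all paths in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check prerequisites, then build the gmap db and gmap reference set.
# Usage: prepare_gmap_reference <reference-fasta-file> <output-dir> <name>
# With DRY_RUN=1 the command is only printed, not executed.
prepare_gmap_reference() {
    local fasta="$1" out_dir="$2" name="$3"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf 'fasta-to-gmap-reference %s %s %s\n' "$fasta" "$out_dir" "$name"
        return 0
    fi
    # fasta-to-gmap-reference needs gmap_build in $PATH
    command -v gmap_build >/dev/null 2>&1 || {
        echo "gmap_build not found in \$PATH" >&2
        return 1
    }
    [ -f "$fasta" ] || {
        echo "reference FASTA not found: $fasta" >&2
        return 1
    }
    fasta-to-gmap-reference "$fasta" "$out_dir" "$name"
}

# Example with placeholder paths; prints the command without running it.
DRY_RUN=1
prepare_gmap_reference /path/to/GRCh38.fa /path/to/gmap_refs GRCh38
```

With these placeholder arguments, the outputs would be expected under /path/to/gmap_refs/GRCh38/, following the <output-dir>/<name>/ rule stated above.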

Analyze a BioSample-based Dataset with Iso-Seq Analysis, Scallop-LR, and StringTie

  1. Pre-Process each SRA Run that belongs to this BioSample:

    1. Under the SRA Study directory, create a directory named by this SRA Run ID.

      In this SRA Run directory, download the hdf5 of this SRA Run from SRA and untar the hdf5.

    2. Copy run_bax2bam.sh from this repository to the current SRA Run directory.

      Edit run_bax2bam.sh according to the current SRA Run ID and movie (bax.h5 files).

      Run run_bax2bam.sh to convert bax.h5 files to subreads bam files:

      run_bax2bam.sh &
      
    3. Create bai index files for subreads bam files:

      create_bai.sh &
      
  2. Run Iso-Seq Analysis for this BioSample:

    1. Under the BioSamples/ directory of the SRA Study directory, create a directory named by this BioSample ID.

      In this BioSample directory, create the SRA_Runs file. It should list all the SRA Run IDs of this BioSample, one SRA Run ID per line.

    2. Create the merged dataset from all the SRA Runs of this BioSample and perform Iso-Seq full-analysis:

      biosample_isoseq.sh <BioSample_ID> <Top_dir> <Organism> &
      
    3. Post analysis, including Gffcompare and Transrate on final isoforms (after the Iso-Seq full-analysis is successfully completed):

      post_isoseq_analysis.sh <BioSample_ID> <Organism> <Install_dir> &
      
  3. Run Minimap2 + Scallop-LR for this BioSample, and run Gffcompare and Transrate on the predicted transcripts:

    minimap2_scallop_isoseq_allreads_pipeline.sh <Full_path_run_dir> <Organism> <Install_dir> &
    
  4. Run Minimap2 + StringTie for this BioSample, and run Gffcompare and Transrate on the predicted transcripts:

    minimap2_stringtie_1_allreads_pipeline.sh <Full_path_run_dir> <Organism> <Install_dir> &
    
  5. Run rnaQUAST on Scallop-LR transcripts, StringTie transcripts, and Iso-Seq isoforms for this BioSample:

    rnaQUAST_pipeline.sh <Full_path_run_dir> <Organism> <Merge_dir> &
    
  6. Run SQANTI on Scallop-LR transcripts and Iso-Seq isoforms for this BioSample:

    sqanti_pipeline.sh <Full_path_run_dir> <Organism> <Merge_dir> &
    

The command-line arguments of each script are described at the beginning of the script, and are also displayed when the script is run without any command-line argument. The results of Scallop-LR and StringTie are located under the auto-generated ccs_flnc_and_nfl/minimap2/ directory. The results of Iso-Seq Analysis are inside the auto-generated final_collapsed_isoforms/ directory.
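Steps 2 to 6 above can be chained so that each step starts only after the previous one has finished, instead of launching each command in the background with & and checking on it by hand. The wrapper below is a sketch under assumptions: the function names run_step and analyze_biosample and the DRY_RUN switch are made up here, the chain assumes each script exits with a non-zero status on failure (which this document does not state), and the example argument values, including human as the <Organism> value, are placeholders. Check each script's own usage text for the real argument meanings.

```shell
#!/usr/bin/env bash
# Print a command and run it, unless DRY_RUN=1 (then only print it).
run_step() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = 1 ] || "$@"
}

# Run the per-BioSample steps in order, stopping at the first failing step.
# The arguments are forwarded unchanged to the repository scripts.
analyze_biosample() {
    local biosample="$1" top_dir="$2" organism="$3"
    local install_dir="$4" run_dir="$5" merge_dir="$6"

    run_step biosample_isoseq.sh "$biosample" "$top_dir" "$organism" &&
    run_step post_isoseq_analysis.sh "$biosample" "$organism" "$install_dir" &&
    run_step minimap2_scallop_isoseq_allreads_pipeline.sh "$run_dir" "$organism" "$install_dir" &&
    run_step minimap2_stringtie_1_allreads_pipeline.sh "$run_dir" "$organism" "$install_dir" &&
    run_step rnaQUAST_pipeline.sh "$run_dir" "$organism" "$merge_dir" &&
    run_step sqanti_pipeline.sh "$run_dir" "$organism" "$merge_dir"
}

# Example with placeholder arguments; DRY_RUN=1 prints the six commands only.
DRY_RUN=1
analyze_biosample SAMN04563763 /path/to/top_dir human /path/to/install_dir /path/to/run_dir /path/to/merge_dir
```

Preprocessing of the individual SRA Runs (step 1) is left out because run_bax2bam.sh has to be edited by hand for each run.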

Note: the scripts for Iso-Seq Analysis are for SMRT Link v5.1.0 and are not compatible with later SMRT Link versions.

About

License: BSD 3-Clause "New" or "Revised" License

