RyanYip-Kat / FLEPSeq

The analysis pipeline for FLEP-seq

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

FLEPSeq

The analysis pipeline for FLEP-seq

This toolset can be used to analyze the sequencing data generated either by Nanopore or by PacBio using the FLEP-seq method.

The example scripts for ploting: https://nbviewer.jupyter.org/github/ZhaiLab-SUSTech/FLEPSeq/blob/master/script/FLEP.figure_example.ipynb

Reference: https://www.nature.com/articles/s41477-020-0688-1

Jia, J., Long, Y., Zhang, H. et al. Post-transcriptional splicing of nascent RNA contributes to widespread intron retention in plants. Nat. Plants 6, 780–788 (2020). https://doi.org/10.1038/s41477-020-0688-1

Software and package requirements

Step by step workflow

All scripts are deposied in scripts directory, and can use --help to view the help information.

Nanopore data analysis pipeline

  1. Nanopore basecalling

You can use MinKNOW to perform real-time basecalling while sequencing, or use the GPU version of Guppy to speed up basecalling after sequencing. Both MinKNOW and Guppy are available via Nanopore community site (https://community.nanoporetech.com). Command-line for running Guppy basecalling is as follow:

$ guppy_basecaller -i raw_fast5_dir -s out_fast5_dir -c dna_r9.4.1_450bps_hac.cfg --recursive --fast5_out --disable_pings --qscore_filtering --device "cuda:all:100%"
  1. Convert FASTQ files to FASTA format
$ python fastqdir2fasta.py --indir path/to/fastq_pass --out all.fasta
  1. Use minimap2 to map reads to reference genome
$ minimap2 -ax splice --secondary=no genome.fasta all.fasta > tmp.sam

CAUTION: For the organisms with short introns, such as Arabidopsis, it might be better to use the parameter “-G” to set the max intron length, for example, “-G 12000”. You also can set “-t number_of_threads” to use more threads to speed up.

$ samtools sort -o mapped.bam tmp.sam
$ samtools index mapped.bam
$ rm tmp.sam
  1. (Optional) Remove rRNA and tRNA derived reads
$ python filter_rRNA_bam.py --inbam mapped.bam --inbed rRNAtRNAetc.bed --out clean.bam
$ samtools index clean.bam
  1. Find 3’ adapter in reads
$ python adapterFinder.py --inbam clean.bam --inseq all.fasta --out adapter.result.txt --threads 36
  1. Identify polyA tail and estimate its length
$ python PolyACaller.py --inadapter adapter.result.txt --summary sequencing_summary.txt --fast5dir fast5_dir --out polyA_tail.result.txt --threads 36
  1. Extract read information

This pipeline will produce a table containing intron retention information and Pol II position.

$ python extract_read_info.py --inbam clean.bam --inbed lib/exon_intron_pos.repr.bed --out read_info.result.txt
  1. Merge the above analysis results
$ Rscript merge_read_info.R --type Nanopore --inreadinfo read_info.result.txt --inadapter adapter.result.txt --inpolya polyA_tail.result.txt --out read.info.txt
  1. Analyze splicing kinetics
$ python prepare_data_for_splice_kinetics.py --inreadinfo read.info.txt --inbed lib/exon_intron_pos.repr.bed --out read.intron.pos.splicing.txt
$ Rscript plot_intron_splicing_kinetics.R --inrelpos read.intron.pos.splicing.txt --inreadinfo read.info.txt --inintron lib/select_introns.txt --out read.splicing_kinetics.txt --pdf read.splicing_kinetics.pdf 
  1. Calculate intron retention ratio of polyadenylated transcripts
$ Rscript cal_polya_transcript_ir.R --inrelpos read.intron.pos.splicing.txt --inreadinfo read.info.txt --outrna mRNA.incompletely_spliced_ratio.txt --outintron intron.unspliced_ratio.txt

PacBio data analysis pipeline

  1. Generate highly accurate single-molecule consensus reads using ccs

It is the most time-consuming step in PacBio data analysis. For the data we used (~12M subreads), it requires ~30 h to generate consensus sequences (Hifi-reads) on Intel Xeon 6140 CPU at 2.3GHz.

$ ccs --num-threads 36 --min-rq 0.9 --report-file ccs.report input.subreads.bam ccs.bam
  1. Remove adapter using lima
$ echo '>primer_3p\nAAGCAGTGGTATCAACGCAGAGTACATTGATGGTGCCTACAG\n>primer_5p\nAAGCAGTGGTATCAACGCAGAGTACATGGG\n' > primer.fasta
$ lima -j 36 ccs.bam primer.fasta lima.bam --isoseq --peek-guess
$ python lima_bam2fasta.py --infile lima.primer_3p--primer_5p.bam --out all.fasta
  1. Use minimap2 to map reads to reference genome, and remove rRNA and tRNA derived reads.

The same as Step 3-4 in Nanopore pipeline.

  1. Identify polyA tail and estimate its length
$ python pacbio_find_polyA.py --inbam clean.bam --inseq all.fasta --out polyA_tail.result.txt
  1. Extract read information.

The same as Step 7 in Nanopore pipeline.

  1. Merge the above analysis results
$ Rscript merge_read_info.R --type PacBio --inreadinfo read_info.result.txt --inpolya polyA_tail.result.txt --out read.info.txt
  1. Analyze splicing kinetics, and calculate intron retention ratio of polyadenylated transcripts.

The same as Step 9-10 in Nanopore pipeline

Illumina data analysis pipeline

  1. Use hisat2 to map reads to reference genome.
$ hisat2 -x /lib_path/ath_hisat2_genome -p 20 --min-intronlen 20 --max-intronlen 12000  --dta  --time  -1 R1.fastq.gz -2 R2.fastq.gz  -S hisat2.sam
samtools sort -o hisat2.bam hisat2.sam
samtools index hisat2.bam
rm hisat2.sam
  1. Use picard to remove PCR duplication reads.
$ java -jar /soft_path/picard.jar MarkDuplicates REMOVE_DUPLICATES=true SORTING_COLLECTION_SIZE_RATIO=0.01 I=hisat2.bam O=hisat2.markdump.bam M=hisat2.markdump.txt
samtools index hisat2.markdump.bam
  1. Calculate the unspliced and spliced ratio of introns.
$ python ASCaller.py -i hisat2.markdump.bam -o irratio.txt --file_intron_pos lib/intron_pos.repr.txt --strand_flag 0 --min_overlap 6

For strand-specific library, you can set “--strand_flag 1” when the right-most end of the fragment is the first sequenced and set “--strand_flag 2” when the left-most end of fragment is the first sequenced.

Output

The most important output is read.info.txt generated by merge_read_info.R. It is a plain file seprated by Tab including these important columns:

read_core_id, mRNA, retention_introns, polya_length, type.

type value:

elongating
splicing_intermediate
elongating_3_mapping_low_accuracy
elongating_5lost
polya
polya_3_not_in_last_exon
polya_5lost

PS: All columns in the read.info.txt file (most of columns are only used for script developing)

read_core_id            de4d183f-9187-43e5-ab03-56ce63e7de88,chr1,7030,8176

#generate by extract_read_info.py

chr                     chr1         #chromatin name 
read_start              7030         #alignment start 1-based
read_end                8176         #alignment end
read_strand             +            #alignment strand
read_exon_total_num     4            #total number of read exon
mRNA                    AT1G01020.1  #mRNA name       
mRNA_intron_num         8            #total intron number of mRNA
mRNA_start              6788                
mRNA_end                9130                
mRNA_strand             -                   
mRNA_length             2343                
mRNA_pos5               954          #the 5' end position of read relative to the 5' end of mRNA (0-based) 
mRNA_pos3               -242         #the 3' end poistion of read relative to the 3' end of mRNA (0-based)
rel_mRNA_pos5           954          #see below. indicate wrong annotation of mapping if rel_mRNA_pos5 != mRNA_pos5
rel_mRNA_pos3           -242         #see below. indicate wrong annotation of mapping if rel_mRNA_pos3 != mRNA_pos3
total_coverage          797          #total coverage length with mRNA
f_read_exon_num         4            #see below
f_feature_type          intron
f_feature_num           3
f_feature_length        248
f_pos5                  59
f_pos3                  226
l_read_exon_num         1
l_feature_type          exon
l_feature_num           9
l_feature_length        282
l_pos5                  0
l_pos3                  -242
end_type                2            #see below
retention_introns       3:4:6        #retained intron, seprated by ":", the intron number is ordered from 5' mRNA to 3' mRNA
retention_intron_num    3            #the total number of retained intron
span_intron_num         6            #the total number of introns the read overlapping (3-8). The first span intron is f_feature_num, the last span intron is l_feature_num - 1.

#generate by PolyACaller.py or pacbio_find_polyA.py
#polya_start_raw, polya_end_raw, polya_score is only generated for Nanopore data
polya_start_raw     	1185         #The raw signal event index of potential polyA region
polya_end_raw       	1924	     #The raw signal event index of potential polyA region
polya_start_base    	73
polya_end_base      	77
polya_length        	83.34
polya_score         	740
polya_start_raw						
polya_end_raw

#generate by adapterFinder.py (only for Nanopore data)
read_align_strand         +            
rna_strand                -            
read_length               839          # read total length
primer_type	              R-F          
genome_align_start        71
genome_align_end          820
primer_score              1
polyA_type                T            # not shown for now
f_primer_type             R
f_primer_start            1
f_align_end               70
r_primer_type             F
r_primer_start            1
r_align_start             821

#generated by merge_read_info.R
end3_alignment_score      -1           # only for Nanopore data
end_polyA_type			  3	
end5ss_type				  0
type					  elongating

some informations used for script developing:

  1. read_core_id

Due to that one read may be mapped to multiple positions, we merge read id and mapping information to represent each alignment:

read_id:      7fab6cbf-a9ed-4427-951f-741515ddba0b
read_core_id: 7fab6cbf-a9ed-4427-951f-741515ddba0b,chr1,7026,8687

Note: Logically, the read_core_id may be also not uniquely, thus need to remove duplication. (At merge step)

  1. read exon

The exon information of each read from bam file based on the 6th column (CIGAR string). Only consider N (represent intron junction) in CIGAR string, not consider indel. Split read to exon by junction.

exon1   intron1    exon2   intron2     exon3   intron3    exon4
>>>>> - - - - - - >>>>>>> - - - - - - >>>>>>>> - - - - - >>>>>>>> gene  (>>>>> for exon, - - - - for intron)
    >>>>>>>>>>>>>>>>>>>>>- - - - - -  >>>>>>>>>>>>>>>>>           read  (CIGAR: 110M60N70M)
      read_exon1(110bp)      (60bp)    read_exon2(70bp)
  1. mRNA_pos5, mRNA_pos3, rel_mRNA_pos5, and rel_mRNA_pos3

We don't care about the alignment direction of read, because we haven't identify the 3' adapter position, thus we don't know the orignal mRNA strand of this read.

gene (length: l)  ----->>>>>>>>>>>>>>>>>>>>>>---------      >>>> repesent gene
                  |    |   |              | |    |
                 -4    0   4             -2 0    5
                  (mRNA pos5)            (mRNA pos3)

    gene          -----<<<<<<<<<<<<<<<<<<<<<<---------
                       |                    |
                  mRNA pos3             mRNA_pos5

rel_mRNA_pos3, rel_mRNA_pos5:

 >>>> exon,  .... intron
mRNA            >>>>>>>>>>>.......>>>>>>>>>         
read >>>......>>>>>>>>>>>>>.......>>>>>>>>>>>>......>>>>>>>>>>
     |        |                              |               |
     |        |                              |               |
  mRNA_pos5   |                              |          mRNA_pos3
          rel_mRNA_pos5                 rel_mRNA_pos3

The inconsistent results (<1% reads in Arabidopsis.) between rel_mRNA_pos and mRNA_pos indicate the wrong mapping or wrong transcription annotation, this pipeline only consider the represent transcripts, it should be removed (in merge step). This is mainly due to alternative splicing or mismapping of polyA You can try set less max intron length during mapping step, but it is not suitable for genome with long introns. And also we can remove the two side mapping read exon based on polyA information.

  1. f_feature, l_feature
                  exon1        intron1          exon2        intron2       exon3
mRNA           >>>>>>>>>>>>...............>>>>>>>>>>>>>>>...............>>>>>>>>>>>
read                >>>>>>>...............>>>>>>>>>>>>>>>>>>>>>>>

f_feature: the 5' end feature overlapping with this read, in this example, it is the first exon: exon1, thus f_feature_type is "exon", f_feature_num is 1, it is the first read exon of the read overlapping with it, thus f_read_exon_num is 1. f_pos5 and f_pos3 is the position of the read exon 1 relative the exon1, the value is like mRNA_pos5 and mRNA_pos3.

l_feature: the 3' end feature overlapping with this read, in this example, it is the second intron: intron2, thus l_feature_type is "intron", l_feature_num is 2, it is the second read exon of the read overlapping with it, thus l_read_exon_num is 2.

  1. intron retention

for each intron overlapping with this read:

(a) if it is the f_feature: it would be retented. but this may be affected by 3' SS.

(b) if it is the l_feature: the transcription hasn't been finished. not retented.

(c) in the inner: if coverage_ratio >= 0.8: retention.

  1. end_type:
mRNA                >>>>>>>>>>>.......>>>>>>>>>.........>>>>>>>>>>      
read  end_type: 0                  >>>>>>>>>>>>>>>
      end_type: 1     >>>>>>>>>>>>>>>>>>>>>>
      end_type: 2                  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      end_type: 3   >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

    #end5: read overlapped with the the first exon
    #end3: read overlapped with the the last exon
    #	     noend5 end5
    #noend3      0    1
    #end3        2    3
  1. primer/adapter
primer_5p = "AAGCAGTGGTATCAACGCAGAGTACATGGG"
primer_3p = "AAGCAGTGGTATCAACGCAGAGTACATTGATGGTGCCTACAG"

primer_type: F means primer_5p, R means primer_3p

f_ indicate the 5' read, r_ indicate the 3' read

Note: f may be F or R, r may be F or R.

genome 5’--------------------------------------------------------3'  
read1      f_primer--------------------------->r_primer
                   ||                          ||
                   1|                          |2
                    3                          4

\b
read2      r_primer<---------------------------f_primer
                   ||                          ||
                   2|                          |1
                    4                          3    
						
1: f_align_end, 2: r_align_start, 3: genome_align_start, 4: genome_align_end 

coordinate of read: f_align_end, r_align_start, genome_align_start, genome_align_end

coordinate of adapter: f_primer_start, r_primer_start

if r_primer_start if 4, the first two base of r_primer is not mapped to read

for example:
                                             AAGGGAGAGAGAG
                                                ||||||||||
read1      f_primer--------------------------->GGAGAGAGAG

one end primer type

(a) no alignment ["N", ["", "", -1, -1, -1, -1, ""]]

(b) If "F" primer, if primer_start > primer_f_max_start: UF

(c) if "R" primer, if primer_start > primer_r_max_start2: UUR

if primer_r_max_start2 >= primer_start > primer_r_max_start1: UR

read adapter type based on the primer types of both ends. for example: F-R indicate the 5' primer is F, the 3' primer is F. Each kind of primer type is assigned a score, and also can predict the rna_strand by primer type, The score 1 or 2 is reliable.

  1. end_type and end_polyA_type:
end_type:
    mRNA                >>>>>>>>>>>.......>>>>>>>>>.........>>>>>>>>>>      
    read  end_type: 0                  >>>>>>>>>>>>>>>
          end_type: 1     >>>>>>>>>>>>>>>>>>>>>>
          end_type: 2                  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
          end_type: 3   >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

        #end5: read overlapped with the the first exon
        #end3: read overlapped with the the last exon
        #	     noend5 end5
        #noend3      0    1
        #end3        2    3
polyA_type:
      0:  no polyA. polyA_length < 15  
 	   1:  has polyA. polyA_length >= 15
end_polyA_type:
	              end_type=0    end_type=1    end_type=2    end_type=3
    polyA_type = 0         0             1             2             3
    polyA_type = 1         4             5             6             7

generated by IRCaller.py

Output: tab-sperated file. including clolumns: intron_id AT1G01020.1_intron4 chr_name chr1 intron_start 7836 # 1-based intron_end 7941 intron_strand - intron_coverage_ratio 1 a 126 b 231 ab 103 c 215 o 20 t 695 iratio 0.545014521 # unspliced ratio of intron sratio 0.416263311 # spliced ratio of intron oratio 0.038722168 # the ratio of alternative splicing o1ratio 0.013552759 o2ratio 0.011616651 o1_type o||1|:7829-7941 o1_count 7 o2_type o|1|0|:7902-8235 o2_count 6 other_o o|1|0|:7907-8235=3;o|1|1|:7650-8235=2;o||1|:7832-7941=1;o|1||:7836-8235=1

a, b, ab, c, o columns store the count of read with specific read type: --------...............--------- gene structure --...............--- c -----............--- o #alternative splicing -------- a ---------- b ------------------------- ab ------- i

t = a + b + c + o + ab iratio = (a + b + 2ab)/(a + b + 2ab + 2o + 2c) sratio = 2c/(a + b + 2ab + 2o + 2c) oratio = 2o/(a + b + 2ab + 2o + 2c)

o: alternative splicing The format of alternative splicing type: o|x1|x2|x3:start1-end1:< start2-end2 > x1, x2 can be one value of "" or "0" or "1" "" means splicing at the annotated splicing site "0" means splicing downstream the 5' splicing site or upstream the 3' splicing site "1" means splicing upstream the 3' splicing site or downstream the 5' splicing site x1 for 5' splicing site x2 for 3' splicing site x3 for exon skipping, can be one value of "" or "e1" or "e2" or ... x3 only care about the exon skipping occuring inner the annotated intron. the nubmer after "e" means the number of skipped exons. start1-end1 means the junction position of the read overlapped with the annotated intron. If one read has multiple junction positions, they are sperated by ":". Considering that some introns have multiple alternative splicing types. The most frequent type is marked as o1, the second frequent type is marked as o2. Their type, count and ratio are recored in o1ratio, o2_ratio, o1_count, o2_count, o1_type, o2_count columns. The inforamtion of all other types are recored in other_o column in type3=count3;type4=cout4 format.

Some exampels of alternative splicing type:

--------...............--------- gene structure ------........---------- o|0|0| 5' inner, 3' inner ------------.....--------- o|0|| 5' inner, 3' right ------------.........---- o|0|1| 5' inner, 3' outer ----......................---- o|1|1| 5' outer, 3' outer

--------...............--------- gene structure ------.....-------------- o||0| 5' right, 3' inner ------.....-----.....------ o|||e1 exon skipping ------.....-----.......----- o||1|e1 5' right, 3' inner, exon skipping ------.....----- o||0|

---........------------- o|1|0| 5' outer, 3' inner ---........-----.....------ o|1||e1 ---........-----.......----- o|1|1|e1 5' outer, 3' outer, exon skipping ---........----- o|1|0|

         -----.....------    o|0|| 
         -----........---    o|0|1|                        

About

The analysis pipeline for FLEP-seq


Languages

Language:Jupyter Notebook 84.6%Language:Python 14.2%Language:R 1.2%