HISAT2
StringTie
LncFinder
CPAT
Diamond
git clone https://github.com/DaehwanKimLab/hisat2.git
cd hisat2
make
wget https://github.com/gpertea/stringtie/releases/download/v2.1.4/stringtie-2.1.4.Linux_x86_64.tar.gz
tar xzf stringtie-2.1.4.Linux_x86_64.tar.gz
export PATH=$PATH:/path/stringtie-2.1.6.Linux_x86_64/
R Package
install.packages("LncFinder")
install.packages("seqinr")
CPAT (version 1.2.4)
conda create -n py27 python=2.7 -y
source activate py27
pip2 install CPAT
conda install -c bioconda diamond
git clone https://github.com/tderrien/FEELnc.git
export FEELNCPATH=/path/FEELnc/bin/
export PERL5LIB=$PERL5LIB:/path/FEELnc/lib/
export PATH=$PATH:/path/FEELnc/scripts/
hisat2-build -p 8 genome.fasta genome.index
cat sample.list
SRR3087771
SRR3087772
SRR3087773
SRR3087774
...
If the RNA-seq library is strand-specific (fastq files are clean files filtered by fastp)
for i in `cat sample.list`; do hisat2 --new-summary --rna-strandness RF -p 10 -x genome.index -1 ${i}_1.fastq -2 ${i}_2.fastq -S ${i}.sam; done
If the RNA-seq library is not strand-specific
for i in `cat sample.list`; do hisat2 --new-summary -p 10 -x genome.index -1 ${i}_1.fastq -2 ${i}_2.fastq -S ${i}.sam; done
for i in `cat sample.list`; do samtools sort -o ${i}.bam ${i}.sam; done
for i in `cat sample.list`; do stringtie -p 10 --rf -G genome.gtf -o ${i}.gtf ${i}.bam; done
stringtie --merge -o candidate_transcript.gtf -G genome.gtf SRR*.gtf
gffread -w candidate_transcript.fasta -g genome.fasta candidate_transcript.gtf
grep '>' candidate_transcript.fasta | awk '{print $1}' | sed 's/>//g' | sort -u > candidate_transcript.txt
FEELnc_filter.pl -i candidate_transcript.gtf -a genome.gtf --monoex=-1 -s 200 -p 20 > candidate_lncRNA.gtf
cut -d ";" -f 2 candidate_lncRNA.gtf |sed 's/ transcript_id //g' | sed 's/"//g' | sort -u > candidate_lncRNA.txt
R Package
R
library(LncFinder)
library(seqinr)
import training data
mRNA <- seqinr::read.fasta(file ="example_data/training_mRNA.fasta")
lncRNA <- seqinr::read.fasta(file ="example_data/training_lncRNA.fasta")
Use "make_frequencies" function to generate the feature file.
frequencies <- make_frequencies(cds.seq = mRNA, lncRNA.seq = lncRNA, SS.features = FALSE, cds.format = "DNA", lnc.format = "DNA", check.cds = TRUE, ignore.illegal = TRUE)
loading the model
plant = readRDS("./Model/Plant_model.rda")
Identification of lncRNA
Seqs <- seqinr::read.fasta(file ="candidate_transcript.fasta")
Plant_results <- LncFinder::lnc_finder(Seqs, SS.features = FALSE, format = "DNA", frequencies.file = frequencies, svm.model = plant, parallel.cores = 2)
Export results
write.table(Plant_results, file ="plant-lncFinder.txt", sep ="\t",row.names =TRUE, col.names =TRUE,quote =FALSE)
The coding probability (CP) cutoff: 0.46 (CP >=0.46 indicates coding sequence, CP < 0.46 indicates noncoding sequence).
source activate py27
cpat.py -x ./Model/Plant_Hexamer.tsv -d ./Model/Plant.logit.RData -g candidate_transcript.fasta -o CPAT_plant.output
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz
diamond makedb --in uniprot_sprot.fasta -d uniprot_out
diamond blastx -d uniprot_out -q candidate_transcript.fasta -o uniprotoutput.txt
5.5. By intersecting the results obtained from the aforementioned steps, a set of high-confidence lncRNAs were obtained.
The id of the lncRNA
Rscript insersection.sh candidate_lncRNA.txt candidate_lncRNA.txt CPAT_plant.output plant-lncFinder.txt uniprotoutput.txt
The lncRNA gtf file
grep -Fwf final_lncRNA_results.txt candidate_transcript.gtf > lncRNA.gtf
Classification result file
FEELnc_classifier.pl -i lncRNA.gtf -a genome.gtf > lncRNA_classes.txt
Antisense_exonic-lncRNA
awk -F '\t' '{if($1==1 && $6 == "antisense" && $10 == "exonic") {print $0}}' lncRNA_classes.txt > LncRNA_antisense_exonic.txt
Intronic-lncRNA
awk -F '\t' '{if($1==1 && $6 == "sense" && $10 == "intronic") {print $0}}' lncRNA_classes.txt > LncRNA_intronic.txt
Upstream-lncRNA
awk -F '\t' '{if($1==1 && $6 == "sense" && $7 == "intergenic" && $8 <= 2000 && $10 == "upstream") {print $0}}' lncRNA_classes.txt > LncRNA_upstream.txt
Downstream-lncRNA
awk -F '\t' '{if($1==1 && $6 == "sense" && $7 == "intergenic" && $8 <= 2000 && $10 == "downstream") {print $0}}' lncRNA_classes.txt > LncRNA_downstream.txt
Intergenic-lncRNA
awk -F '\t' '{if($1==1 && $7 == "intergenic" && $8 > 2000) {print $0}}' lncRNA_classes.txt > LncRNA_intergenic.txt
Bidirectional-lncRNA
awk -F '\t' '{if($1==1 && $6 == "antisense" && $7 == "intergenic" && $8 <= 2000 && $10 == "upstream") {print $0}}' lncRNA_classes.txt > LncRNA_Bidirectional.txt
cat TE.bed
Chr1 15827287 15838845 LTR_Gypsy
Chr1 13085455 13085593 LTR_Copia
Chr1 11181821 11181959 LTR_Copia
Chr1 20699111 20699248 LTR_Copia
...
cat LncRNA.bed
Chr1 11171031 11171031 MSTRG.2781.1
Chr1 12199350 12199350 MSTRG.2973.1
Chr1 13466928 13466928 MSTRG.3115.1
Chr1 13838536 13838536 MSTRG.3127.1
...
TE-lncRNA
bedtools intersect -a lncRNA.bed -b TE.bed -wo | sort -u | awk '{print $1,$2,$3,$4,$6,$7,$8,$9}' | sed 's/ /\t/g' | sed '1iChr\tLncRNA_start\tLncRNA_end\tLncRNA_ID\tTE_start\tTE_end\tTE_ID\tOverlap' > TE_lncRNA_intersect.txt
If you use Plant-LncPipe, please cite:
Xue-Chan Tian, Zhao-Yang Chen, Shuai Nie, Tian-Le Shi, Xue-Mei Yan, Yu-Tao Bao, Zhi-Chao Li, Hai-Yao Ma, Kai-Hua Jia, Wei Zhao, Jian-Feng Mao, Plant-LncPipe: a computational pipeline providing significant improvement in plant lncRNA identification, Horticulture Research, 2024;, uhae041, https://doi.org/10.1093/hr/uhae041