De Novo Transcriptome Assembly

Biocore's de novo transcriptome assembly workflow based on Nextflow

Installation

Run sh INSTALL.sh. The script checks that Nextflow is in your path and that Singularity is present, then downloads the BioNextflow library and information about the tools used.

You need either Singularity or Docker to launch the pipeline.
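
Before running the installer, the prerequisites listed above can be checked by hand. The following is a minimal sketch, not part of INSTALL.sh; the function names have_tool and check_prereqs are hypothetical and only illustrate the check.

```shell
#!/bin/sh
# Hypothetical helpers, not shipped with the pipeline.

# Return 0 if the named executable is on PATH, non-zero otherwise.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Nextflow is required; either Singularity or Docker is enough.
check_prereqs() {
    if ! have_tool nextflow; then
        echo "nextflow not found in PATH" >&2
        return 1
    fi
    if have_tool singularity || have_tool docker; then
        echo "prerequisites found"
    else
        echo "neither singularity nor docker found in PATH" >&2
        return 1
    fi
}
```

Calling check_prereqs && sh INSTALL.sh runs the installer only when the tools are available.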

Nextflow version

NXF_VER=0.29.0 nextflow run

Running the pipelines

You can run each pipeline with this command:

NXF_VER=0.29.0 nextflow run NAME OF THE PIPELINE -bg > log.txt

For example:

NXF_VER=0.29.0 nextflow run denovo_assembly.nf -bg > log.txt

You can change the parameters by editing the params.config file, or override a particular pipeline parameter on the command line by prefixing its name with two dashes (--):

NXF_VER=0.29.0 nextflow run denovo_assembly.nf -bg --output ./myoutput > log.txt
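
In these commands, options with a single dash (such as -bg) are read by Nextflow itself, while options with two dashes (such as --output) are passed to the pipeline as parameters. The sketch below assembles such a command line; launch_cmd and NXF_PINNED are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the launch command for a pipeline script.
NXF_PINNED=0.29.0   # Nextflow version pinned in this README

launch_cmd() {
    pipeline=$1
    shift
    # -bg is a Nextflow option (single dash).
    cmd="NXF_VER=$NXF_PINNED nextflow run $pipeline -bg"
    # Remaining arguments are pipeline parameters (double dash),
    # for example: --output ./myoutput
    if [ "$#" -gt 0 ]; then
        cmd="$cmd $*"
    fi
    echo "$cmd"
}
```

For instance, launch_cmd denovo_assembly.nf --output ./myoutput prints the same command as the example above, without the redirection to log.txt.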

Module denovo_assembly

This module performs de novo assembly and retrieves both predicted transcripts and proteins.

╔╗ ┬┌─┐┌─┐┌─┐┬─┐┌─┐╔═╗╦═╗╔═╗  ╔╦╗┬─┐┌─┐┌┐┌┌─┐┌─┐┬─┐┬┌─┐┌┬┐┌─┐┌┬┐┌─┐  ╔═╗┌─┐┌─┐┌─┐┌┬┐┌┐ ┬ ┬ ┬
╠╩╗││ ││  │ │├┬┘├┤ ║  ╠╦╝║ ╦   ║ ├┬┘├─┤│││└─┐│  ├┬┘│├─┘ │ │ ││││├┤   ╠═╣└─┐└─┐├┤ │││├┴┐│ └┬┘
╚═╝┴└─┘└─┘└─┘┴└─└─┘╚═╝╩╚═╚═╝   ╩ ┴└─┴ ┴┘└┘└─┘└─┘┴└─┴┴   ┴ └─┘┴ ┴└─┘  ╩ ╩└─┘└─┘└─┘┴ ┴└─┘┴─┘┴ 
                                                                                
====================================================
BIOCORE@CRG Transcriptome Assembly - N F  ~  version 0.1
====================================================
pairs                               : ../test_data/*_{1,2}.fq.gz
email                               : YOUREMAIL@YOURDOMAIN
minsize (after filtering)           : 70
genetic code                        : Universal
strandness                          : RF
output (output folder)              : output
minProtSize (minimum protein sized) : 100
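
The pairs parameter is a glob that expects two files per sample, ending in _1.fq.gz and _2.fq.gz. A quick way to check a data folder before launching is sketched below. The function list_pairs is a hypothetical helper, and how the pipeline itself groups the files is not shown here.

```shell
#!/bin/sh
# Hypothetical helper: print the sample prefix for every *_1.fq.gz file
# that has a matching *_2.fq.gz; report unpaired files on stderr.
list_pairs() {
    dir=$1
    for r1 in "$dir"/*_1.fq.gz; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$r1" ] || continue
        prefix=${r1%_1.fq.gz}
        if [ -e "${prefix}_2.fq.gz" ]; then
            basename "$prefix"
        else
            echo "missing mate for $r1" >&2
        fi
    done
}
```

Running list_pairs ../test_data lists the samples that would match the default glob shown above.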

Module RABT_assembly

This module performs reference annotation based transcript (RABT) assembly and retrieves both predicted transcripts and proteins.

╔╗ ┬┌─┐┌─┐┌─┐┬─┐┌─┐╔═╗╦═╗╔═╗  ╔╦╗┬─┐┌─┐┌┐┌┌─┐┌─┐┬─┐┬┌─┐┌┬┐┌─┐┌┬┐┌─┐  ╔═╗┌─┐┌─┐┌─┐┌┬┐┌┐ ┬ ┬ ┬
╠╩╗││ ││  │ │├┬┘├┤ ║  ╠╦╝║ ╦   ║ ├┬┘├─┤│││└─┐│  ├┬┘│├─┘ │ │ ││││├┤   ╠═╣└─┐└─┐├┤ │││├┴┐│ └┬┘
╚═╝┴└─┘└─┘└─┘┴└─└─┘╚═╝╩╚═╚═╝   ╩ ┴└─┴ ┴┘└┘└─┘└─┘┴└─┴┴   ┴ └─┘┴ ┴└─┘  ╩ ╩└─┘└─┘└─┘┴ ┴└─┘┴─┘┴ 
                                                                                
====================================================
BIOCORE@CRG Transcriptome Assembly - N F  ~  version 0.1
====================================================
pairs                               : ../test_data2/*_{1,2}.fq.gz
genome                              : ../anno/GRCh38.p12.genome.fa.gz
annotation                          : ../anno/gencode.v30.annotation.gtf
minsize (after filtering)           : 40
genetic code                        : Universal
output (output folder)              : output
minProtSize (minimum protein sized) : 100
strandness                          : RF
maxIntron                           : 10000
email                               : YOUREMAIL@YOURDOMAIN

Module annotation

This module annotates the predicted proteins and transcripts produced by either of the two assembly modules described above.

╔╗ ┬┌─┐┌─┐┌─┐┬─┐┌─┐╔═╗╦═╗╔═╗  ╔╦╗┬─┐┌─┐┌┐┌┌─┐┌─┐┬─┐┬┌─┐┌┬┐┌─┐┌┬┐┌─┐  ╔═╗┌─┐┌─┐┌─┐┌┬┐┌┐ ┬ ┬ ┬
╠╩╗││ ││  │ │├┬┘├┤ ║  ╠╦╝║ ╦   ║ ├┬┘├─┤│││└─┐│  ├┬┘│├─┘ │ │ ││││├┤   ╠═╣└─┐└─┐├┤ │││├┴┐│ └┬┘
╚═╝┴└─┘└─┘└─┘┴└─└─┘╚═╝╩╚═╚═╝   ╩ ┴└─┴ ┴┘└┘└─┘└─┘┴└─┴┴   ┴ └─┘┴ ┴└─┘  ╩ ╩└─┘└─┘└─┘┴ ┴└─┘┴─┘┴ 
                                                                                
====================================================
BIOCORE@CRG Transcriptome Annotation - N F  ~  version 0.1
====================================================
peptide sequences                   : ../assembly/output/Assembly/longest_orfs.pep
cds sequences                       : ../assembly/output/Assembly/longest_orfs.cds
annotation in gff3                  : ../assembly/output/Assembly/longest_orfs.gff3
transcripts                         : ../assembly/output/Assembly/Trinity.fasta
email                               : YOUREMAIL@YOURDOMAIN
genetic code                        : Universal
output (output folder)              : output
diamondDB (uniprot or uniRef90)     : /nfs/db/uniprot/2018_10/knowledgebase/complete/blast/db/uniprot_sprot.fasta
pfamDB (pfam database path)         : /nfs/db/pfam/Pfam31.0/Pfam-A.hmm
minProtSize (minimum protein sized) : 100
batch_diam                          : 5000
batch_pfam                          : 2000

Module quantify

This module allows the quantification of predicted genes obtained from one of the two assembly modules described before.

╔╗ ┬┌─┐┌─┐┌─┐┬─┐┌─┐╔═╗╦═╗╔═╗  ╔╦╗┬─┐┌─┐┌┐┌┌─┐┌─┐┬─┐┬┌─┐┌┬┐┌─┐┌┬┐┌─┐  ╔═╗┌─┐┌─┐┌─┐┌┬┐┌┐ ┬ ┬ ┬
╠╩╗││ ││  │ │├┬┘├┤ ║  ╠╦╝║ ╦   ║ ├┬┘├─┤│││└─┐│  ├┬┘│├─┘ │ │ ││││├┤   ╠═╣└─┐└─┐├┤ │││├┴┐│ └┬┘
╚═╝┴└─┘└─┘└─┘┴└─└─┘╚═╝╩╚═╚═╝   ╩ ┴└─┴ ┴┘└┘└─┘└─┘┴└─┴┴   ┴ └─┘┴ ┴└─┘  ╩ ╩└─┘└─┘└─┘┴ ┴└─┘┴─┘┴ 
                                                                                
====================================================
BIOCORE@CRG Transcriptome Quantification - N F  ~  version 0.1
====================================================
pairs                               : ../test_data/*_{1,2}.fq.gz
transcripts                         : ../assembly/output/Assembly/Trinity.fasta
transmap                            : ../assembly/output/Assembly/Trinity.fasta.gene_trans_map
output                              : output
email                               : YOUREMAIL@YOURDOMAIN
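
Because this module reads files written by an assembly module, it can help to confirm that they exist before launching. The sketch below is one way to do that; require_files is a hypothetical helper, and the paths in the usage note are the defaults shown in the log above.

```shell
#!/bin/sh
# Hypothetical helper: report every file that is missing or empty on stderr
# and return non-zero if at least one is.
require_files() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return $status
}
```

For example, require_files ../assembly/output/Assembly/Trinity.fasta ../assembly/output/Assembly/Trinity.fasta.gene_trans_map returns success only when both inputs are in place.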
