vapolonio / PreProcSEQ

Quality control pipeline and pre-processing of data from RNA-Seq

PreProcSEQ

Introduction

RNA-Seq has become a prominent sequencing technology, and the downstream analysis of its raw data is now a focus of bioinformatics. This pipeline presents the main steps for building a gene expression matrix from raw RNA-Seq data.

The pipeline covers the following steps:

  • quality control

  • trimming

  • transcript quantification

  • annotation of transcripts

  • normalization

  • batch effect removal

Installation of the necessary tools for the execution of the pipeline

Make sure you have installed all the tools the pipeline needs to run:

Tools: FastQC, MultiQC, Trimmomatic, Salmon, Kallisto, R

R packages: tximport, tximeta, GenomicFeatures, ensembldb, SummarizedExperiment, readxl, AnnotationHub, stringr, edgeR, sva, magrittr

To simplify installation, we provide the installTools.sh script, which contains the installation commands for each tool.

Below is a quick start for the pipeline. See the complete pipeline manual for details.
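Before running the pipeline, it can help to confirm that the command-line tools are actually on your PATH. The following is a minimal sketch, not part of the repository: the function name missing_tools is hypothetical, and the executable names in the example (in particular a trimmomatic wrapper) are assumptions that depend on how the tools were installed.

```shell
#!/bin/sh
# Print the name of every command-line tool that is not found on PATH.
# Usage: missing_tools TOOL...
missing_tools() {
    for tool in "$@"; do
        # `command -v` succeeds only if the shell can resolve the name.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example (executable names are assumptions; adjust to your installation):
#   missing_tools fastqc multiqc trimmomatic salmon kallisto Rscript
```

An empty result means every listed tool was found. The R packages are not covered by this check.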

Quick start

I. download the repository and extract the files to your home directory

cd ~
wget https://github.com/resendejss/PreProcSEQ/archive/refs/heads/main.zip
unzip main.zip
cd PreProcSEQ-main

II. installation of tools

./installTools.sh

III. FASTQs quality control

Check the quality of each FASTQ file. The input files are in the 0-samples directory.

./qualityControl_beforeTrimming.sh

FastQC results are saved to 1-qualityControl_beforeTrimming/outputFastQC and MultiQC results are saved to 1-qualityControl_beforeTrimming/outputMultiQC.
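For orientation, a quality-control step of this kind typically runs FastQC on every FASTQ file and then aggregates the reports with MultiQC. The sketch below is an assumption about what such a step looks like, not the contents of qualityControl_beforeTrimming.sh. The helper names run and qc_reports are hypothetical, as is the .fastq.gz extension in the example.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1 (useful for checking paths).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# qc_reports OUT_DIR FASTQ...
# Writes FastQC reports to OUT_DIR/outputFastQC and a MultiQC summary
# to OUT_DIR/outputMultiQC.
qc_reports() {
    out_dir=$1
    shift
    run mkdir -p "$out_dir/outputFastQC" "$out_dir/outputMultiQC"
    # One FastQC report (HTML + zip) per input file.
    run fastqc --outdir "$out_dir/outputFastQC" "$@"
    # MultiQC scans the FastQC output and aggregates it into one report.
    run multiqc "$out_dir/outputFastQC" --outdir "$out_dir/outputMultiQC"
}

# Example (file extension is an assumption):
#   qc_reports 1-qualityControl_beforeTrimming 0-samples/*.fastq.gz
```

Setting DRY_RUN=1 before the call prints the three commands without executing them.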

III. trimming

./trimming_trimmomatic.sh

The files produced by Trimmomatic are in 2-trimming/trimmomatic/paired and 2-trimming/trimmomatic/unpaired. The paired directory holds the trimmed reads (low-quality bases removed) whose mates also passed trimming. The unpaired directory holds reads that passed trimming but whose mate was discarded.
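The paired and unpaired outputs come from the argument order of Trimmomatic's paired-end mode: two inputs, then paired and unpaired output files for each mate. The sketch below illustrates that layout under several assumptions that are not taken from trimming_trimmomatic.sh: the trim_sample name, the _R1/_R2 file naming, the thread count, and the trimming steps (which follow the example in the Trimmomatic manual) are all illustrative. It also assumes a trimmomatic wrapper command is installed and that the paired/ and unpaired/ directories already exist.

```shell
#!/bin/sh
# trim_sample SAMPLE IN_DIR OUT_DIR ADAPTERS
# Trims one paired-end sample. Set TRIMMOMATIC=echo to print the arguments
# instead of running the tool.
trim_sample() {
    sample=$1; in_dir=$2; out_dir=$3; adapters=$4
    # PE argument order: in1 in2 out1-paired out1-unpaired out2-paired out2-unpaired
    "${TRIMMOMATIC:-trimmomatic}" PE -threads 4 -phred33 \
        "$in_dir/${sample}_R1.fastq.gz" "$in_dir/${sample}_R2.fastq.gz" \
        "$out_dir/paired/${sample}_R1.fastq.gz" "$out_dir/unpaired/${sample}_R1.fastq.gz" \
        "$out_dir/paired/${sample}_R2.fastq.gz" "$out_dir/unpaired/${sample}_R2.fastq.gz" \
        "ILLUMINACLIP:$adapters:2:30:10" LEADING:3 TRAILING:3 \
        SLIDINGWINDOW:4:15 MINLEN:36
}

# Example (sample name and adapter file are hypothetical):
#   trim_sample sampleA 0-samples 2-trimming/trimmomatic adapters.fa
```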

IV. quality control of FASTQs after trimming

./qualityControl_afterTrimming.sh

FastQC results are in PreProcSEQ-main/3-qualityControl_afterTrimming/outputFastqc and MultiQC results are in PreProcSEQ-main/3-qualityControl_afterTrimming/outputMultiqc.

V. transcript quantification

There are two quantification tool options: Salmon and Kallisto.

Salmon

# index construction
./salmon_index.sh

# quantification
./salmon_quant.sh

Kallisto

# index construction
./kallisto_index.sh

# quantification
./kallisto_quant.sh

Salmon results will be in 4-quantification/salmon/quant_salmon. Kallisto results will be in 4-quantification/kallisto/quant_kallisto.
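Both tools follow the same two-stage pattern: build an index from a transcriptome FASTA once, then quantify each sample against it. The sketch below shows the underlying commands for one paired-end sample. It is an illustration, not the contents of the repository's scripts: the function names, thread count, and argument layout are assumptions.

```shell
#!/bin/sh
# Set SALMON=echo or KALLISTO=echo to print the arguments instead of running.

# salmon_run TRANSCRIPTS_FASTA INDEX_DIR READS_1 READS_2 OUT_DIR
salmon_run() {
    "${SALMON:-salmon}" index -t "$1" -i "$2" &&
    # -l A asks Salmon to infer the library type from the reads.
    "${SALMON:-salmon}" quant -i "$2" -l A -1 "$3" -2 "$4" -p 4 \
        --validateMappings -o "$5"
}

# kallisto_run TRANSCRIPTS_FASTA INDEX_FILE READS_1 READS_2 OUT_DIR
kallisto_run() {
    "${KALLISTO:-kallisto}" index -i "$2" "$1" &&
    "${KALLISTO:-kallisto}" quant -i "$2" -o "$5" -t 4 "$3" "$4"
}
```

Salmon writes a quant.sf file into its output directory, and Kallisto writes abundance.tsv and abundance.h5. These are the files the matrix construction step reads.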

VI. construction of the gene expression matrix

tximeta

Running the R script via terminal:

Rscript matrixConstruction_tximeta_salmon.R

tximport

Running the R script via terminal:

# salmon output
Rscript matrixConstruction_tximport_salmon.R

# kallisto output
Rscript matrixConstruction_tximport_kallisto.R

The matrices will be in 5-expressionMatrix
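If matrix construction fails, a common cause is a sample whose quantification did not finish. The sketch below lists sample directories that lack the expected quantification file. It assumes one subdirectory per sample inside the quantification directory, which may not match this repository's layout, and the function name missing_quant is hypothetical.

```shell
#!/bin/sh
# missing_quant QUANT_DIR FILE_NAME
# Prints every sample directory under QUANT_DIR that does not contain FILE_NAME.
missing_quant() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue              # glob matched nothing
        [ -f "$d$2" ] || printf '%s\n' "${d%/}"
    done
}

# Examples:
#   missing_quant 4-quantification/salmon/quant_salmon quant.sf
#   missing_quant 4-quantification/kallisto/quant_kallisto abundance.h5
```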

VII. annotation of transcripts

Running the R script via terminal:

# matrix_kallisto_tximport
Rscript annotaionTranscripts_kallisto_matrixTximport.R

# matrix_salmon_tximport
Rscript annotationTranscript_salmon_matrixTximport.R

# matrix_salmon_tximeta
Rscript annotationTranscripts_salmon_tximeta.R

VIII. normalization of counts

Rscript normalizationTMM.R

The results will be in 7-normalizationCounts/tmm

IX. batch effect removal

# counts
Rscript batchEffectRemoval_counts.R

# normalized data
Rscript batchEffectRemoval_TMM.R

The results will be in 8-batchEffect_removal
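The steps above can also be chained so that the run stops at the first failure. The following is a minimal sketch, not a script shipped with the repository: the run_steps name and the RSCRIPT override are hypothetical, and it assumes all scripts are run from the directory that contains them. The example picks the Salmon and tximport branch; substitute the Kallisto or tximeta scripts as needed.

```shell
#!/bin/sh
# run_steps STEP...
# Runs each step in order and stops at the first one that fails.
# Files ending in .R are run with Rscript; everything else is executed directly.
run_steps() {
    for step in "$@"; do
        printf '>> %s\n' "$step" >&2
        case $step in
            *.R) "${RSCRIPT:-Rscript}" "$step" ;;
            *)   "./$step" ;;
        esac || {
            printf 'step failed: %s\n' "$step" >&2
            return 1
        }
    done
}

# Example (Salmon + tximport branch):
#   run_steps qualityControl_beforeTrimming.sh trimming_trimmomatic.sh \
#       qualityControl_afterTrimming.sh salmon_index.sh salmon_quant.sh \
#       matrixConstruction_tximport_salmon.R \
#       annotationTranscript_salmon_matrixTximport.R \
#       normalizationTMM.R batchEffectRemoval_TMM.R
```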
