bodeolukolu / ngsComposer

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Introduction

ngsComposer is an automated pipeline for demultiplexing and quality-filtering Next-Generation Sequencing (NGS) reads.

For questions, bugs, and suggestions, please contact: bolukolu@utk.edu.

Developers: Ryan G. Kuster (UTK, TN) and Bode A. Olukolu (UTK, TN)

Features

  • Full start-to-finish pipeline for various NGS library types
  • Few dependencies (Python3 and R)
  • Easy to learn and designed for biologists
  • Supports variable length barcodes and dual-indexing.
  • Trims buffer sequences and quality filters on a read-by-read basis
  • Accepts project directory of multiple libraries

Contents

Installation

Currently, ngsComposer is only available for unix-based systems (i.e. macOS and linux).

Clone or download the Git repository to your desired folder

git clone https://github.com/bodeolukolu/ngsComposer.git

Dependencies:

  • Python3 version 3.5 or above
  • R, R-ggplot2
  • pigz (not required, but recommended for parallel gzip and gunzip i.e. faster)

For help troubleshooting installation, see the troubleshooting section

Usage

Basic usage

Set up your project directory containing the following:

  • A folder named "samples", which contains fastq file(s). Multiple libraries or demultiplexed fastq files can be included.
  • A file named "barcodes_lib1.txt", which contains barcodes and associated sample IDs. For additional libraries with different sample IDs specifiy "barcodes_lib2.txt", "barcodes_lib3.txt", ... "lib1_R1" and "lib1_R2" are variables in "config.sh" for R1 and R2 fastq file names, respectively, and correspond to "barcodes_lib1.txt". For barcodes/barcode pairs that is not assigned to any sample, indicate the sample id as NA or na.
  • 2 files containing adapter sequences of R1/P5/forward (<adapters.R1.txt>) and R2/P7/reverse (<adapters.R2.txt>) reads.
  • config.sh (see "Configuration" below for detailed instructions on creating this file) .

From command line, run ngsComposer as shown below:

$ bash <path_to_ngsComposer_directory>/ngsComposer <path_to_project_directory>

If this is the first time running the pipeline, you may need to wait for R to install the appropriate packages and dependencies.

Several example datasets are included in the "examples" directory. Users are encouraged to examine and run these small projects to assist in understanding pipeline functionality.


Overview

The order of steps in the ngsComposer pipeline are outlined in the following figure:

The steps implemented are first specified in a configuration file.


Configuration

Using a text editor, save a file containing any of the following variables as a bash script called 'config.sh' and include it in your project directory.

General parameters

Variable Default Usage Input required/Optional
threads total-2 choose maximum number of subprocesses that can run simultaneously integer optional
walkaway True run from beginning to end without pausing at qc steps True or False optional
cluster False run on compute cluster node (default: slurm) or workstation True or False optional
samples_alt_dir False input files stored in different different from project directory True or False optional
rm_transit True remove each transitional file folder to save space True or False optional

Input files

Variable Default Usage Input required/Optional
lib1_R1 na input fastq file name for R1/P5/forward reads string required
lib1_R2 na input fastq file name for R2/P7/reverse reads string optional
lib1_bc na name of file containing barcodes string required
lib2_R1 na additional input fastq file name for R1/P5/forward reads string required
lib2_R2 na additional input fastq file name for R2/P7/reverse reads string optional
lib2_bc na additional name of file containing barcodes string required

Tool Parameters

Variable Default Usage Input required/Optional
front_trim 0 number of bases in buffer sequence to trim integer optional
mismatch 1 number of mismatches (hamming distance) allowed in barcodes integer optional
R1_motif na motif filtering for R1/P5/forward reads string or list of comma-separated strings optional
R2_motif na motif filtering for R2/P7/reverse reads string or list of comma-separated strings optional
end_score 20 end-trim once entire window >= this Q score integer optional
window 10 size of window to test for >= end_trim integer optional
min_len 64 minimum read length to retain after end-trimming and adapter removal integer optional
adapter_match 12 number of base matches to identify adapters integer optional
q_min 20 Q score minimum (Phred value 0-40) applied to q_percent variable integer optional
q_percent 80 percentage of basses in read >= q_min Q scores integer optional

Visualizations

Variable Default Usage Input required/Optional
QC_demultiplexed na produce Quality score plot summary and/or full optional
QC_motif_validated na produce Quality score plot summary and/or full optional
QC_end_trimmed na produce Quality score plot summary and/or full optional
QC_adapter_removed na produce Quality score plot summary and/or full optional
QC_final na produce Quality score plot summary and/or full optional

**Note: na indicates that variable is user-defined. Analytical will be skipped if set to na.

An example configuration file may look like this:

config.sh

#General_parameters
###################################################
threads=24
walkaway=True
cluster=True
samples_alt_dir=False
rm_transit=True

#Input_files
###################################################
lib1_R1=test1_R1.fastq.gz
lib1_R2=test1_R2.fastq.gz
lib1_bc=barcodes_lib1.txt
lib2_R1=test2_R1.fastq.gz
lib2_R2=test2_R2.fastq.gz
lib2_bc=barcodes_lib2.txt

#Tool_parameters
###################################################
front_trim=6
mismatch=1
R1_motif=TGCATA,TGCATC,TGCATT
R2_motif=CATG
end_score=20
window=10
min_len=64
adapter_match=12
trim_homopolymer=10
q_min=20
q_percent=80

#Visualizations
###################################################
QC_demultiplexed=summary,full
QC_motif_validated=summary,full
QC_end_trimmed=summary,full
QC_adapter_removed=summary,full
QC_final=summary,full

In the above example, the maximum number of subprocesses spawned will be 24 (threads = 24). The pipeline will pause after relevant steps (walkaway = False) so users can view qc plots and have the option of modifying or bypassing the step. To save disk space, transitional directories will be removed (rm_transit = True) and only the final filtered data and any qc stats created in the pipeline will remain. Regardless of if walkaway if True or False, pipeline will ask if initial_qc should be generated.

A buffer sequence of length 6 (front_trim = 6) will be trimmed before demultiplexing, which will allow mismatch at a hamming distance of 1 (mismatch=1). For variable length barcodes, the same number of proximal bases (based on the minimum barcode length) are used for demultiplexing, while the additional distal bases in barcodes are trimmed off.

In this case, samples were double-digested with AluI and HaeIII and A-tailed before adapter ligation (R1_motif=TCC,TCT and R2_motif=TCC,TCT). Only reads containing these motifs will pass to subsequent steps. For A-tailed libraries, an A can be appended to the R1_motif and R2_motif strings.

Automatic end-trimming will be performed based on Q score. Here, groups of bases are considered within a moving window of 10 bases at a time (window=10) until that window consists only of the desired Q score at or above 20 (end_score=20). It is at this point that the read is trimmed. Reads that are less than 64 bp will be discarded (min_len=64)

Only reads that have a Q score of 20 (q_min=20) acrosss at least 95 percent of the read (q_percent=80) will pass to subsequent steps. If a R1 read or an R2 read passes while its partner fails, it will be placed into a single-end read subfolder and the failing read will be discarded.

Alternatively, a configuration file may only need to include necessary components for a run:

conf.py

#Input_files
###################################################
lib1_R1=test1_R1.fastq.gz
lib1_bc=barcodes_lib1.txt

*Since most of the parameters are hard-coded in an intuitive manner, by specifying only the fastq file name (at least single-end data) and associated barcode (only required for demultiplexing), the pipelines determines the other parameters as some stated below:

  • threads: computes available number of cores (n) and uses n-2 threads
  • defaults: walkaway=True, cluster=False, samples_alt_dir=False, rm_transit=True, front_trim=0, mismatch=1, no motif filtering, end_score=20, min_len=64, adapter_match=12, q_min=20, q_percent=80, only summary final QC, and initial QC will be determined based on a prompt before submitting job.

Demultiplexing

Barcodes file(s)

Optionally, one or more barcode files may be included in the project directory for demultiplexing. The following files are required at minimum:

  • barcodes_1.txt
  • index.txt

Naming conventions: "index.txt" is required, the barcodes file can be named as desired (see "Index file for directing multiple barcode files")

The barcodes file is a tab or space delimited file with no spaces in sample names (or, copy directly from your favorite spreadsheet program into a text file). Forward barcodes begin each row and reverse barcodes begin each column with the desired sample names indicated in the interior of the matrix. For example, the following would be required for a dual-indexed library:

barcodes_1.txt

	A	C	G	T
A	sample1	sample5	sample6	sample10
C	sample2	sample5	sample7	sample10
G	sample3	sample5	sample8	sample10
T	sample4	sample5	sample9	sample10

Note that in the example above the reverse barcode "C" corresponds with multiple identical sample names (sample5). While not common practice, ngsComposer accomodates repeated sample names and concatenates accordingly.

If reverse barcodes do not require demultiplexing, the barcode file can be set up as follows with "NA" or any other text used as a header in the first row:

barcodes_1.txt

	NA
A	sample1
C	sample2
G	sample3
T	sample4

Adapters

Adapters file(s)

Optionally, 'adapters.R2.txt' and 'adapters.R1.txt' may be included in the project directory for recognition and removal of adapters. The 'adapters.R2.txt' file contains the adapters expected to appear in the R1 reads. Adapter sequences should be newline-separated and be in 5' to 3' orientation. If libraries are barcoded, users are encouraged to provide adapter sequences that contain the corresponding barcodes expected in the opposing end of the read's adapter.

adapters.R2.txt

GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCGCTCAGTTC
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTATCTGACCT
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTATATGAGACG
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCTTATGGAAT
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTAATCTCGTC
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGCGCGATGTT
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTAGAGCACTAG
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTGCCTTGATC
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCTACTCAGTC
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTCGTCTGACT
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGAACATACGG
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCCTATGACTC
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTAATGGCAAG
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGTGCCGCTTC
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCGGCAATGGA
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGCCGTAACCG
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTAACCATTCTC

Each of the above sample adapters is presented in 5' to 3' orientation and shares a common 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT' adapter sequence followed by expected barcodes. Adapter sequences may also include restriction motifs for greater detection, but these sequences will also be removed. Porifera.py creates all reverse-complements before detection.

When paired end data is used, as above, 'adapters.R1.txt' and 'adapters.R2.txt' must be provided. Adapters are tested for the inclusion of barcodes and only those combinations of R1/R2 barcodes leading to a given sample will be used to search for adapters quickly and with a lower false positive rate.

Standalone

All tools available in the ngsComposer pipeline can be called individually from the command line. Please see the ngsComposer Standalone Tools page for usage.

Related Software

Select Article Referencing ngsComposer

  1. ngsComposer: an automated pipeline for empirically based NGS data quality filtering. Kuster et al. 2021

Acknowledgment

This package has been developed as part of the Genomic Tools for Sweetpotato Improvement project (GT4SP) and SweetGAINS, both funded by Bill & Melinda Gates Foundation.

Troubleshooting

Python installation

To view Python version, from the terminal type:

$ python3 --version

If python3 is not found, you can try one of the python3 releases from the Python Software Foundation downloads page.

Alternatively, a package manager is an easy way to install Python from the terminal. For Ubuntu, Python can be installed directly using apt (replace 'X' with an existing version in the apt repository):

$ sudo apt-get update
$ sudo apt-get install python3.X

...or with homebrew on macOS using:

brew install python3

After installation please check that the newest version is present in your current environment (i.e.; $PATH).

R installation

To view R version, from the terminal type:

$ R --version

To install the newest version of R, see the releases available at the Comprehensive R Archive Network downloads page.

For Ubuntu, R can be installed directly using apt:

$ sudo apt update
$ sudo apt install r-base

...or with homebrew on macOS using:

brew install r

Notes on ggplot2 installation:

ngsComposer requires the R package ggplot2 and its dependencies. ngsComposer will attempt to automatically download these packages to the local ngsComposer repo (/ngsComposer/tools/helpers/R_packages).

The installation of ggplot2 and dependencies may take some time during the first use. If package installation fails, manual installation within R may be necessary. It may be beneficial to install R packages as root and if using macOS, ensure Xcode toolkit is up to date.

Versioning

Versioning will follow major.minor.patch semantic versioning format.

License

Apache License Version 2.0

About


Languages

Language:C++ 80.6%Language:HTML 15.8%Language:R 1.3%Language:SCSS 0.7%Language:JavaScript 0.5%Language:CSS 0.3%Language:C 0.2%Language:Python 0.1%Language:Nemerle 0.1%Language:PostScript 0.1%Language:Shell 0.1%Language:Roff 0.0%Language:TeX 0.0%Language:Lua 0.0%Language:SAS 0.0%Language:Assembly 0.0%Language:Makefile 0.0%Language:Less 0.0%Language:MATLAB 0.0%Language:Perl 0.0%Language:AppleScript 0.0%Language:M 0.0%