sequence_handling

A series of scripts to automate DNA sequence aligning and quality control workflows via list-based batch submission and parallel processing

Introduction

For more detail on any part of this workflow, please see the wiki for this repository.

What is `sequence_handling` for?

sequence_handling is a series of scripts to automate and speed up DNA sequence aligning and quality control through the use of our workflow outlined here. This repository contains two general kinds of scripts: Shell Scripts and Batch Submission Scripts, with one exception.

The former group is designed to be run directly from the command line. These scripts serve as partial dependency installers, list generators for batch submission, QSub starters, and tasks that cannot run reliably in parallel or under the Portable Batch System because of memory constraints. Running any of these scripts without arguments prints a usage message with more details. Each script is named entirely in lower-case letters.

The latter group is designed to run the workflow in batch and in parallel. These scripts use a list of sequences, with full sequence paths, as their input and utilize GNU Parallel to speed up the analysis and work they are designed for. Due to the length of time and resources needed for these scripts to run, they are designed to be submitted to a job scheduler, specifically the Portable Batch System. Each script is named using capital and lower-case letters.

Finally, there is one script that is designed neither to run directly from the shell nor to be submitted to a job scheduler. This script, plot_cov.R, is designed to be called by Plot_Coverage.sh for creating coverage plots. This is done automatically; you do not need to change this script unless you wish to change the graphing parameters.

NOTE: the latter group of scripts and read_mapping_start.sh are designed to use the Portable Batch System and run on the Minnesota Supercomputing Institute. Heavy modifications will need to be made if not using these systems.

Why use list-based batch submission?

Piping a single sample through this workflow can take over 12 hours to run. Most sequence handling jobs deal with more than one sample, so the time needed to run this workflow increases drastically. List-based batch submission reduces the amount of typing that one has to do, and enables parallel processing to decrease the time spent waiting for samples to finish. An example list is shown below:

/home/path_to_sample/sample_001_R1.fastq.gz

/home/path_to_sample/sample_001_R2.fastq.gz

/home/path_to_sample/sample_003_R1.fastq.gz

/home/path_to_sample/sample_003_R2.fastq.gz

Why use parallel processing?

Parallel processing decreases the total run time by running multiple jobs at once and keeping track of which are done, which are running, and which have yet to be run. This workflow, with list-based batch submission and parallel processing, both simplifies and speeds up the process of sequence handling.
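As a minimal sketch of how a sample list can drive parallel jobs, the example below runs one command per line of a list, two at a time. The per-sample command (`basename`) is only a stand-in for a real tool, and the `xargs` fallback is an assumption added so the sketch still runs on systems without GNU Parallel; it is not part of this workflow.

```shell
#!/bin/bash
# Build a small sample list like the example above (the paths are placeholders).
sample_list="$(mktemp)"
printf '%s\n' \
    /home/path_to_sample/sample_001_R1.fastq.gz \
    /home/path_to_sample/sample_001_R2.fastq.gz \
    /home/path_to_sample/sample_003_R1.fastq.gz \
    /home/path_to_sample/sample_003_R2.fastq.gz > "$sample_list"

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # GNU Parallel: one job per line of the list, at most two running at once.
    results="$(parallel --jobs 2 basename {} :::: "$sample_list")"
else
    # Fallback for this sketch only: xargs runs up to two jobs at once.
    results="$(xargs -P 2 -n 1 basename < "$sample_list")"
fi

# Jobs may finish in any order, so sort the output before printing it.
echo "$results" | sort
```

Replacing `basename` with a real per-sample command is all that changes between tasks; the list stays the same.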

Do I have to use the entire workflow as is?

No. With the one exception of Plot_Coverage.sh and plot_cov.R, no two scripts are entirely dependent on one another. While all these scripts are designed to pass their output easily from one to the next, they are not required to achieve the end result of sequence_handling. If you prefer tools other than the ones used within this workflow, you can modify or replace any or all of the scripts offered in sequence_handling. This creates a pseudo-modularity for the entire workflow that allows each user to customize it.

Dependencies

Due to the pseudo-modularity of this workflow, specific dependencies for each individual script are listed below. Some general dependencies for the workflow as a whole are listed here:

A quality trimmer, such as Seqqs, Sickle, and Scythe
Tools for plotting results, such as R
SAM file processing utilities, such as SAMTools and Picard
A quality control mechanism, such as FastQC
A read mapper, such as The Burrows-Wheeler Aligner (BWA)
GNU Parallel

Please note that this is not a complete list of dependencies. Check below for specific dependencies for each desired script.

When running these scripts on the Minnesota Supercomputing Institute's (MSI) resources, most dependencies are included through MSI's module system. These modules are set to be automatically called by each script that calls upon them. However, some dependencies are not available through MSI; please check each script for which dependencies need to be installed separately.

Shell Scripts

NOTE: Running any of these scripts without arguments generates a usage message with more detail about how to use them.

installer.sh

The installer.sh script installs Seqqs, Sickle, and Scythe for use with the Quality_Trimming.sh script. It also has options for installing Bioawk, SAMTools, and R, all of which are dependencies for various scripts within this package.

dependencies

The installer.sh script depends on Git, Wget, the GNU Compiler Collection (GCC), and GNU Make to run.

sample_list_generator.sh

The sample_list_generator.sh script creates a list of samples by searching a directory tree. It finds all samples in a given directory and its subdirectories, so only use it if you want every sample within that directory tree. sample_list_generator.sh is designed to be run directly from the command line.
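The directory-tree search described above can be approximated with `find`. This is a minimal sketch, not the script's actual implementation: the function name `list_samples` and the `.fastq.gz` suffix are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: list every gzipped FASTQ file under a directory tree.
list_samples() {
    local search_dir="$1"
    # -type f keeps regular files only; sort makes the order reproducible.
    find "$search_dir" -type f -name '*.fastq.gz' | sort
}

# Demonstration on a throwaway directory tree with one non-sample file.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/run1" "$demo_dir/run2"
touch "$demo_dir/run1/sample_001_R1.fastq.gz" \
      "$demo_dir/run1/sample_001_R2.fastq.gz" \
      "$demo_dir/run2/sample_003_R1.fastq.gz" \
      "$demo_dir/run2/notes.txt"

# Searching from an absolute path yields full paths, as the batch scripts expect.
list_samples "$demo_dir" > "$demo_dir/sample_list.txt"
cat "$demo_dir/sample_list.txt"
```

Because every matching file in the tree is included, any sample that should be skipped has to be removed from the list by hand afterwards.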

dependencies

The sample_list_generator.sh script has no external dependencies.

read_counts.sh

The read_counts.sh script calls Bioawk to get accurate read counts for a list of samples. Output is written to a tab-delimited file, with each sample name drawn from the corresponding file name in the list of samples.
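To illustrate the shape of the output, the sketch below counts reads with plain `gzip` and `wc` rather than Bioawk, so it runs without extra dependencies. It assumes every FASTQ record spans exactly four lines, which Bioawk does not require; the function name `count_reads` and the file names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print "<sample name><TAB><read count>" for one gzipped FASTQ.
# Assumes four lines per record; this is a stand-in for the Bioawk call.
count_reads() {
    local fastq="$1"
    local name lines
    name="$(basename "$fastq" .fastq.gz)"
    lines="$(gzip -cd "$fastq" | wc -l | tr -d ' ')"
    printf '%s\t%d\n' "$name" "$((lines / 4))"
}

# Demonstration: a tiny FASTQ file with two reads, referenced from a sample list.
demo_dir="$(mktemp -d)"
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip > "$demo_dir/sample_001_R1.fastq.gz"
echo "$demo_dir/sample_001_R1.fastq.gz" > "$demo_dir/sample_list.txt"

# One output row per line of the sample list.
while read -r fastq; do
    count_reads "$fastq"
done < "$demo_dir/sample_list.txt" > "$demo_dir/read_counts.txt"
cat "$demo_dir/read_counts.txt"
```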

dependencies

The read_counts.sh script depends on Bioawk to run.

read_mapping_start.sh

The read_mapping_start.sh script generates a series of QSub submissions for use with the Portable Batch System on MSI's resources. Each submission starts a BWA session to map reads back to a reference genome.
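A rough sketch of the idea follows: build one BWA command per pair of read files, then hand the command text to the scheduler. Everything specific here is an assumption for illustration, including the function name `bwa_command`, the reference file name, the thread count, the `_R1`/`_R2` naming, and the resource request in the final comment. The sketch only prints the command and does not submit anything.

```shell
#!/bin/bash
# Hypothetical helper: print the BWA command for one pair of read files.
bwa_command() {
    local ref="$1" fwd="$2" rev="$3"
    local name
    # Derive the output name from the forward read file (assumed naming scheme).
    name="$(basename "$fwd" _R1.fastq.gz)"
    echo "bwa mem -t 8 ${ref} ${fwd} ${rev} > ${name}.sam"
}

cmd="$(bwa_command reference.fasta sample_001_R1.fastq.gz sample_001_R2.fastq.gz)"
echo "$cmd"

# On a PBS system, the command text could then be passed to qsub on standard input,
# for example: echo "$cmd" | qsub -l nodes=1:ppn=8,walltime=12:00:00
```

Submitting each pair as its own job lets the scheduler allocate memory per sample rather than for the whole batch.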

dependencies

The read_mapping_start.sh script depends on the Portable Batch System and BWA to run.

Batch Submission Scripts

NOTE: Each of these scripts contains usage information within the script itself. Furthermore, all values for these scripts are hard-coded into the script itself. Please open each script using your favourite text editor (e.g. Vim, Sublime Text, Visual Studio Code) to read the usage information and set values.

Assess_Quality.sh

The Assess_Quality.sh script runs FastQC from the command line on a series of samples organized in a project directory for quality control. In addition, a list of all output zip files is generated for use with the Read_Depths.sh script. We recommend running this script both before and after quality trimming, and before read mapping. This script is designed to be run using the Portable Batch System.

dependencies

The Assess_Quality.sh script depends on FastQC, the Portable Batch System, and GNU Parallel to run.

Read_Depths.sh

The Read_Depths.sh script utilizes the output from FastQC to calculate the read depths for a batch of samples and outputs them into one convenient text file.

dependencies

The Read_Depths.sh script depends on the Portable Batch System and GNU Parallel to run.

Quality_Trimming.sh

The Quality_Trimming.sh script runs trim_autoplot.sh (part of the Seqqs repository on GitHub) on a series of samples organized in a project directory. In addition to requiring Seqqs to be installed, this also requires GNU Parallel to be installed on the system.

dependencies

The Quality_Trimming.sh script depends on Sickle, Scythe, Seqqs, R, the Portable Batch System, and GNU Parallel to run.

SAM_Processing_SAMTools.sh

The SAM_Processing_SAMTools.sh script converts the SAM files from read mapping with BWA to the BAM format using SAMTools. In the conversion process, it will sort and deduplicate the data for the finished BAM file, also using SAMTools. Alignment statistics will also be generated for both raw and finished BAM files. A list of finished BAM files will be generated at the end of this script.

dependencies

The SAM_Processing_SAMTools.sh script depends on SAMTools, the Portable Batch System, and GNU Parallel to run.

SAM_Processing_Picard.sh

The SAM_Processing_Picard.sh script converts the SAM files from read mapping with BWA to the BAM format using SAMTools. In the conversion process, it will sort and deduplicate the data for the finished BAM file, using Picard. Alignment statistics will also be generated for both raw and finished BAM files. A list of finished BAM files will be generated at the end of this script.

NOTE: This script is extremely resource intensive; please use with caution.

NOTE: This script has not been tested; use with caution.

dependencies

The SAM_Processing_Picard.sh script depends on SAMTools, Picard, the Portable Batch System, and GNU Parallel to run.

Coverage_Map.sh

The Coverage_Map.sh script generates coverage maps from BAM files using BEDTools. This map is in text format and is used for making coverage plots. In addition to generating coverage maps, this script will create a list of all the coverage maps generated for use in other scripts.

dependencies

The Coverage_Map.sh script depends on BEDTools, the Portable Batch System, and GNU Parallel to run.

Plot_Coverage.sh

The Plot_Coverage.sh script creates plots using R based off of coverage maps. It will generate three plots: one showing coverage across the genome, one showing coverage across exons, and one showing coverage across genes. This script uses plot_cov.R to generate the plots.

dependencies

The Plot_Coverage.sh script depends on the plot_cov.R script, R, the Portable Batch System, and GNU Parallel to run.

Other Scripts

plot_cov.R

The plot_cov.R script is the graphical brains behind the Plot_Coverage.sh script. The latter will automatically call upon the former to create the coverage plots based off coverage maps. It is not necessary to open this script directly, except for making modifications to the graphical parameters.

dependencies

The plot_cov.R script has no external dependencies.

TODO

~~Generalize read_counts.sh for any project.~~ DONE!
~~Add better list-out methods~~ DONE!
~~Fix memory issues with Read_Mapping.sh~~ ~~Redesign read mapping scripts~~ DONE!
~~Add coverage map script to workflow~~ ~~Finish integrating Coverage_Map.sh with the rest of the pipeline~~ DONE!
~~Get Plot_Coverage.sh and plot_cov.R integrated into the pipeline~~ DONE!
~~Add information about plot_cov.R to the README~~ DONE!
~~Add script to easily convert SAM files from Read_Mapping.sh to BAM files for Coverage_Map.sh~~ ~~DONE!~~ ~~ish...~~ DONE!
~~Add Deduplication script~~ Get ~~Deduplication.sh~~ SAM_Processing_Picard.sh working
~~Add read mapping statistics via samtools flagstat~~ DONE! This is integrated into SAM_Processing_SAMTools.sh
Incorporate variant calling scripts into the pipeline
keep README updated
