This repository is a usable, publicly available tutorial. All steps have been provided here for the UConn CBC Xanadu cluster, with appropriate headers for the Slurm scheduler, and the scripts can be easily modified to run. Commands should never be executed on the submit nodes of any HPC machine. If working on the Xanadu cluster, you should use sbatch scriptname after modifying the script for each stage. Basic editing of all scripts can be performed on the server with tools such as nano, vim, or emacs. If you are new to Linux, please use this handy guide for the operating system commands. In this guide, you will be working with common bioinformatic file formats, such as FASTA, FASTQ, SAM/BAM, and GFF3/GTF. You can learn even more about each file format here. If you do not have a Xanadu account and are an affiliate of UConn/UCHC, please apply for one here.
Contents
This tutorial will teach you how to use open source quality control, genome assembly, and assembly assessment tools to complete a high quality de novo assembly, which is commonly needed when you don't have a reference genome. Moving through the tutorial, you will take paired-end short read data from a bacterial species and perform assemblies with several commonly used genome assemblers. With these assemblies completed, we will then assess the quality of the assemblies produced. Once finished with the short read data, we will move on to performing a long read assembly with nanopore data, covering basecalling and commonly used long read assemblers. Finally, we will move on to hybrid assembly with PacBio data.
The tutorial is organized into 3 parts:
- Short Read Assembly
- Long Read Assembly
- Hybrid Assembly
Once you git clone the repository, you can see three main folders:
Genome_Assembly/
├── short_read_assembly/
├── long_read_assembly/
└── hybrid_assembly/
Each folder contains the scripts to run each job on the HPC cluster. The tutorial uses the SLURM scheduler to submit jobs to the Xanadu cluster. Each script we use contains a header section which requests resources from the SLURM scheduler. The header section will contain:
#!/bin/bash
#SBATCH --job-name=JOBNAME
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem=1G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=first.last@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
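Once a script's header and commands have been edited for a given stage, it is submitted with sbatch, and the job's status can then be checked with squeue (both standard SLURM commands; the script name below is just an example):
sbatch sr_quality_control.sh
squeue -u $USER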
Before beginning, we need to understand a few aspects of the Xanadu server. When first logging into Xanadu from your local terminal, you will be connected to the submit node. The submit node is the interface with which users on Xanadu may submit their processes to the desired compute nodes, which will run the process. Never, under any circumstance, run processes directly in the submit node. Your process will be killed and all of your work lost! This tutorial will not teach you shell script configuration to submit your tasks on Xanadu. Therefore, before moving on, read and master the topics covered in the Xanadu tutorial.
When using short reads for genome assembly, the reads first need to be quality controlled before they are assembled. In this section we will show how to use three assemblers, SOAPdenovo, SPAdes, and MaSuRCA, to assemble short reads, and then how to evaluate the resulting assemblies.
In this study we are using short reads produced by an Illumina MiSeq instrument. The data are paired-end reads with a read length of 250 bp. The Nextera preparation method is sensitive and can produce lots of small fragments shorter than 500 bp. This results in R1 and R2 actually overlapping each other.
The genome has an estimated size of 3 Mb, and the data an initial coverage of 84X. As an exercise, you can try to calculate the coverage of this data set. Check out how to calculate the coverage here.
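As a minimal sketch of the calculation: coverage = (number of read pairs x 2 x read length) / genome size. For example, counting the records in one FASTQ file (4 lines per read) and assuming 250 bp paired-end reads and a 3 Mb genome:
reads=$(( $(wc -l < ../01_raw_reads/Sample_R1.fastq) / 4 ))   # number of read pairs
echo "$reads * 2 * 250 / 3000000" | bc -l                     # estimated coverage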
Here we will use the FastQC package to check the quality of the reads before and after trimming.
Quality check of raw reads:
mkdir -p RAWfastqc_OUT
fastqc -o ./RAWfastqc_OUT ../01_raw_reads/Sample_R1.fastq ../01_raw_reads/Sample_R2.fastq
The quality check of the raw reads shows adapter contamination from Nextera sequences, which needs to be removed.
Working directory:
Genome_Assembly/
└── short_read_assembly/
    └── 02_quality_control/
Trimmomatic is used to remove low quality and adapter sequences. The program can be used on single-end as well as paired-end sequences. Since we have identified Nextera sequences in the raw reads, we can give the program the specific sequences to trim using the ILLUMINACLIP option. More information on the specific options can be found on the Trimmomatic website.
java -jar $Trimmomatic PE -threads 4 \
../01_raw_reads/Sample_R1.fastq \
../01_raw_reads/Sample_R2.fastq \
trim_Sample_R1.fastq trim_Sample_R1_singles.fastq \
trim_Sample_R2.fastq trim_Sample_R2_singles.fastq \
ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 \
SLIDINGWINDOW:4:25 MINLEN:45
The complete slurm script is called sr_quality_control.sh. It will produce four fastq files: two containing the trimmed forward and reverse reads, and two files of single fastq sequences where only one read of a pair (forward or reverse) passed the filtering criteria.
02_quality_control/
├── trim_Sample_R1.fastq
├── trim_Sample_R1_singles.fastq
├── trim_Sample_R2.fastq
└── trim_Sample_R2_singles.fastq
After trimming, the coverage will be around 80X.
mkdir -p TRIMfastqc_OUT
fastqc -o ./TRIMfastqc_OUT ./trim_Sample_R1.fastq ./trim_Sample_R2.fastq
FastQC produces an HTML file with stats about your reads. You can download these HTML files to your local computer to view them using the transfer.cam.uchc.edu submit node, which facilitates file transfer.
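For example, from a terminal on your local machine the reports could be pulled through the transfer node with scp (the username and remote path here are placeholders):
scp your_user@transfer.cam.uchc.edu:/path/to/02_quality_control/TRIMfastqc_OUT/*.html .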
Working directory:
short_read_assembly/
├── 03_assembly/
│ ├── SOAP/
SOAPdenovo is a short read de novo assembler. When you do deep sequencing and have multiple libraries, they will produce multiple sequencing files. A configuration file lets the assembler know where to find these files. Here we will provide you with a configuration file.
The configuration file has a global section followed by multiple library sections. For global information, right now only max_rd_len is included: a read longer than this length will be cut to this length.
Following the global information, the library information of the sequencing data should be organized under [LIB] tags. Each library section starts with a [LIB] tag; the items it can include are:
- avg_ins : average insert size of this library, or the peak value position in the insert size distribution.
- reverse_seq : takes the value 0 or 1 and tells the assembler whether the read sequences need to be reverse complemented. Illumina GA produces two types of paired-end libraries:
  - forward-reverse, generated from fragmented DNA ends with typical insert size less than 500 bp (set reverse_seq=0);
  - forward-forward, generated from circularizing libraries with typical insert size greater than 2 kb (set reverse_seq=1).
- asm_flags : decides in which part(s) of the assembly the reads are used. It takes the values:
  - 1 : only contig assembly
  - 2 : only scaffold assembly
  - 3 : both contig and scaffold assembly
  - 4 : only gap closure
- rd_len_cutoff : the assembler will cut the reads from the current library to this length.
- rank : takes integer values and decides in which order the reads are used for scaffold assembly. Libraries with the same "rank" are used at the same time during scaffold assembly.
- pair_num_cutoff : the cutoff value of pair number for a reliable connection between two contigs or pre-scaffolds.
- map_len : takes effect in the "map" step and is the minimum alignment length between a read and a contig required for a reliable read location.
After the above tags, the read files can be added in the following fashion:
- Two formats are accepted: FASTA or FASTQ.
- Single-end files are indicated by f=/path-to-file/ (FASTA) or q=/path-to-file/ (FASTQ).
- Paired reads in two FASTA files are indicated by f1=/path-to-file/ and f2=/path-to-file/.
- Paired reads in two FASTQ files are indicated by q1=/path-to-file/ and q2=/path-to-file/.
- Paired reads in a single FASTA file are indicated by "p=".
The following is the configuration file we are using for our run:
#Global
#maximal read length
max_rd_len=250
#Library
[LIB]
#average insert size
avg_ins=470
#whether the sequence needs to be reverse complemented; in our case it is forward-reverse
reverse_seq=0
#both contig and scaffold assembly parts of the reads are used
asm_flags=3
#use only first 250 bps of each read
rd_len_cutoff=250
#in which order the reads are used while scaffolding
rank=1
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
#PATH to FASTQ reads
q1=../../02_quality_control/trim_Sample_R1.fastq
q2=../../02_quality_control/trim_Sample_R2.fastq
The configuration file we created for this job is named config_file and can be found in the SOAP/ folder.
Once the configuration file is ready, you can run the assembly job using the following command:
SOAPdenovo-127mer all \
-s config_file \
-K 31 \
-p 8 \
-R \
-o graph_Sample_31 1>ass31.log 2>ass31.err
SOAPdenovo-127mer all \
-s config_file \
-K 71 \
-p 8 \
-R \
-o graph_Sample_71 1>ass71.log 2>ass71.err
SOAPdenovo-127mer all \
-s config_file \
-K 101 \
-p 8 \
-R \
-o graph_Sample_101 1>ass101.log 2>ass101.err
SOAPdenovo assembly options:
Usage: SOAPdenovo-127mer <command> [option]
all do pregraph-contig-map-scaff in turn
-s <string> configFile: the config file
-K <int> kmer(min 13, max 127): kmer size
-p <int> n_cpu: number of cpu for use
-R (optional) resolve repeats by reads
-o <string> outputGraph: prefix of output graph file name
NOTE A k-mer is a nucleotide sequence of length k; for example, the sequence ACGTA contains the 3-mers ACG, CGT, and GTA. The k-mer size is a crucial parameter in most de Bruijn graph assemblers, and assemblers work with the highest accuracy when the k-mer size is chosen well.
The above script is called SOAPdenovo.sh and can be found in the SOAP/ directory. The script can be run using the sbatch command.
It will produce a bunch of files; we are interested in the scaffold sequences produced at the end of each k-mer run. These are the files we will use to assess the quality of our assembly.
SOAP
├── graph_Sample_31.scafSeq
├── graph_Sample_71.scafSeq
└── graph_Sample_101.scafSeq
Working directory:
short_read_assembly/
└── 03_assembly/
└── SPAdes/
SPAdes (St. Petersburg genome assembler) is a toolkit containing assembly pipelines. SPAdes was initially designed for small genomes, like bacterial, fungal, and other small genomes, and is not intended for larger genomes. SPAdes has different pipelines available; if you want to check them out, please visit the SPAdes website or its GitHub site.
Instead of requiring manually selected k-mers, SPAdes automatically selects k-mers based on the maximum read length of your input. It is a de Bruijn graph based assembler, meaning that it assigns (k-1)-mers to nodes and connects two nodes with an edge wherever a suffix of one matches a prefix of the other.
SPAdes takes as input paired-end reads, mate-pairs, and single (unpaired) reads in FASTA or FASTQ format. Here we are using paired-end reads; the left and right reads are held in two files and are taken into the assembler in the same order as they appear in the files.
Command line options we used:
module load SPAdes/3.13.0
spades.py \
-1 ../../02_quality_control/trim_Sample_R1.fastq \
-2 ../../02_quality_control/trim_Sample_R2.fastq \
-s ../../02_quality_control/singles.fastq \
--careful \
--threads 8 \
--memory 30 \
-o .
Basic SPAdes command line would look like:
spades.py [options] -o <output_dir>
where the options we used are:
Input
-1 file with forward paired-end reads
-2 file with reverse paired-end reads
-s file with unpaired reads
--careful tries to reduce number of mismatches and short indels
--threads number of threads
--memory RAM limit for SPAdes in GB (terminates if exceeded); default is 250
NOTE: it is very important to make sure the number of threads and the memory requested in the options match the SLURM header section of your script.
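For instance, to match the SPAdes command above (--threads 8 and --memory 30), the corresponding SLURM header lines would be (a sketch):
#SBATCH -c 8
#SBATCH --mem=30G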
The full script for our SPAdes run is called SPAdes.sh and it can be found in the 03_assembly/SPAdes/ directory.
Once the assembly script has run, it will produce a bunch of files together with the final scaffold file, which is called scaffolds.fasta. We will use this final scaffold file to analyze the SPAdes assembly run.
SPAdes/
├── scaffolds.fasta
Working directory will be:
03_assembly/
├── MaSuRCA/
MaSuRCA (Maryland Super Read Cabog Assembler) combines a de Bruijn graph and an Overlap-Layout-Consensus model. The Overlap-Layout-Consensus model consists of three steps. Overlap is the process of overlapping matching sequences in the data, which forms a long branched line. Layout is the process of picking the least branched line from the overlaps created earlier; the final product here is called a contig. Consensus is the process of lining up all the contigs and picking the most similar nucleotide line-up in this set of sequences.
When running MaSuRCA, there are a few things you should keep in mind. This assembler DOES NOT require a preprocessing step such as trimming, cleaning, or error correction; you feed it the raw reads directly.
The first step is to create a configuration file. A sample file can be copied from the MaSuRCA installation directory. The following command will copy it to your working folder.
cp $MASURCA/sr_config_example.txt config_file
The configuration file contains the location of the compiled assembler, the location of the data, and some parameters. In most cases you only need to change the paths to your read files.
The second step is to run the masurca script on the configuration file, which will create a shell script called assemble.sh. The final step is to run this assemble.sh script, which will create the scaffolds.
Let's look at the configuration file, which contains two sections, DATA and PARAMETERS; each section is closed with an END line.
In the DATA section:
DATA
#Illumina paired end reads supplied as <two-character prefix> <fragment mean> <fragment stdev> <forward_reads> <reverse_reads>
PE= pe 480 20 ../../01_raw_reads/Sample_R1.fastq ../../01_raw_reads/Sample_R2.fastq
END
The DATA section is where you specify the input data for the assembler. Each library line should start with the appropriate read type, e.g. PE, JUMP, or OTHER. In the above DATA section we have specified Illumina paired-end reads.
PE = two_letter_prefix mean stdev /path-to-forward-read /path-to-reverse-read
- mean is the average insert length of the library.
- stdev is the standard deviation of the insert length. If it is not known, set it to 15% of the mean.
- If the reverse read is not available, do not specify it.
If you are interested in other types of reads and how to include them in the DATA section, more information can be found on the MaSuRCA GitHub page.
In the PARAMETERS section:
PARAMETERS
#PLEASE READ all comments to essential parameters below, and set the parameters according to your project
#set this to 1 if your Illumina jumping library reads are shorter than 100bp
EXTEND_JUMP_READS=0
#this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content
GRAPH_KMER_SIZE = auto
#set this to 1 for all Illumina-only assemblies
#set this to 0 if you have more than 15x coverage by long reads (Pacbio or Nanopore) or any other long reads/mate pairs (Illumina MP, Sanger, 454, etc)
USE_LINKING_MATES = 0
#specifies whether to run the assembly on the grid
USE_GRID=0
#specifies grid engine to use SGE or SLURM
GRID_ENGINE=SGE
#specifies queue (for SGE) or partition (for SLURM) to use when running on the grid MANDATORY
GRID_QUEUE=all.q
#batch size in the amount of long read sequence for each batch on the grid
GRID_BATCH_SIZE=500000000
#use at most this much coverage by the longest Pacbio or Nanopore reads, discard the rest of the reads
#can increase this to 30 or 35 if your reads are short (N50<7000bp)
LHE_COVERAGE=25
#set to 0 (default) to do two passes of mega-reads for slower, but higher quality assembly, otherwise set to 1
MEGA_READS_ONE_PASS=0
#this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms
LIMIT_JUMP_COVERAGE = 300
#these are the additional parameters to Celera Assembler. do not worry about performance, number or processors or batch sizes -- these are computed automatically.
#CABOG ASSEMBLY ONLY: set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.
CA_PARAMETERS = cgwErrorRate=0.25
#CABOG ASSEMBLY ONLY: whether to attempt to close gaps in scaffolds with Illumina or long read data
CLOSE_GAPS=1
#auto-detected number of cpus to use, set this to the number of CPUs/threads per node you will be using
NUM_THREADS = 8
#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*20
JF_SIZE = 300000000
#ILLUMINA ONLY. Set this to 1 to use SOAPdenovo contigging/scaffolding module. Assembly will be worse but will run faster. Useful for very large (>=8Gbp) genomes from Illumina-only data
SOAP_ASSEMBLY=0
#Hybrid Illumina paired end + Nanopore/PacBio assembly ONLY. Set this to 1 to use Flye assembler for final assembly of corrected mega-reads. A lot faster than CABOG, at the expense of some contiguity. Works well even when MEGA_READS_ONE_PASS is set to 1. DO NOT use if you have less than 15x coverage by long reads.
FLYE_ASSEMBLY=0
END
The full configuration file in our run is called config_file and it can be found in the MaSuRCA/ directory. Once the configuration file is set up you can run MaSuRCA using:
module load MaSuRCA/3.3.4
masurca config_file
bash assemble.sh
module unload MaSuRCA/3.3.4
The full script for running MaSuRCA is called MASuRCA.sh and can be found in the 03_assembly/MaSuRCA/ folder.
Final assembly scaffolds can be found under the CA/ folder:
MaSuRCA/
├── CA/
│ ├── final.genome.scf.fasta
Once we have the contigs/scaffolds assembled, the next step is to check their quality. Here we will check it with QUAST, Bowtie2 alignment rates, and BUSCO, as described in the following sections.
QUAST will be used to evaluate the genome assemblies; it reports the number of contigs, the total length, and the N50 value, which are the statistics we are most interested in. A good assembly has a small number of contigs, a total length that makes sense for the specific species, and a large N50 value. N50 is a measure used to describe the quality of assembled genomes that are fragmented into contigs of different lengths: it is the minimum contig length needed to cover 50% of the genome.
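As a toy example: given contigs of lengths 80, 70, 50, 30, and 20 kb (total 250 kb), half the total is 125 kb; accumulating lengths from the largest contig down (80, then 150) first passes 125 kb at the second contig, so N50 = 70 kb and the corresponding L50 = 2.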
Working directory:
short_read_assembly/
└── 04_quast
- For SOAPdenovo scaffolds we will be using:
module load quast/5.0.2
quast.py ../03_assembly/SOAP/graph_Sample_31.scafSeq \
--threads 8 \
-o SOAP_31
quast.py ../03_assembly/SOAP/graph_Sample_71.scafSeq \
--threads 8 \
-o SOAP_71
quast.py ../03_assembly/SOAP/graph_Sample_101.scafSeq \
--threads 8 \
-o SOAP_101
The full script is called quast_SOAP.sh and can be found in the 04_quast directory.
- For SPAdes; the command we use would be like:
module load quast/5.0.2
quast.py ../03_assembly/SPAdes/scaffolds.fasta \
--threads 8 \
-o SPAdes
The full script is called quast_SPAdes.sh and can be found in the 04_quast directory.
- For MaSuRCA; the command we use would be like:
module load quast/5.0.2
quast.py ../03_assembly/MaSuRCA/CA/final.genome.scf.fasta \
--threads 8 \
-o MaSuRCA
The full script is called quast_MaSuRCA.sh, and can be found in the 04_quast directory.
The general command for executing QUAST looks like:
quast.py [options] <files_with_contigs>
Options:
-o --output-dir Directory to store all result files
-t --threads Maximum number of threads [default: 25% of CPUs]
These are the minimal commands we used; there are many other options you can use with the software, and if you are interested please have a look at the QUAST manual.
Once you have executed these scripts using the sbatch command, you will end up with a basic evaluation of the assemblies. These statistics can be found in each new folder you created:
04_quast/
├── MaSuRCA
│ └── report.txt
├── SOAP_31
│ └── report.txt
├── SOAP_71
│ └── report.txt
├── SOAP_101
│ └── report.txt
└── SPAdes
└── report.txt
 | SOAP-31 | SOAP-71 | SOAP-101 | SPAdes | MaSuRCA |
---|---|---|---|---|---|
contigs (>= 0 bp) | 804 | 1093 | 666 | 118 | 86 |
contigs (>= 1000 bp) | 153 | 168 | 228 | 68 | 77 |
contigs (>= 5000 bp) | 112 | 116 | 139 | 54 | 69 |
contigs (>= 10000 bp) | 91 | 88 | 98 | 46 | 62 |
contigs (>= 25000 bp) | 42 | 40 | 30 | 33 | 40 |
contigs (>= 50000 bp) | 10 | 10 | 5 | 18 | 18 |
Total length (>= 0 bp) | 3014979 | 3009525 | 2928498 | 2882278 | 2825384 |
Total length (>= 1000 bp) | 2913057 | 2855131 | 2826267 | 2866062 | 2817661 |
Total length (>= 5000 bp) | 2802562 | 2719768 | 2584824 | 2827098 | 2796700 |
Total length (>= 10000 bp) | 2639118 | 2521114 | 2293266 | 2771061 | 2745806 |
Total length (>= 25000 bp) | 1840014 | 1726238 | 1176782 | 2567729 | 2388905 |
Total length (>= 50000 bp) | 762950 | 688478 | 321294 | 2006774 | 1610768 |
no. of contigs | 171 | 189 | 260 | 76 | 86 |
Largest contig | 104842 | 91098 | 91744 | 198900 | 220219 |
Total length | 2926296 | 2869922 | 2849555 | 2871195 | 2825384 |
GC (%) | 32.55 | 32.62 | 32.62 | 32.65 | 32.75 |
N50 | 30720 | 32160 | 22066 | 106730 | 54020 |
N75 | 16594 | 16380 | 12093 | 39103 | 32399 |
L50 | 29 | 30 | 41 | 10 | 15 |
L75 | 59 | 61 | 84 | 22 | 31 |
According to our requirements regarding N50 and number of contigs, it would appear that the best assembly was produced by SPAdes. (The N50 value indicates that half the genome is assembled in contigs/scaffolds of length N50 or longer.)
Bowtie2 is a tool you would use for comparative genomics via alignment. Bowtie2 takes a Bowtie 2 index and a set of sequencing read files and outputs a set of alignments in a SAM file. Alignment is the process of discovering how and where the read sequences are similar to a reference sequence: a way of lining up the characters in the read with characters from the reference to reveal how they are similar.
Alignment is an educated guess as to where the read originated with respect to the reference genome; it is not always possible to determine this with certainty.
Bowtie2 in our case takes read sequences and aligns them with long reference sequences. Since this is a de novo assembly, we will align the trimmed reads back to each assembled genome and use the alignment rate as a quality measure.
Our working directory will be:
short_read_assembly/
├── 05_bowtie2
The first step in aligning reads with bowtie2 is to build an index of the genome we assembled. This can be done with the bowtie2-build command.
bowtie2-build [options] <reference-in> <index-base-name>
Once you have built the index, the next step is to align the reads to the genome using the index you built.
bowtie2 [options] -x <bt-index> -1 <paired-reads> -2 <paired-reads> -S <SAM-output> \
--threads 8 2>output.err
Here we redirect the standard error stream to a file, which will contain the alignment statistics.
Since we used three de novo methods to construct genomes, we will now see how well each assembly has its reads aligning back to it.
- SOAP
mkdir -p SOAP_31_index SOAP_71_index SOAP_101_index
module load bowtie2/2.3.5.1
## SOAP_31
bowtie2-build \
--threads 8 \
../03_assembly/SOAP/graph_Sample_31.scafSeq SOAP_31_index/SOAP_31_index
bowtie2 -x SOAP_31_index/SOAP_31_index \
-1 ../02_quality_control/trim_Sample_R1.fastq -2 ../02_quality_control/trim_Sample_R2.fastq \
-S SOAP_31.bowtie2.sam \
--threads 8 2>SOAP_31.err
## SOAP_71
bowtie2-build \
--threads 8 \
../03_assembly/SOAP/graph_Sample_71.scafSeq SOAP_71_index/SOAP_71_index
bowtie2 -x SOAP_71_index/SOAP_71_index \
-1 ../02_quality_control/trim_Sample_R1.fastq -2 ../02_quality_control/trim_Sample_R2.fastq \
-S SOAP_71.bowtie2.sam \
--threads 8 2>SOAP_71.err
## SOAP_101
bowtie2-build \
--threads 8 \
../03_assembly/SOAP/graph_Sample_101.scafSeq SOAP_101_index/SOAP_101_index
bowtie2 -x SOAP_101_index/SOAP_101_index \
-1 ../02_quality_control/trim_Sample_R1.fastq -2 ../02_quality_control/trim_Sample_R2.fastq \
-S SOAP_101.bowtie2.sam \
--threads 8 2>SOAP_101.err
The full slurm script is called bowtie2_SOAP.sh, and can be found in the 05_bowtie2 directory.
- SPAdes
# SPAdes
mkdir -p SPAdes_index
module load bowtie2/2.3.5.1
bowtie2-build \
--threads 8 \
../03_assembly/SPAdes/scaffolds.fasta SPAdes_index/SPAdes_index
bowtie2 -x SPAdes_index/SPAdes_index \
-1 ../02_quality_control/trim_Sample_R1.fastq -2 ../02_quality_control/trim_Sample_R2.fastq \
-S SPAdes.bowtie2.sam \
--threads 8 2>SPAdes.err
The full slurm script for running bowtie2 on the genome created with SPAdes is called bowtie2_SPAdes.sh, which can be found in the 05_bowtie2/ directory.
- MaSuRCA
mkdir -p MaSuRCA_index
module load bowtie2/2.3.5.1
bowtie2-build \
--threads 8 \
../03_assembly/MaSuRCA/CA/final.genome.scf.fasta MaSuRCA_index/MaSuRCA_index
bowtie2 -x MaSuRCA_index/MaSuRCA_index \
-1 ../02_quality_control/trim_Sample_R1.fastq -2 ../02_quality_control/trim_Sample_R2.fastq \
-S MaSuRCA.bowtie2.sam \
--threads 8 2>MaSuRCA.err
The full slurm script for running bowtie2 on the genome created with MaSuRCA is called bowtie2_MaSuRCA.sh, which can be found in the 05_bowtie2/ directory.
As shown below for SOAP_31, each of the above runs (SOAP_71, SOAP_101, SPAdes, MaSuRCA) will create an analogous set of files and folders:
05_bowtie2/
├── SOAP_31_index/
│ ├── SOAP_31_index.1.bt2
│ ├── SOAP_31_index.2.bt2
│ ├── SOAP_31_index.3.bt2
│ ├── SOAP_31_index.4.bt2
│ ├── SOAP_31_index.rev.1.bt2
│ └── SOAP_31_index.rev.2.bt2
├── SOAP_31.bowtie2.sam
└── SOAP_31.err
We have only shown the output directory and files for the SOAP_31 case, as an example. The alignment data will be in the *.err files associated with each run.
05_bowtie2/
├── SOAP_31.err
├── SOAP_71.err
├── SOAP_101.err
├── SPAdes.err
└── MaSuRCA.err
The following table summarizes the alignment results from bowtie2.
 | SOAP-31 | SOAP-71 | SOAP-101 | SPAdes | MaSuRCA |
---|---|---|---|---|---|
reads | 305418 | 305418 | 305418 | 305418 | 305418 |
paired | 305418 | 305418 | 305418 | 305418 | 305418 |
aligned concordantly 0 times | 36.59% | 26.66% | 22.01% | 19.22% | 21.67% |
aligned concordantly exactly 1 time | 63.41% | 73.05% | 77.33% | 79.73% | 75.02% |
aligned concordantly >1 times | 0.01% | 0.30% | 0.66% | 1.04% | 3.31% |
overall alignment rate | 84.96% | 93.98% | 98.10% | 99.83% | 96.49% |
Among the de novo assemblies, the SPAdes assembly has the best overall alignment rate and the highest fraction of reads aligning concordantly exactly one time to the assembled genome.
Here, we describe the use of the BUSCO tool suite to assess the completeness of genomes, gene sets, and transcriptomes, using their gene content as a complementary method to common technical metrics.
To assess the quality and completeness of an assembly, different metrics can be used: alignment rates and contig/scaffold length distributions such as N50 reflect assembly contiguity. These techniques, however, ignore the biological aspects regarding gene content. Gene content is an important aspect when you have transcriptomic data: you can test the comprehensiveness of the gene set by aligning the transcripts to the assembly and assessing how much aligns. However, aligning spliced transcripts to genomic regions can be problematic, and the result depends on the tools and parameters used for mapping. This leads us to look for an alternative, and BUSCO (Benchmarking Universal Single-Copy Orthologs) provides a way of assessing genome assemblies based on gene content.
The BUSCO data set is made up of protein coding genes from OrthoDB orthologous groups that contain single copy genes across that group. Consensus sequence profiles are built from multiple alignments across the species group. As more species are sequenced, the BUSCO data set is updated with more lineages. The program assists with checking assemblies, gene sets, annotations, and transcriptomes to see if they appear complete, using their gene content as a complementary method to common technical metrics. It does this by taking an orthologous gene set for your species of interest and searching for it in the genome of interest, taking into consideration possible evolutionary changes.
We will now evaluate our assembled genomes using BUSCO. For this we will use the BUSCO databases that have been downloaded to the Xanadu cluster (/isg/shared/databases/BUSCO/). When using the BUSCO database, make sure you are using the latest version by checking the BUSCO database page. In this tutorial we are using the current database version, 'odb10', which is compatible with BUSCO software version 4.
Our working directory will be:
short_read_assembly/
└── 06_busco
The following commands will be used to evaluate the SOAP, SPAdes, and MaSuRCA assemblies with BUSCO.
- SOAP:
busco -i ../03_assembly/SOAP/graph_Sample_31.scafSeq \
-o SOAP_31 -l /isg/shared/databases/BUSCO/odb10/bacteria_odb10 -m genome
busco -i ../03_assembly/SOAP/graph_Sample_71.scafSeq \
-o SOAP_71 -l /isg/shared/databases/BUSCO/odb10/bacteria_odb10 -m genome
busco -i ../03_assembly/SOAP/graph_Sample_101.scafSeq \
-o SOAP_101 -l /isg/shared/databases/BUSCO/odb10/bacteria_odb10 -m genome
The full script for evaluating the three SOAP assemblies is called busco_SOAP.sh, and can be found in the short_read_assembly/06_busco folder.
- SPAdes:
busco -i ../03_assembly/SPAdes/scaffolds.fasta \
-o SPAdes \
-l /isg/shared/databases/BUSCO/odb10/bacteria_odb10 \
-m genome
The full script for evaluating the SPAdes assembly is called busco_SPAdes.sh, and can be found in the short_read_assembly/06_busco folder.
- MaSuRCA:
busco -i ../03_assembly/MaSuRCA/CA/final.genome.scf.fasta \
-o MaSuRCA \
-l /isg/shared/databases/BUSCO/odb10/bacteria_odb10 \
-m genome
The full script for evaluating the MaSuRCA assembly is called busco_MaSuRCA.sh, and can be found in the short_read_assembly/06_busco folder.
The general command for evaluating an assembly is:
usage: busco -i [SEQUENCE_FILE] -l [LINEAGE] -o [OUTPUT_NAME] -m [MODE] [OTHER OPTIONS]
The command options which we will be using:
-i FASTA FILE Input sequence file in FASTA format
-l LINEAGE Specify the name of the BUSCO lineage
-o OUTPUT Output folders and files will be labelled with this name
-m MODE BUSCO analysis mode
- geno or genome, for genome assemblies (DNA)
- tran or transcriptome, for transcriptome assemblies (DNA)
- prot or proteins, for annotated gene sets (protein)
In our run we are evaluating a genome assembly, i.e. nucleotide sequence in contigs or scaffolds, using the options specified above. In this example we are only scanning against the bacterial database in BUSCO. Make sure to use the database that suits your assembly; to see the other databases, please check the BUSCO website, or list the current databases with the following command after loading the module: busco --list-datasets. On the Xanadu cluster, the BUSCO databases have been downloaded so you do not have to download them yourself; to check the paths for calling the lineages, please check the Xanadu database page.
Once you have executed the above commands, BUSCO will print a score to the standard output file and also write a comprehensive summary to the 'short_summary.specific.bacteria_odb10.[OUTPUT_folder_name].txt' file inside the output destination specified in your command. In our case:
06_busco/
├── SOAP_31/
│ └── short_summary.specific.bacteria_odb10.SOAP_31.txt
├── SOAP_71/
│ └── short_summary.specific.bacteria_odb10.SOAP_71.txt
├── SOAP_101/
│ └── short_summary.specific.bacteria_odb10.SOAP_101.txt
├── SPAdes/
│ └── short_summary.specific.bacteria_odb10.SPAdes.txt
└── MaSuRCA/
└── short_summary.specific.bacteria_odb10.MaSuRCA.txt
The above output will contain:
SOAP_31
C:86.3%[S:86.3%,D:0.0%],F:8.9%,M:4.8%,n:124
SOAP_71
C:91.1%[S:91.1%,D:0.0%],F:8.1%,M:0.8%,n:124
SOAP_101
C:98.4%[S:98.4%,D:0.0%],F:1.6%,M:0.0%,n:124
SPAdes
C:100.0%[S:100.0%,D:0.0%],F:0.0%,M:0.0%,n:124
MaSuRCA
C:100.0%[S:100.0%,D:0.0%],F:0.0%,M:0.0%,n:124
The BUSCO output uses the scoring scheme C: complete [S: single-copy, D: duplicated], F: fragmented, and M: missing, with the total number of BUSCO genes indicated by n:.
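As a worked example of reading these scores: for SOAP_31, with n:124 bacterial BUSCO genes, C:86.3% corresponds to 107 complete genes, F:8.9% to 11 fragmented genes, and M:4.8% to 6 missing genes (107 + 11 + 6 = 124).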
To judge the score, you need to consider the type of sequence first. A model organism with a reference genome will often reach a complete score of 95% or above, while non-model organisms can reach complete scores from 50% to 95%. This alone will not tell you how good the assembly is; you need to look at the assembly and annotation results together to make a judgement.
The full_table.tsv file contains the detailed list of BUSCO genes and their predicted status in the genome. These files can be found in:
06_busco/
├── SOAP_31
│ └── run_bacteria_odb10
│ └── full_table.tsv
├── SOAP_71
│ └── run_bacteria_odb10
│ └── full_table.tsv
├── SOAP_101
│ └── run_bacteria_odb10
│ └── full_table.tsv
├── SPAdes
│ └── run_bacteria_odb10
│ └── full_table.tsv
└── MaSuRCA
└── run_bacteria_odb10
└── full_table.tsv
If you look into these files you will find the above information. As an example, we show a few lines of the SPAdes full_table.tsv file.
# Busco id Status Sequence Gene Start Gene End Score Length
4421at2 Complete NODE_19_length_52925_cov_29.157601_19 21444 25067 1718.7 1051 https://www.orthodb.org/v10?query=4421at2 DNA-directed RNA polyme
9601at2 Complete NODE_19_length_52925_cov_29.157601_20 25204 28755 1299.7 812 https://www.orthodb.org/v10?query=9601at2 DNA-directed RNA polyme
26038at2 Complete NODE_5_length_175403_cov_22.338455_89 93427 95616 332.5 404 https://www.orthodb.org/v10?query=26038at2 phosphoribosylf
Other than the options we used, there are many other options available depending on your data. A more detailed view of the options is described in the BUSCO manual as well as in the paper PMID: 31020564.
Long read sequencing in general has the potential to overcome limitations of short read sequencing methods. Long read sequencing has a few main advantages over today's short read technologies. Notably, the long sequences are produced from single DNA molecules, the sequencing happens in real time, and neither sequencing nor library preparation needs PCR amplification, which reduces PCR-based errors. This reduces downstream errors arising from copy errors and sequence-dependent biases. Since the DNA is kept in its native state, there is also the opportunity to detect DNA modifications such as methylation. Two of the most prominent long read sequencing methods were developed by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). Both single molecule real-time (SMRT) sequencing by PacBio and nanopore sequencing by ONT use single molecule sequencing techniques.
In this tutorial we will focus on long reads produced by the nanopore sequencer (ONT). The basic concept of the method was developed from an understanding of how potential patterns change when ions pass through a channel with a potential gradient. The method was developed to identify the signal generated when a single DNA molecule passes through a very narrow channel. In a sequencing run, when a DNA molecule passes through the channel, the signal generated reflects the modulation of the ionic current in the pore. The signals produced are stored in 'FAST5' format, a specialization of the HDF5 format.
During our workflow, we will take basecalled reads and assemble the long reads using two assemblers (Flye and Shasta). The assemblies will be polished via Medaka, and Purge Haplotigs will be used to ensure that assembled contigs are not kept together with their corresponding haplotigs. The quality of the assembled genome will be assessed using BUSCO and QUAST.
Raw reads produced by the Oxford Nanopore sequencer are stored in FAST5 format. The sequencing information stored in these files as an electrical signal needs to be converted into nucleotides (A, T, G and C), which we can understand. This conversion is done by basecalling software called Guppy, and it is performed by the sequencing center once the sequencing is done. Once the basecalling is done, the reads are filtered into two groups, passed and failed, and the passed reads are concatenated into a single file. The following is a guppy basecaller command which is used to basecall the raw reads.
module load guppy/2.3.1
guppy_basecaller \
-i ../01_raw_reads \
-s . \
--flowcell FLO-MIN106 \
--kit SQK-LSK109 \
--qscore_filtering \
--cpu_threads_per_caller 16
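With --qscore_filtering, guppy writes the passing reads into a pass/ subdirectory of the output folder; a minimal sketch of concatenating them into a single file (paths assumed, and a FASTQ-to-FASTA conversion would follow to match the file used below):
cat pass/*.fastq > 5074_test_LSK109_30JAN19-reads-pass.fastq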
This raw fasta file is located in the following directory:
02_basecall_pass
├── 5074_test_LSK109_30JAN19-reads-pass.fasta
└── sequencing_summary.txt
This is the fasta file we will use to demonstrate the assembly of long reads.
Here we use NanoPlot, a tool developed to evaluate the statistics of long read data from Oxford Nanopore Technologies and Pacific Biosciences.
Working directory:
long_read_assembly/
├── 03_qc
The full slurm script, called nanoplot.sh, is located in the 03_qc directory.
NanoPlot --summary ../02_basecall_pass/sequencing_summary.txt \
--loglength \
-o summary-plots-log-transformed \
-t 10
This will create summary files and figures in the output directory specified in the command, along with an HTML report.
Here we will use the Centrifuge software to classify the DNA sequences in microbial samples. It uses a novel indexing scheme with a relatively small index that allows fast classification.
The command used:
centrifuge -f \
-x /isg/shared/databases/centrifuge/b+a+v+h/p_compressed+h+v \
--report-file report.tsv \
--quiet \
--min-hitlen 50 \
-U ../02_basecall_pass/5074_test_LSK109_30JAN19-reads-pass.fasta
Command options:
Usage: centrifuge [options]* -x <cf-idx> -U <r> [-S <filename>] [--report-file <report>]
<cf-idx> Index filename prefix (minus trailing .X.cf)
-U can be comma-separated file lists
--min-hitlen minimum length of partial hits (default 22, must be greater than 15)
-p/--threads number of alignment threads
--quiet print nothing to stderr except serious errors
The complete slurm script is called centrifuge.sh and can be found in the 03_centrifuge folder. This will produce a classification output and a report on the contaminants. We will use a python script to remove the contaminants from the fasta file.
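The python script itself is not reproduced here, but a hedged shell sketch of the same idea, assuming the per-read classification was saved to centrifuge_out.tsv (read ID in column 1, taxID in column 3, taxID 0 meaning unclassified, i.e. not matching the prokaryote/human/virus index) and that seqtk is available:
awk '$3 == 0 {print $1}' centrifuge_out.tsv | sort -u > keep_ids.txt   # reads with no contaminant hit
seqtk subseq ../02_basecall_pass/5074_test_LSK109_30JAN19-reads-pass.fasta keep_ids.txt > physcomitrellopsis_africana_rmv_contam.fasta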
Initial coverage = 112X
Post centrifuge coverage = 64X
Working directory:
long_read_assembly/
├── ILLUMINA_DATA/
│ ├── 02_Kraken/
For this species we also had Illumina short reads. In order to check for contaminants in the reads we used the Kraken software. The Illumina short reads were first trimmed using Trimmomatic before running the contaminant screening with Kraken.
Kraken command:
module load kraken/2.0.8-beta
module load jellyfish/2.2.6
kraken2 -db /isg/shared/databases/kraken2/Standard \
--paired cat_R1.fq cat_R2.fq \
--use-names \
--threads 16 \
--output kraken_general.out \
--unclassified-out unclassified#.fastq \
--classified-out classified#.fastq \
--report kraken_report.txt \
--use-mpa-style
The complete slurm script is called Kraken.sh. This will create the sequence outputs, a report, and a summary report.
├── classified_1.fastq
├── classified_2.fastq
├── kraken_report.txt
├── kraken_general.out
├── unclassified_1.fastq
└── unclassified_2.fastq
From the Kraken2 report we can extract the following summary using:
grep -v "|" kraken_report.txt
which will give :
d__Bacteria 197855570
d__Eukaryota 1270391
d__Archaea 62343
d__Viruses 15501
s__unidentified 6
Also if you look at the *.err file it will show:
254808872 sequences (47531.49 Mbp) processed in 1266.999s (12066.7 Kseq/m, 2250.90 Mbp/m).
199246490 sequences classified (78.19%)
55562382 sequences unclassified (21.81%)
Initial coverage of Illumina data (post Trimmomatic) = 50X
After Kraken filtering = 11X
In this step we will run two assemblers, Flye and Shasta, on the basecalled ONT data.
Working directory:
long_read_assembly/
├── 04_assembly
The Flye assembler takes data from PacBio or Oxford Nanopore sequencers and outputs polished contigs. It builds a repeat graph, which is similar in appearance to a de Bruijn graph. The manner in which this graph is assembled reveals the repeats in the genome, allowing for a more accurate assembly.
Command for running flye assembly:
flye --nano-raw ../../03_centrifuge/physcomitrellopsis_africana_rmv_contam.fasta \
--genome-size 500m \
--threads 32 \
--out-dir /UCHC/PublicShare/CBC_Tutorials/Genome_Assembly/long_read_assembly/04_assembly/flye
Usage information:
usage: flye
--nano-raw ONT raw reads
--genome-size estimated genome size
--out-dir Output directory
--threads number of parallel threads
The full slurm script is called flye.sh and can be found in the 04_assembly/flye directory. This will create the following files inside the output folder; the assembled fasta file is called scaffolds.fasta or assembly.fasta.
flye/
├── 00-assembly
├── 10-consensus
├── 20-repeat
├── 21-trestle
├── 30-contigger
├── 40-polishing
├── assembly.fasta
├── assembly.fasta.fai
├── assembly.fasta.mmi
├── assembly_graph.gfa
├── assembly_graph.gv
├── assembly_info.txt
├── params.json
└── scaffolds.fasta
In this section we will use flye assembly with 3 polishing iterations.
flye --nano-raw ../../02_basecall_pass/5074_test_LSK109_30JAN19-reads-pass.fasta \
--genome-size 500m \
--threads 32 \
--iterations 3 \
--out-dir .
Flye usage:
usage: flye ( --nano-raw ) file1 [file_2 ...] [options]
optional arguments:
--nano-raw path [path ...] ONT raw reads
--genome-size size estimated genome size (for example, 5m or 2.6g)
--out-dir path Output directory
--iterations int number of polishing iterations
--threads int number of parallel threads
The full slurm script is called flye.sh.
The output will contain:
flye_polish/
├── 00-assembly
├── 10-consensus
├── 20-repeat
├── 21-trestle
├── 30-contigger
├── 40-polishing
├── assembly.fasta
├── assembly_graph.gfa
├── assembly_graph.gv
├── assembly_info.txt
├── flye.log
├── flye.sh
├── params.json
└── scaffolds.fasta --> [full-path]/assembly.fasta
The Shasta assembler was developed with rapid, accurate assembly and polishing in mind. More information on the Shasta assembler can be found here. We are using Shasta for the ONT reads.
Working directory is:
long_read_assembly/
├── 04_assembly/
│ ├── shasta/
Command for running shasta assembly:
shasta --input ../../03_centrifuge/physcomitrellopsis_africana_rmv_contam.fasta \
--Reads.minReadLength 1000 \
--memoryMode anonymous \
--memoryBacking 4K \
--threads 32
Usage information:
shasta [options]
--input Names of input files containing reads
--Reads.minReadLength Read length cutoff. Reads shorter than this number of bases are discarded
--memoryMode Specify whether allocated memory is anonymous or backed by a filesystem
--memoryBacking Specify the type of pages used to back memory
--threads Number of threads
The full slurm script is called shasta.sh and can be found in the long_read_assembly/04_assembly/shasta/ directory. After running the command it will create an Assembly.fasta file along with other files:
shasta/
├── ShastaRun
│ ├── Assembly.fasta
└── shasta.sh
In this section we will evaluate the initial assemblies created by the above assemblers. Working directory:
long_read_assembly/
├── 05_initial_assembly_evaluation/
Here we will evaluate the assemblies using BUSCO. Working directory:
05_initial_assembly_evaluation/
├── busco/
The following commands will be used to evaluate the flye and shasta initial assemblies.
- flye:
busco -i ../04_assembly/flye/assembly.fasta \
-o busco_flye -l /isg/shared/databases/BUSCO/odb10/viridiplantae_odb10 -m genome
Complete slurm script called busco_flye.sh can be found in the busco directory.
- shasta:
busco -i ../04_assembly/shasta/ShastaRun/Assembly.fasta \
-o busco_shasta -l /isg/shared/databases/BUSCO/odb10/viridiplantae_odb10 -m genome
Complete slurm script called busco_shasta.sh can be found in the busco directory.
- flye with polish:
busco -i ../../04_assembly/flye_polish/assembly.fasta \
-o busco_flye_polish -l /isg/shared/databases/BUSCO/odb10/viridiplantae_odb10 -m genome
Complete slurm script called busco_flye_polish.sh can be found in the busco directory.
NOTE: When running BUSCO you need to copy the augustus config directory to a location where you have write permission, and the path to the augustus config directory should be exported beforehand.
General usage of the command:
usage: busco -i [SEQUENCE_FILE] -l [LINEAGE] -o [OUTPUT_NAME] -m [MODE] [OTHER OPTIONS]
The command options we will be using:
-i FASTA FILE Input sequence file in FASTA format
-l LINEAGE Specify the name of the BUSCO lineage
-o OUTPUT Output folders and files will be labelled with this name
-m MODE BUSCO analysis mode
- geno or genome, for genome assemblies (DNA)
- tran or transcriptome, for transcriptome assemblies (DNA)
- prot or proteins, for annotated gene sets (protein)
Summary of the initial assembly assessment using BUSCO:
QUAST will again be used to evaluate the genome assemblies, reporting the number of contigs, total length, and N50 value: the statistics we are most interested in.
Working directory:
05_initial_assembly_evaluation/
├── quast/
- flye assembly :
quast.py ../../04_assembly/flye/assembly.fasta \
--threads 8 \
-o quast_flye
Complete slurm script called quast_flye.sh can be found in the quast directory.
- shasta assembly :
quast.py ../../04_assembly/shasta/ShastaRun/Assembly.fasta \
--threads 8 \
-o quast_shasta
Complete slurm script called quast_shasta.sh can be found in the quast directory.
- flye assembly with polish:
quast.py ../../04_assembly/flye_polish/assembly.fasta \
--threads 16 \
-o quast_flye_polish
Complete slurm script called quast_flye_polish.sh can be found in the quast directory.
Summary of the initial assembly assessment using QUAST:
 | Flye | Shasta | Flye with Polish |
---|---|---|---|
contigs (>= 0 bp) | 9835 | 10134 | 9874 |
contigs (>= 1000 bp) | 9397 | 7942 | 9562 |
contigs (>= 5000 bp) | 7671 | 6278 | 7991 |
contigs (>= 10000 bp) | 6395 | 4826 | 6643 |
contigs (>= 25000 bp) | 4292 | 2463 | 4346 |
contigs (>= 50000 bp) | 2747 | 904 | 2706 |
Total length (>= 0 bp) | 539643975 | 220312440 | 507805683 |
Total length (>= 1000 bp) | 539340131 | 219821137 | 507598179 |
Total length (>= 5000 bp) | 534538564 | 214920226 | 503143942 |
Total length (>= 10000 bp) | 525247378 | 204318472 | 493296072 |
Total length (>= 25000 bp) | 491112867 | 165543884 | 456093461 |
Total length (>= 50000 bp) | 435500086 | 110696715 | 396803705 |
no. of contigs | 9806 | 8261 | 9828 |
Largest contig | 6909730 | 4103446 | 6326904 |
Total length | 539633067 | 220054765 | 507788936 |
GC (%) | 38.13 | 43.63 | 38.00 |
N50 | 152392 | 50297 | 128968 |
N75 | 646489 | 25268 | 56203 |
L50 | 823 | 891 | 910 |
L75 | 2207 | 2443 | 2406 |
We polish the assemblies using Medaka. Polishing means calculating a better consensus sequence for a draft genome assembly; Medaka can also find base modifications and call SNPs with respect to a reference genome.
Medaka command:
medaka_consensus -i ${BASECALLS} -d ${DRAFT} -o ${OUTDIR} -t 16
Command options used:
medaka_consensus [-h] -i <fastx>
-i fastx input basecalls (required)
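As an illustration, the placeholders for polishing the flye assembly might be filled in as follows (hypothetical paths based on this tutorial's layout):
BASECALLS=../../02_basecall_pass/5074_test_LSK109_30JAN19-reads-pass.fasta   # reads used for polishing
DRAFT=../../04_assembly/flye/assembly.fasta                                  # draft assembly to polish
OUTDIR=.
medaka_consensus -i ${BASECALLS} -d ${DRAFT} -o ${OUTDIR} -t 16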
Working directory:
long_read_assembly/
├── 06_error_correction
│ ├── flye_assembly/
The complete slurm script is called medaka.sh and is located in the flye_assembly directory.
This will produce the following files along with the error corrected consensus.fasta file:
flye_assembly
├── calls_to_draft.bam
├── calls_to_draft.bam.bai
├── consensus.fasta
├── consensus_probs.hdf
└── medaka.sh
long_read_assembly/
├── 06_error_correction
│   ├── shasta_assembly/
The complete slurm script is called medaka.sh and is located in the shasta_assembly directory.
This will produce the following files along with the error corrected consensus.fasta file:
shasta_assembly/
├── calls_to_draft.bam
├── calls_to_draft.bam.bai
├── consensus.fasta
├── consensus_probs.hdf
└── medaka.sh
In this section we will evaluate the assembly after error correction.
Working directory:
long_read_assembly/
├── 07_evaluation_after_error_correction/
BUSCO evaluation:
07_evaluation_after_error_correction/
├── busco/
- flye:
export AUGUSTUS_CONFIG_PATH=../../augustus/config
busco -i ../../06_error_correction/flye_assembly/consensus.fasta \
-o busco_flye -l /isg/shared/databases/BUSCO/odb10/viridiplantae_odb10 -m genome
Complete slurm script called busco_flye.sh can be found in the busco directory.
- shasta:
export AUGUSTUS_CONFIG_PATH=../../augustus/config
busco -i ../../06_error_correction/shasta_assembly/consensus.fasta \
-o busco_shasta -l /isg/shared/databases/BUSCO/odb10/viridiplantae_odb10 -m genome
Complete slurm script called busco_shasta.sh can be found in the busco directory.
Summary of the statistics:
flye assembly stats after error correction:
C:82.4%[S:74.4%,D:8.0%],F:9.4%,M:8.2%,n:425
shasta assembly stats after error correction:
C:69.4%[S:65.9%,D:3.5%],F:10.8%,M:19.8%,n:425
Current working directory:
long_read_assembly/
├── 08_purge
Purge Haplotigs is a pipeline to help with curating genome assemblies. It ensures that the assembly does not retain both a primary contig and its corresponding haplotig as separate sequences. It uses the reads mapped back to your assembly (via minimap2) to assess which contigs should be kept in the assembly.
We use this because some parts of a genome may have a very high degree of heterozygosity, which causes contigs for both haplotypes of that region to be assembled as separate primary contigs, rather than as a contig plus an associated haplotig; this can cause errors in downstream analysis.
What the purge haplotigs algorithm does is identify pairs of contigs that are syntenic and flag one of them as a haplotig. The pipeline uses mapped read coverage and blast/lastz alignments to determine which contigs to keep in the assembly. Dotplots are produced for all flagged contig matches to help the user analyze any remaining ambiguous contigs.
This must be done in separate steps, as the parameters of each step depend on the results of the previous steps.
When running purge haplotigs you must run the following commands separately, one after the other. A detailed tutorial on how to run purge haplotigs can be found here.
First you need to map your reads to the assembly using minimap2, and then sort and index the BAM file with samtools.
# map ONT reads, drop secondary alignments, sort, then index
minimap2 -t 16 -ax map-ont ${ref} ${read_file} \
| samtools view -hF 256 - \
| samtools sort -@ 16 -m 1G -o ${bam} -T tmp.ali
samtools index ${bam}
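As an illustration, the placeholders might be set as follows for the flye run (hypothetical paths based on this tutorial's layout):
ref=../../06_error_correction/flye_assembly/consensus.fasta                  # polished assembly
read_file=../../02_basecall_pass/5074_test_LSK109_30JAN19-reads-pass.fasta   # ONT reads
bam=flye_aligned.bam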
Then generate a coverage histogram using the following command.
purge_haplotigs readhist -b aligned.bam -g genome.fasta [ -t threads ]
REQUIRED:
-b / -bam BAM file of aligned and sorted reads/subreads to the reference
-g / -genome Reference FASTA for the BAM file.
OPTIONAL:
-t / -threads Number of worker threads to use, DEFAULT = 4, MINIMUM = 2
This will produce a histogram file in PNG format and a BEDTools 'genomecov' output file, which we will need for step 2.
Using the above histogram, select the cutoff values for low and high read depth, along with the midpoint between the haploid and diploid peaks.
purge_haplotigs contigcov -i aligned.bam.gencov -l 30 -m 80 -h 145
REQUIRED:
-i / -in The bedtools genomecov output that was produced from 'purge_haplotigs readhist'
-l / -low The read depth low cutoff (use the histogram to eyeball these cutoffs)
-h / -high The read depth high cutoff
-m / -mid The low point between the haploid and diploid peaks
OPTIONAL:
-o / -out Choose an output file name (CSV format, DEFAULT = coverage_stats.csv)
-j / -junk Auto-assign contig as "j" (junk) if this percentage or greater of the contig is
low/high coverage (DEFAULT = 80, > 100 = don't junk anything)
-s / -suspect Auto-assign contig as "s" (suspected haplotig) if this percentage or less of the
contig is diploid level of coverage (DEFAULT = 80)
This will produce the following file:
├── coverage_stats.csv
Finally, the following command is run to purge the haplotigs:
purge_haplotigs purge -b aligned.bam -g ${ref} -c coverage_stats.csv -d -a 60
Command options:
REQUIRED:
-g / -genome Genome assembly in fasta format. Needs to be indexed with samtools faidx.
-c / -coverage Contig by contig coverage stats csv file from the previous step.
-d / -dotplots Generate dotplots for manual inspection.
-a / -align_cov Percent cutoff for identifying a contig as a haplotig. DEFAULT = 70
Working directory:
08_purge/
├── flye/
Step1:
purge_haplotigs readhist -b flye_aligned.bam -g ${ref} -t 16
Step2:
purge_haplotigs contigcov -i flye_aligned.bam.gencov -l 3 -m 55 -h 195
Step3:
purge_haplotigs purge -b flye_aligned.bam -g ${ref} -c coverage_stats.csv -d -a 60
The complete slurm script, called purge_haplotigs.sh, can be found in the flye directory. It will produce a final curated fasta file called curated.fasta.
Working directory:
08_purge/
├── shasta/
Step1:
purge_haplotigs readhist -b shasta_aligned.bam -g ${ref} -t 16
Step2:
purge_haplotigs contigcov -i shasta_aligned.bam.gencov -l 5 -m 34 -h 195
Step3:
purge_haplotigs purge -b shasta_aligned.bam -g ${ref} -c coverage_stats.csv -d -a 60
The complete slurm script, called purge_haplotigs.sh, can be found in the shasta directory. It will produce a final curated fasta file called curated.fasta.
Working directory:
long_read_assembly/
├── 09_final_assembly_evaluation/
busco -i ../../08_purge/flye/curated.fasta \
-o flye_curated_busco -l /isg/shared/databases/BUSCO/odb10/viridiplantae_odb10 -m genome
The complete slurm script, called busco_flye.sh, can be found in the busco directory.
The summary of the output can be found in:
flye_curated_busco/
└── short_summary.specific.viridiplantae_odb10.flye_curated_busco.txt
This file contains the short summary of the evaluation:
C:80.2%[S:72.0%,D:8.2%],F:10.6%,M:9.2%,n:425
341 Complete BUSCOs (C)
306 Complete and single-copy BUSCOs (S)
35 Complete and duplicated BUSCOs (D)
45 Fragmented BUSCOs (F)
39 Missing BUSCOs (M)
425 Total BUSCO groups searched
busco -i ../../08_purge/shasta/curated.fasta \
-o shasta_curated_busco -l /isg/shared/databases/BUSCO/odb10/viridiplantae_odb10 -m genome
The complete slurm script, called busco_shasta.sh, can be found in the busco directory.
The summary of the output can be found in:
shasta_curated_busco/
└── short_summary.specific.viridiplantae_odb10.shasta_curated_busco.txt
This file contains the short summary of the evaluation:
C:68.5%[S:65.2%,D:3.3%],F:11.5%,M:20.0%,n:425
291 Complete BUSCOs (C)
277 Complete and single-copy BUSCOs (S)
14 Complete and duplicated BUSCOs (D)
49 Fragmented BUSCOs (F)
85 Missing BUSCOs (M)
425 Total BUSCO groups searched
QUAST will be used to evaluate the curated genome assemblies.
quast.py ../../08_purge/flye/curated.fasta \
--threads 8 \
-o quast_flye
Complete slurm script called quast_flye.sh can be found in the quast directory.
Summary of the statistics can be found in:
quast_flye
├── report.txt
quast.py ../../08_purge/shasta/curated.fasta \
--threads 8 \
-o quast_shasta
Complete slurm script called quast_shasta.sh can be found in the quast directory.
Summary of the statistics can be found in:
quast_shasta/
├── report.txt
Statistics summary of flye and shasta assemblies:
 | flye | shasta |
---|---|---|
contigs (>= 0 bp) | 6549 | 8119 |
contigs (>= 1000 bp) | 6392 | 7344 |
contigs (>= 5000 bp) | 5919 | 6005 |
contigs (>= 10000 bp) | 5366 | 4727 |
contigs (>= 25000 bp) | 3971 | 2437 |
contigs (>= 50000 bp) | 2611 | 887 |
Total length (>= 0 bp) | 492754620 | 212410761 |
Total length (>= 1000 bp) | 492661682 | 212090599 |
Total length (>= 5000 bp) | 491351568 | 208075243 |
Total length (>= 10000 bp) | 487138762 | 198692220 |
Total length (>= 25000 bp) | 463778615 | 161021099 |
Total length (>= 50000 bp) | 414473760 | 106478172 |
no. of contigs | 6501 | 7587 |
Largest contig | 6892258 | 4089676 |
Total length | 492739170 | 212267457 |
GC (%) | 38.43 | 43.52 |
N50 | 160624 | 50177 |
N75 | 72963 | 25781 |
L50 | 719 | 881 |
L75 | 1872 | 2366 |
Here we will align our reads to the assembled genome to check the alignment rate. Mapping of the reads will be done using minimap2, and the generated alignments will then be sorted into a BAM file.
# map-ont: minimap2 preset for Oxford Nanopore reads
minimap2 -t 16 -ax map-ont ${ref} ${read_file} \
    | samtools sort -@ 16 -m 2G -o ${aligned_bam} -T ${tmp_prefix}
bamtools stats -in ${aligned_bam}
- Complete slurm script for aligning the reads to the flye assembly, called minimap2_flye.sh, can be found in the minimap2 directory.
- Complete slurm script for aligning the reads to the shasta assembly, called minimap2_shasta.sh, can be found in the minimap2 directory.
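As a sketch of what one of those scripts might contain, here is a version of minimap2_flye.sh with the variables filled in. The module names/versions, the long-read file path, and the output BAM name are assumptions or placeholders:

```bash
#!/bin/bash
# (usual Slurm header, as in the other scripts in this tutorial)

module load minimap2/2.17   # assumed module versions; check module avail
module load samtools/1.9
module load bamtools/2.5.1

ref=../../08_purge/flye/curated.fasta       # assumed: the curated flye assembly
read_file=path/to/nanopore_reads.fastq.gz   # placeholder: the long-read FASTQ
aligned_bam=flye_aligned.bam                # output BAM name

# align with the Nanopore preset, sort to BAM, then summarize the alignments
minimap2 -t 16 -ax map-ont ${ref} ${read_file} \
    | samtools sort -@ 16 -m 2G -o ${aligned_bam} -T ${aligned_bam}.tmp
bamtools stats -in ${aligned_bam}
```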
Alignment statistics for the two assemblies:

| | flye | shasta |
|---|---|---|
| Total reads | 25964212 | 26949492 |
| Mapped reads | 22520313 (86.736%) | 18231482 (67.6506%) |
| Forward strand | 14787262 (56.9525%) | 17814705 (66.104%) |
| Reverse strand | 11176950 (43.0475%) | 9134787 (33.896%) |
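The numbers above come from bamtools stats; if bamtools is unavailable, samtools flagstat reports a similar total/mapped breakdown:

```bash
samtools flagstat flye_aligned.bam   # total records and mapped percentage
```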
In this section we will be working with hybrid assemblers, which are compatible with long read PacBio data and short read Illumina data. Nanopore and PacBio are currently the two main long read sequencing technologies. A major difference between them is that PacBio reads a molecule multiple times to generate high-quality consensus data, whereas Nanopore reads a molecule at most twice. As a result, PacBio generates data with lower error rates than Oxford Nanopore.
Performing a hybrid assembly requires both short read and long read data for the genome. The short reads are used to resolve ambiguities in the long read data as it is assembled. For this tutorial we are using data from a boxelder (Acer negundo) genome.
The first step is converting RS-II PacBio data into FASTA format, using the pbh5tools package provided by Pacific Biosciences. bash5tools.py
can extract read sequences and quality values from raw and circular consensus sequencing reads to create FASTA and FASTQ files.
bash5tools.py --outFilePrefix [PREFIX] --outType fasta --readType [TYPE] input.bas.h5
  input.bas.h5                          input .bas.h5 file
  --outFilePrefix OUTFILEPREFIX         prefix for the output file name
  --readType {ccs,subreads,unrolled}    read type to extract
  --outType {fasta,fastq}               output file type [default: fasta]
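For example, to extract the subreads as FASTA with a prefix matching the file shipped in Pacbio_DATA (the input .bas.h5 filename here is hypothetical):

```bash
bash5tools.py --outFilePrefix acne_pb --outType fasta --readType subreads movie.bas.h5
```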
Our PacBio data is located in the Pacbio_DATA directory:
Pacbio_DATA/
└── acne_pb.fasta
CCS takes multiple subreads of the same molecule and combines them using a statistical model to produce one accurate consensus sequence (HiFi read) with base quality values. For more information, refer to the PacBio Website; for more on CCS, see this PacBio CCS Tutorial. If you have a UConn PacBio account, please refer to this tutorial as well.
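If your data arrives as subread BAMs instead of .bas.h5 files, PacBio's ccs tool produces the HiFi consensus reads directly; a typical invocation looks like the following (the filenames are placeholders and the flag values are assumptions, so check the pbccs documentation):

```bash
# collapse the subreads of each molecule into one high-accuracy consensus read
ccs --min-passes 3 --min-rq 0.99 movie.subreads.bam movie.ccs.bam
```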
The short reads that we are going to use in this assembly are located in the ILLUMINA_DATA/ directory:
ILLUMINA_DATA/
├── XUMNS_20180703_K00134_IL100105003_S37_L007_R1.fastq.gz
├── XUMNS_20180703_K00134_IL100105003_S37_L007_R2.fastq.gz
├── XUMNS_20180703_K00134_IL100105003_S37_L008_R1.fastq.gz
└── XUMNS_20180703_K00134_IL100105003_S37_L008_R2.fastq.gz
Here we will assemble the Illumina short reads using MaSuRCA. Working directory:
hybrid_assembly/
├── Acer_negundo/
│ └── Short_Read_Assembly/
│ └── 01_masurca_assembly/
The MaSuRCA config file will look like the following. Each PE line gives a two-letter library prefix, the mean insert size (320 bp), its standard deviation (20 bp), and the paths to the forward and reverse read files:
DATA
PE = aa 320 20 ../../ILLUMINA_DATA/XUMNS_20180703_K00134_IL100105003_S37_L007_R1.fastq.gz ../../ILLUMINA_DATA/XUMNS_20180703_K00134_IL100105003_S37_L007_R2.fastq.gz
PE = ab 320 20 ../../ILLUMINA_DATA/XUMNS_20180703_K00134_IL100105003_S37_L008_R1.fastq.gz ../../ILLUMINA_DATA/XUMNS_20180703_K00134_IL100105003_S37_L008_R2.fastq.gz
END
PARAMETERS
EXTEND_JUMP_READS=0
GRAPH_KMER_SIZE = auto
USE_LINKING_MATES = 0
USE_GRID=0
GRID_ENGINE=SLURM
GRID_QUEUE=general
GRID_BATCH_SIZE=100000000
LHE_COVERAGE=25
MEGA_READS_ONE_PASS=0
LIMIT_JUMP_COVERAGE = 300
CA_PARAMETERS = cgwErrorRate=0.25
CLOSE_GAPS=1
NUM_THREADS = 32
JF_SIZE = 4500000000
SOAP_ASSEMBLY=0
FLYE_ASSEMBLY=0
END
The command we use is:
masurca config.txt   # reads the config and generates assemble.sh
./assemble.sh        # runs the assembly pipeline
The slurm script called masurca_assembly.sh can be found in the 01_masurca_assembly directory.
This will create the assembly using only the Illumina reads.
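A minimal sketch of what masurca_assembly.sh might contain (the module name/version and the resource requests are assumptions):

```bash
#!/bin/bash
#SBATCH --job-name=masurca_assembly
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 32                 # matches NUM_THREADS in the config
#SBATCH --mem=256G            # assumption: MaSuRCA needs generous memory
#SBATCH --partition=himem
#SBATCH --qos=himem
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

module load MaSuRCA/3.3.4     # assumed module name/version

masurca config.txt            # generate assemble.sh from the config
./assemble.sh                 # run the pipeline
```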
Final assembly file will be in the CA directory:
01_masurca_assembly/
├── CA/
│ ├── final.genome.scf.fasta
Working directory:
Short_Read_Assembly/
├── 02_evaluation/
Final assembly evaluation was done using QUAST:
quast.py ../01_masurca_assembly/CA/final.genome.scf.fasta \
--threads 16 \
-o masurca_assembly
The full slurm script called quast.sh, can be found in the 02_evaluation directory.
Resulting statistics on the assembly will be written to the report.txt file.
masurca_assembly/
├── report.txt
| | masurca |
|---|---|
| Total length (>= 50000 bp) | 51091 |
| contigs | 162899 |
| Largest contig | 51091 |
| Total length | 426302651 |
| N50 | 4563 |
BUSCO evaluation:
busco -i ../01_masurca_assembly/CA/final.genome.scf.fasta \
-o busco -l /isg/shared/databases/BUSCO/odb10/viridiplantae_odb10 -m genome
The full slurm script is called busco.sh.
The summary of the run will be in:
busco/
└── short_summary.specific.viridiplantae_odb10.busco.txt
This file contains the short summary of the evaluation:
C:79.0%[S:77.4%,D:1.6%],F:18.4%,M:2.6%,n:425
336 Complete BUSCOs (C)
329 Complete and single-copy BUSCOs (S)
7 Complete and duplicated BUSCOs (D)
78 Fragmented BUSCOs (F)
11 Missing BUSCOs (M)
425 Total BUSCO groups searched
Working directory:
hybrid_assembly/
├── Acer_negundo/
│ └── Long_Read_Assembly/
In this section we will assemble the PacBio long reads using the flye assembler.
Working directory:
Long_Read_Assembly/
├── 01_flye_assembly/
Command used:
flye --pacbio-raw ../Pacbio_DATA/acne_pb.fasta \
--genome-size 300m \
--threads 32 \
--out-dir .
The full slurm script called flye.sh can be found in the 01_flye_assembly directory.
Canu is another assembler that can be used here with only the PacBio reads. With useGrid=true, Canu submits its own jobs to the Slurm scheduler, and gridOptions passes the partition and resource requests for those jobs:
module load gnuplot/5.2.2
module load canu/2.1.1
canu useGrid=true \
    -p Acer_negundo -d canu_out \
    genomeSize=4.5g \
    -pacbio /UCHC/PublicShare/CBC_Tutorials/Genome_Assembly/hybrid_assembly/Acer_negundo/Pacbio_DATA/acne_pb.fasta \
    gridOptions="--partition=himem --qos=himem --mem-per-cpu=8000m --cpus-per-task=24"
Now we will evaluate the final assembly created by the flye assembler.
Working directory:
Long_Read_Assembly/
├── 02_evaluation/
Quast command used:
quast.py ../01_flye_assembly/assembly.fasta \
--threads 16 \
-o quast
The full slurm script called quast.sh can be found in 02_evaluation directory.
Resulting statistics on the assembly will be written to the report.txt file.
quast/
├── report.txt
| | flye |
|---|---|
| Total length (>= 50000 bp) | 441362208 |
| contigs | 1684 |
| Largest contig | 8254085 |
| Total length | 4451717887 |
| N50 | 1762694 |
BUSCO evaluation:
busco -i ../01_flye_assembly/assembly.fasta \
-o busco -l /isg/shared/databases/BUSCO/odb10/viridiplantae_odb10 -m genome
The full slurm script is called busco.sh.
The summary of the run will be in:
busco/
└── short_summary.specific.viridiplantae_odb10.busco.txt
This file contains the short summary of the evaluation:
C:98.1%[S:95.5%,D:2.6%],F:1.2%,M:0.7%,n:425
417 Complete BUSCOs (C)
406 Complete and single-copy BUSCOs (S)
11 Complete and duplicated BUSCOs (D)
5 Fragmented BUSCOs (F)
3 Missing BUSCOs (M)
425 Total BUSCO groups searched
Working directory:
hybrid_assembly/
├── Acer_negundo/
│ └── Hybrid_Assembly/
Working directory:
Hybrid_Assembly/
├── 01_masurca_assembly/
We will be using the Illumina short reads and the PacBio long reads to construct a hybrid assembly with MaSuRCA. As before, MaSuRCA needs a configuration file that points to the short read and long read data; here the PACBIO line supplies the long reads:
DATA
PE = aa 320 20 ../../ILLUMINA_DATA/XUMNS_20180703_K00134_IL100105003_S37_L007_R1.fastq.gz ../../ILLUMINA_DATA/XUMNS_20180703_K00134_IL100105003_S37_L007_R2.fastq.gz
PE = ab 320 20 ../../ILLUMINA_DATA/XUMNS_20180703_K00134_IL100105003_S37_L008_R1.fastq.gz ../../ILLUMINA_DATA/XUMNS_20180703_K00134_IL100105003_S37_L008_R2.fastq.gz
PACBIO = /UCHC/PublicShare/CBC_Tutorials/Genome_Assembly/hybrid_assembly/Acer_negundo/Pacbio_DATA/acne_pb.fasta
END
PARAMETERS
EXTEND_JUMP_READS=0
GRAPH_KMER_SIZE = auto
USE_LINKING_MATES = 0
USE_GRID=0
GRID_ENGINE=SLURM
GRID_QUEUE=himem
GRID_BATCH_SIZE=100000000
LHE_COVERAGE=25
MEGA_READS_ONE_PASS=0
LIMIT_JUMP_COVERAGE = 300
CA_PARAMETERS = cgwErrorRate=0.25
CLOSE_GAPS=1
NUM_THREADS = 32
JF_SIZE = 4500000000
SOAP_ASSEMBLY=0
FLYE_ASSEMBLY=0
END
The full slurm script called masurca.sh is located in the 01_masurca_assembly directory.
Final assembly file will be in the CA directory:
01_masurca_assembly/
├── CA.mr.41.17.20.0.02/
│ ├── final.genome.scf.fasta
Working directory:
Hybrid_Assembly/
├── 02_evaluation/
Quast command used:
quast.py ../01_masurca_assembly/CA.mr.41.17.20.0.02/final.genome.scf.fasta \
--threads 16 \
-o masurca_assembly
The full slurm script called quast.sh can be found in the 02_evaluation directory.
Resulting statistics on the assembly will be written to the report.txt file.
masurca_assembly/
├── report.txt
| | masurca |
|---|---|
| Total length (>= 50000 bp) | 485855121 |
| contigs | 1993 |
| Largest contig | 6444170 |
| Total length | 509195518 |
| N50 | 143 |
BUSCO evaluation:
busco -i ../01_masurca_assembly/CA.mr.41.17.20.0.02/final.genome.scf.fasta \
-o busco -l /isg/shared/databases/BUSCO/odb10/viridiplantae_odb10 -m genome
The full slurm script is called busco.sh.
The summary of the run will be in:
busco/
└── short_summary.specific.viridiplantae_odb10.busco.txt
This file contains the short summary of the evaluation:
C:98.6%[S:90.6%,D:8.0%],F:0.7%,M:0.7%,n:425
419 Complete BUSCOs (C)
385 Complete and single-copy BUSCOs (S)
34 Complete and duplicated BUSCOs (D)
3 Fragmented BUSCOs (F)
3 Missing BUSCOs (M)
425 Total BUSCO groups searched
- NanoPack: visualizing and processing long-read sequencing data. Bioinformatics, Volume 34, Issue 15, 1 August 2018, Pages 2666–2669.
- Shasta: Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nature Biotechnology, Volume 38, Pages 1044–1053 (2020).