RegCloser

1. Introduction

RegCloser proposes a robust regression framework as a general approach to DNA sequence assembly. It is applicable to both next-generation sequencing (NGS) and third-generation sequencing (TGS) data. In combination with any scaffolding method, it can be used as a genome gap-closing tool. In the overlap-layout-consensus (OLC) paradigm of de novo assembly, existing methods find a layout of reads by greedy search. In contrast, the robust regression approach generates a globally optimal layout, which is the minimizer of a convex loss function.
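To make the idea of a regression-based layout concrete, here is a minimal sketch, not RegCloser's actual implementation. It assumes each detected overlap gives an observed offset d between the start positions of two reads, and it estimates all positions by iteratively reweighted least squares (IRLS) with Huber-type weights, which correspond to a convex loss. The function name `layout_by_irls` and the parameters `c` and `trim` are hypothetical; they only mirror the roles that the options -r1 and -r2 play in section 4.5.

```python
import numpy as np

def layout_by_irls(n_reads, overlaps, c=2.0, trim=10.0, n_iter=50):
    """Estimate read start positions from pairwise offsets (illustrative only).

    overlaps: list of (i, j, d), meaning read j starts about d bases after read i.
    Read 0 is pinned at position 0 to remove the translation ambiguity.
    """
    # One row per overlap, encoding the model x[j] - x[i] = d + error.
    A = np.zeros((len(overlaps), n_reads))
    y = np.zeros(len(overlaps))
    for row, (i, j, d) in enumerate(overlaps):
        A[row, i], A[row, j], y[row] = -1.0, 1.0, d
    A = A[:, 1:]  # drop the column of read 0, which is fixed at 0

    w = np.ones(len(y))
    for _ in range(n_iter):
        sw = np.sqrt(w)
        # Weighted least squares step.
        x = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)[0]
        r = np.abs(A @ x - y)
        # Huber weights: 1 for small residuals, c / |r| for large ones (assumed choice).
        w = c / np.maximum(r, c)

    # Drop overlaps whose residual is still large, then refit without them.
    keep = r <= trim
    x = np.linalg.lstsq(A[keep], y[keep], rcond=None)[0]
    return np.concatenate(([0.0], x))

# Four reads spaced 100 bases apart, plus one false overlap (0, 3, 50),
# such as a repeat might produce.
overlaps = [(0, 1, 100), (1, 2, 100), (2, 3, 100),
            (0, 2, 200), (1, 3, 200), (0, 3, 50)]
print(layout_by_irls(4, overlaps))
```

In this toy input an ordinary least-squares fit would be pulled far off by the false overlap, whereas the reweighting reduces its influence until it can be excluded.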

RegCloser can currently be used or tested in the following scenarios:

Improve genome N50 using NGS libraries (a small-genome example is in test/S.aureus_data); the approach is scalable to large genomes.
Reconstruct complete and accurate microbial genomes from low-cost, accurate NGS data (an example is in test/E.coli_simulation).
Generate a globally optimal layout in the de novo assembly of TGS data (an example is in test/E.coli_long_reads).

P.S. We tested the applicability of the robust regression approach to layout generation on TGS long reads in de novo assembly. Compared with the gap-filling problem, a further complication is that the strand of the target genomic DNA from which each read originates is unknown. Therefore, we first use a heuristic algorithm to orient all reads in the connected graph, and then estimate their positions via robust regression.
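The text does not specify the orientation heuristic. One simple possibility, shown here only as an assumed stand-in, is a breadth-first traversal of the overlap graph: the root read is declared forward, and each newly reached read copies or flips the orientation of its neighbor depending on whether the overlap was found on the same strand. The function name `orient_reads` and the edge format are hypothetical.

```python
from collections import deque

def orient_reads(n_reads, edges, root=0):
    """Assign +1 / -1 orientations by breadth-first propagation (illustrative only).

    edges: list of (u, v, same_strand), where same_strand is True if reads
    u and v overlap in the same orientation and False if one of them must
    be reverse-complemented.
    """
    adjacency = {i: [] for i in range(n_reads)}
    for u, v, same_strand in edges:
        adjacency[u].append((v, same_strand))
        adjacency[v].append((u, same_strand))

    orientation = {root: 1}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v, same_strand in adjacency[u]:
            if v not in orientation:
                # Copy the neighbor's orientation, or flip it for opposite-strand overlaps.
                orientation[v] = orientation[u] if same_strand else -orientation[u]
                queue.append(v)
    return orientation  # reads outside the root's connected component are absent

edges = [(0, 1, False), (1, 2, False), (0, 2, True)]
print(orient_reads(3, edges))
```

This greedy propagation ignores conflicting edges; a real heuristic would need some rule for resolving them.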

2. Installation

You can download the software package by the command:

git clone https://github.com/csh3/RegCloser.git

or click the Download Zip button and decompress the software package.

3. Dependencies

The current version requires:

Python3 with the following modules: os, sys, re, argparse, biopython, numpy, math, networkx, scipy, collections, datetime, multiprocessing
BWA (version 0.7.17)
MultiAlignment_func_python.so

The first two can be installed through the Bioconda channel. The third is included in the software package; alternatively, you can compile the source code MultiAlignment_func_python.cpp on your machine with the following command.

g++ -fPIC MultiAlignment_func_python.cpp -o MultiAlignment_func_python.so -shared -I/home/miniconda3/include/python3.6m
# Here /home/miniconda3/include/python3.6m is the directory containing the header file Python.h; replace it with the corresponding path on your machine
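If you are unsure where Python.h lives, the interpreter can report its own include directory through the standard `sysconfig` module. The sketch below assembles the same g++ command with that directory substituted; the helper name `compile_command` is hypothetical, and the script only prints the command rather than running it.

```python
import sysconfig

def compile_command(source="MultiAlignment_func_python.cpp",
                    target="MultiAlignment_func_python.so"):
    """Build the g++ command line using this interpreter's include directory."""
    include_dir = sysconfig.get_paths()["include"]  # directory that should hold Python.h
    return ["g++", "-fPIC", source, "-o", target, "-shared", f"-I{include_dir}"]

print(" ".join(compile_command()))
```

Run it with the same Python interpreter that will later import the shared library, so that the header version matches.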

4. Usage

4.1. Pipeline

The main program is RunPipeline.py, and the pipeline consists of the following 7 modules. You can start or end at any module with the options -s and -e. The core innovation of our method lies in the module LocalAssembly, which uses the robust regression model and algorithm to assemble short reads into contigs.

1. InitialContig		Break the draft genome into contigs
2. Mapping 		    	Map sequence reads to the draft genome using BWA and identify anchored reads
3. HighDepth (optional)		Identify high depth regions in the contig ends
4. CollectReads			Collect reads in the gap regions for local assembly and make tab files of linking information between contigs 	
5. Re-Scaffold (optional) 	Generate new scaffolds from initial contigs using SSPACE_Standard_v3.0
6. ReEstimateGapSize		Re-estimate gap sizes between contigs
7. LocalAssembly 		Assemble the collected reads into contigs for gap closing via the robust regression approach
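The effect of -s and -e can be pictured as slicing the ordered module list. The sketch below is an illustration of that idea, not code taken from RunPipeline.py; the function `select_modules` is hypothetical, and it assumes the module names are spelled exactly as in the list above.

```python
MODULES = [
    "InitialContig", "Mapping", "HighDepth", "CollectReads",
    "Re-Scaffold", "ReEstimateGapSize", "LocalAssembly",
]
OPTIONAL = {"HighDepth", "Re-Scaffold"}  # marked "(optional)" in the list above

def select_modules(start=None, end=None, enabled_optional=()):
    """Return the modules that would run between start and end, inclusive."""
    first = MODULES.index(start) if start else 0
    last = MODULES.index(end) if end else len(MODULES) - 1
    if first > last:
        raise ValueError("the starting module comes after the ending module")
    return [m for m in MODULES[first:last + 1]
            if m not in OPTIONAL or m in enabled_optional]

print(select_modules(start="CollectReads"))
print(select_modules(end="Mapping"))
```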

We recommend using RegCloser iteratively: take the output genome as the input of the next iteration, and repeat until no more gaps are filled.

4.2. Prerequisite file

A prerequisite file is needed to specify the directory storing the sequence reads and the information of different libraries. An example is illustrated below.

Reads_directory: /home/E.coli/reads
frag		frag_1.fastq		frag_2.fastq		300     20     FR    1
shortjump	shortjump_1.fastq	shortjump_2.fastq	3600    298    RF    2

The first line specifies the code path, and the second line specifies the reads directory. Starting from the third line, each line describes one read library and contains 7 columns separated by spaces or tabs. Each column is explained in more detail below.

Column 1: 	Name of the library
		Each library should be designated a different name. 
Column 2 & 3: 	Fastq files for both ends
		For each read pair, one read should be in the first file and its mate in the second file. The two reads of a pair must be on the same line of their respective files.
Column 4:	Average insert size between paired reads
Column 5: 	Standard deviation of insert size between paired reads
Column 6: 	Orientation of paired reads, FR or RF
		F stands for --> orientation, and R for <-- orientation. Paired-end libraries are FR, and mate-pair libraries are RF.
Column 7: 	Whether the library is used for local assembly or for making tab files: 1, 2, or 3
		1 stands for only local assembly, 2 stands for only making tab files, and 3 for both.
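The column layout above can be checked with a small parser before launching the pipeline. This is a sketch under stated assumptions rather than RegCloser's own reader: the names `Library` and `parse_prerequisite` are hypothetical, header lines are assumed to have the form `Key: value`, and any line with exactly 7 whitespace-separated fields is treated as a library row.

```python
from typing import NamedTuple

class Library(NamedTuple):
    name: str
    fastq1: str
    fastq2: str
    insert_size: float
    insert_std: float
    orientation: str  # "FR" or "RF"
    usage: int        # 1 = local assembly, 2 = tab files, 3 = both

def parse_prerequisite(text):
    """Split a prerequisite file into header entries and library rows."""
    header, libraries = {}, []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) == 7:
            name, fq1, fq2, size, std, orient, usage = fields
            if orient not in ("FR", "RF"):
                raise ValueError(f"{name}: orientation must be FR or RF")
            if usage not in ("1", "2", "3"):
                raise ValueError(f"{name}: column 7 must be 1, 2, or 3")
            libraries.append(Library(name, fq1, fq2, float(size),
                                     float(std), orient, int(usage)))
        elif ":" in line:
            key, value = line.split(":", 1)
            header[key.strip()] = value.strip()
        else:
            raise ValueError(f"unrecognized line: {line!r}")
    if len({lib.name for lib in libraries}) != len(libraries):
        raise ValueError("library names must be unique")
    return header, libraries

EXAMPLE = """Reads_directory: /home/E.coli/reads
frag\t\tfrag_1.fastq\t\tfrag_2.fastq\t\t300     20     FR    1
shortjump\tshortjump_1.fastq\tshortjump_2.fastq\t3600    298    RF    2
"""
header, libraries = parse_prerequisite(EXAMPLE)
print(header)
for lib in libraries:
    print(lib)
```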

4.3. Basic usage

Run with 40 threads

python RunPipeline.py -p prerequisite -g draft_genome.fasta -d iter-1 -t 40

Re-run the LocalAssembly module

python RunPipeline.py -p prerequisite -g draft_genome.fasta -d iter-1 -t 40 -s LocalAssembly

Iterate over the result of RegCloser

python RunPipeline.py -p prerequisite -g iter-1/output_genome.fasta -d iter-2 -t 40
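The iterative use recommended in section 4.1 can be scripted by chaining these commands, feeding each iteration's output genome into the next. The sketch below only builds the command lines, using the options shown above and the default output filename; the function `iteration_commands` and the fixed iteration count are assumptions, since the stopping rule (no more gaps filled) has to be checked from the results of each run.

```python
def iteration_commands(prerequisite, draft_genome, n_iterations, threads=1):
    """Return one RunPipeline.py command per iteration, chained by output genome."""
    commands = []
    genome = draft_genome
    for k in range(1, n_iterations + 1):
        workdir = f"iter-{k}"
        commands.append(["python", "RunPipeline.py", "-p", prerequisite,
                         "-g", genome, "-d", workdir, "-t", str(threads)])
        # The default output name is output_genome.fasta inside the working directory.
        genome = f"{workdir}/output_genome.fasta"
    return commands

for cmd in iteration_commands("prerequisite", "draft_genome.fasta", 3, threads=40):
    print(" ".join(cmd))
```

Each command could then be executed in order, for example with `subprocess.run(cmd, check=True)`.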

4.4. Output files

The intermediate and output files are saved under the directory specified by the option -d. RegCloser outputs 4 result files, described in detail below.

output_genome.fasta saves the output genome sequence with gaps closed by RegCloser. You can specify the filename using option -o according to your preference.

gapSequence.fastq saves the assembled sequences in the gap regions and their Phred quality scores (ASCII_BASE 33). The identifier of each sequence records the gap it comes from. For example, @contig1_contig2_filled means the sequence filled the gap between contig1 and contig2; @contig2_contig3_left means the sequence extended from the left boundary of the gap between contig2 and contig3; @contig3_contig4_right means the sequence extended from the right boundary of the gap between contig3 and contig4.
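A short sketch of how these records could be interpreted: the status is taken from the last underscore-separated token of the identifier, and quality characters are decoded with the ASCII base of 33 stated above. The function names are hypothetical, and the identifier is split only at the final underscore because contig names may themselves contain underscores.

```python
def gap_status(identifier):
    """Split '@contig1_contig2_filled' into ('contig1_contig2', 'filled')."""
    name = identifier[1:] if identifier.startswith("@") else identifier
    gap, status = name.rsplit("_", 1)
    if status not in ("filled", "left", "right"):
        raise ValueError(f"unexpected status suffix: {status!r}")
    return gap, status

def phred_scores(quality, base=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - base for ch in quality]

def error_probabilities(quality, base=33):
    """Convert Phred scores Q to error probabilities 10 ** (-Q / 10)."""
    return [10 ** (-q / 10) for q in phred_scores(quality, base)]

print(gap_status("@contig2_contig3_left"))
print(phred_scores("I5"))
```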

evidence.fill records the information of the gaps in the output genome. Each line describes one gap and contains 6 columns.

Column 1: 	Left contig of a gap
Column 2:   	Right contig of a gap
Column 3 & 4:	Length of sequences extending from the left and right boundaries of a gap
		If the gap was filled, column 4 is 0, and column 3 is the length of the filling sequence.
		If column 3 is a negative value, it means the two adjacent contigs flanking the gap were merged, and column 3 tells the overlap length.
Column 5: 	Status of a gap, 0 or 1
		If the gap was filled, the flag was set to 1, otherwise 0.
Column 6: 	Current gap size 
		If the gap was filled, the value is 0.
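The six columns can be read into a small record as follows. This is a sketch with assumptions: columns are taken to be whitespace-separated integers apart from the two contig names, the function name `parse_evidence_line` is hypothetical, and the example lines are made up for illustration. A negative column 3 is reported as "merged" regardless of the status flag, since the text does not say which flag merged gaps carry.

```python
def parse_evidence_line(line):
    """Parse one line of evidence.fill into a dictionary (illustrative only)."""
    left, right, ext_left, ext_right, status, gap_size = line.split()
    ext_left, ext_right = int(ext_left), int(ext_right)
    status, gap_size = int(status), int(gap_size)
    if ext_left < 0:
        kind = "merged"   # column 3 holds minus the overlap length
    elif status == 1:
        kind = "filled"   # column 3 is the length of the filling sequence
    else:
        kind = "open"     # extended from one or both sides but not closed
    return {"left": left, "right": right, "ext_left": ext_left,
            "ext_right": ext_right, "status": status,
            "gap_size": gap_size, "kind": kind}

# Hypothetical example lines, not taken from a real run.
for line in ["contig1 contig2 350 0 1 0",
             "contig2 contig3 -25 0 1 0",
             "contig3 contig4 120 80 0 400"]:
    print(parse_evidence_line(line))
```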

statistics.txt records the statistics of the genome sequence after gap closing. It includes closed gap number, classified into merged gap number and filled gap number. It also includes total contig length, contig N50, and scaffold N50.

4.5. Command line options

Option	Type	Description
`-p`	STR	A formatted file specifying the code path, reads directory, and library information. [Prerequisite]
`-g`	STR	Draft genome, required.
`-d`	STR	Working directory saving intermediate and output files, required.
`-o`	STR	Output file saving gap-closed genome. [output_genome.fasta]
`-t`	INT	Number of threads. [1]
`-s`	STR	Starting module. [Start]
`-e`	STR	Ending module. [End]
`-rs`		Re-scaffold using SSPACE. [null]
`-f`	INT	Contigs shorter than this value will be removed from the draft genome before gap closing. [0]
`-ml`	INT	Contig ends of length insert size + ml * std bases are cut out for mapping anchored reads. [3]
`-mk`	INT	Minimum seed length in BWA mapping. [19]
`-mT`	INT	Minimum score to output in BWA mapping. [30]
`-c`	INT	Maximum number of soft-clipped bases on either end of a mapped read. [5]
`-mq`	INT	Mapped reads with mapping quality greater than this value will be identified as anchored reads. [60]
`-nr`		Do not re-map anchored reads to the whole draft genome to exclude multi-mapped reads. [null]
`-hf`		Filter out anchored reads falling in the high coverage regions. [null]
`-ra`	FLOAT	Consecutive bases with coverage higher than ra * mode coverage will be marked as high coverage regions. [1.8]
`-k`	INT	Minimum number of links to compute scaffold in SSPACE. [5]
`-a`	FLOAT	Maximum link ratio between two best contig pairs in SSPACE. [0.7]
`-sd`	INT	Default standard deviation of gap size. [100]
`-qc`	FLOAT	Maximum expected erroneous bases in the read used for local assembly. [100]
`-l`	INT	Length of the contig end sequence cut out for local assembly. [100]
`-rc`	INT	Coverage of reads used for local assembly. [100]
`-S`	FLOAT	Threshold for selective pairwise alignment. [0.3]
`-ma`	INT	Matching score in reads pairwise alignment. [1]
`-mm`	INT	Mismatch penalty in reads pairwise alignment. [20]
`-gc`	INT	Gap cost in reads pairwise alignment. [30]
`-ms`	INT	Minimum score to output in reads pairwise alignment. [20]
`-ho`	INT	Maximum admissible hanging-out length in reads pairwise alignment. [0]
`-w`		Assign initial weights for detected overlaps. [null]
`-r1`	FLOAT	Tuning constant of weight function in IRLS algorithm. [2]
`-r2`	FLOAT	Exclude samples with residuals greater than this value after the IRLS algorithm. [10]
`-mT`	INT	Maximum truncated length for alignment merging adjacent contigs. [1000]
`-mA`	INT	Matching score in alignment merging adjacent contigs. [1]
`-mM`	INT	Mismatch penalty in alignment merging adjacent contigs. [2]
`-mG`	INT	Gap cost in alignment merging adjacent contigs. [3]
`-mS`	INT	Minimum alignment score to merge adjacent contigs. [20]
`-HO`	INT	Maximum admissible hanging-out length in alignment merging adjacent contigs. [5]
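As an illustration of one of these options, the rule described for -ra (consecutive bases with coverage above ra * mode coverage) can be sketched as a single pass over a per-base coverage list. This is an assumed reading of the option description, not the HighDepth module's code; the function name `high_coverage_regions` and the half-open interval convention are hypothetical.

```python
from collections import Counter

def high_coverage_regions(coverage, ra=1.8):
    """Return half-open intervals (start, end) where coverage > ra * mode coverage."""
    if not coverage:
        return []
    mode = Counter(coverage).most_common(1)[0][0]  # most frequent depth value
    threshold = ra * mode
    regions, start = [], None
    for pos, depth in enumerate(coverage):
        if depth > threshold:
            if start is None:
                start = pos          # a high-coverage run begins here
        elif start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:            # run extends to the end of the sequence
        regions.append((start, len(coverage)))
    return regions

coverage = [10, 10, 10, 25, 30, 10, 10, 19, 10]
print(high_coverage_regions(coverage))
```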

5. Current version

The version of the current release is v1.0.

6. Contact

Please contact cao.shenghao@foxmail.com for any questions.

7. License

GNU General Public License v3.0 only

For details, please read RegCloser/license.txt.

8. Citation

Cao S, Li M, Li LM. RegCloser: a robust regression approach to closing genome gaps. BMC Bioinformatics. 2023;24(1):249. Published 2023 Jun 13. https://doi.org/10.1186/s12859-023-05367-0
