Roko

A deep learning based tool for consensus polishing.

Description

Roko is a consensus polisher that takes a draft assembly and aligned reads in BAM format and outputs a set of contigs in FASTA format. It uses a deep learning architecture to produce a high-quality consensus. Features are represented as reads sampled within a window, and labels are mapped to the draft assembly in a Medaka-style fashion.

Dependencies

Check HTSlib dependencies.
gcc 5.0+ and g++
python 3.6 or 3.7 (python3-dev and venv)

Installation

GPU

git clone https://github.com/lbcb-sci/roko.git roko
cd roko
make gpu

CPU

git clone https://github.com/lbcb-sci/roko.git roko
cd roko
make cpu

Usage

To activate the virtual environment:

. $PROJECT_DIR/roko/bin/activate

To generate features for model training or inference:

    python features.py [options ...] <ref> <X> <o>
        <ref>
            Draft sequence in FASTA format
        <X>
            Reads aligned to <ref> in BAM format
        <o>
            Output name (e.g. output.hdf5) 
        
        options:
            --Y
                Truth genome aligned to <ref> in BAM format (training only)
            --t 
                default: 1
                Number of worker processes

To generate the BAM files used for feature generation, the mini_align script from pomoxis is recommended.
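
The features.py invocation documented above can also be assembled from Python, which is convenient when many samples are processed in a loop. The following is a minimal sketch, not part of Roko: the helper name `build_features_cmd` and the file names are hypothetical, and only the documented `--Y` and `--t` options are used.

```python
import shlex


def build_features_cmd(ref, reads_bam, out, truth_bam=None, workers=1):
    """Hypothetical helper: build the argument list for features.py."""
    cmd = ["python", "features.py"]
    if truth_bam is not None:
        # --Y is documented as training only: truth genome aligned to <ref>.
        cmd += ["--Y", truth_bam]
    # --t is the number of worker processes (documented default: 1).
    cmd += ["--t", str(workers)]
    # Positional arguments in the documented order: <ref> <X> <o>.
    cmd += [ref, reads_bam, out]
    return cmd


# Placeholder file names, for illustration only.
cmd = build_features_cmd("draft.fasta", "reads.bam", "train.hdf5",
                         truth_bam="truth.bam", workers=4)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root, with the virtual environment activated.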

To train a model:

    python train.py [options ...] <train> <out>
        <train>
            Directory containing generated .hdf5 files used for training (or one .hdf5 file)
        <out>
            Directory for saving trained model
            
        options:
            --val
                Directory containing generated .hdf5 files used for validation (or one .hdf5 file)
            --b
                default: 128
                Batch size used for train and validation
            --memory
                default: False
                If flag is present, training and validation data are stored in RAM
            --t
                default: 0
                Number of workers per data loader (the train and validation data loaders each use --t workers)
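
As an illustration of how the training options combine, the sketch below builds a train.py command line from keyword arguments. The helper name `build_train_cmd` and the directory names are hypothetical; the flags and defaults are the ones listed above, and `--memory` is treated as a presence-only flag.

```python
import shlex


def build_train_cmd(train, out, val=None, batch_size=128,
                    in_memory=False, workers=0):
    """Hypothetical helper: build the argument list for train.py."""
    cmd = ["python", "train.py"]
    if val is not None:
        # Optional validation data (directory of .hdf5 files or one file).
        cmd += ["--val", val]
    # Batch size and loader workers, using the documented defaults.
    cmd += ["--b", str(batch_size), "--t", str(workers)]
    if in_memory:
        # Presence of the flag keeps training and validation data in RAM.
        cmd.append("--memory")
    # Positional arguments in the documented order: <train> <out>.
    cmd += [train, out]
    return cmd


# Placeholder paths, for illustration only.
cmd = build_train_cmd("features/train", "models", val="features/val",
                      in_memory=True, workers=2)
print(" ".join(shlex.quote(arg) for arg in cmd))
```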

To run inference:

    python inference.py [options ...] <data> <model> <out>
        <data>
            Path to the generated features in .hdf5
        <model>
            Path to the saved model in .pth format
        <out>
            Path to the output file (FASTA format)
            
        options:
            --t
                default: 0
                Number of workers for inference
            --b
                default: 128
                Inference batch size
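
Since inference writes the polished contigs as FASTA, a quick sanity check is to list the contig names and lengths in the output. The sketch below uses a small pure-Python parser; the function name `fasta_lengths` and the example records are hypothetical and only assume standard FASTA layout (a `>` header line followed by sequence lines).

```python
def fasta_lengths(lines):
    """Return {contig_name: sequence_length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            lengths[name] = 0
        elif name is not None:
            # Sequence may be wrapped over several lines; accumulate the length.
            lengths[name] += len(line)
    return lengths


# Made-up records for illustration; for a real file, pass an open file handle.
example = [">contig_1 polished", "ACGTAC", "GT", ">contig_2", "TTGA"]
print(fasta_lengths(example))
```

For a real run, `with open(path) as handle: fasta_lengths(handle)` gives the same summary, where `path` is the `<out>` file passed to inference.py.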

Comparison

The model was trained and tested on FASTQ basecalls from Zymo R10 Native “3 Peaks”. Data were binned using Loman's script. Draft assemblies were generated using Raven. The BAM files used for feature generation and for labeling were generated by the mini_align script from pomoxis.

The organisms used for training are B. subtilis, E. faecalis, E. coli, L. monocytogenes and S. enterica. P. aeruginosa was used for validation. The models were tested on S. aureus. Results were evaluated using the pomoxis assess_assembly script.

The mean results are given in the following table:

Model	Total error	Mismatch	Deletion	Insertion	Qscore
Raven	0.160%	0.040%	0.059%	0.061%	27.97
Medaka	0.037%	0.012%	0.007%	0.017%	34.30
HELEN	0.066%	0.019%	0.031%	0.016%	31.78
Roko	0.035%	0.013%	0.008%	0.013%	34.55

Total error may differ from the sum of the individual error types because of rounding.
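
The Qscore column can be related to the total error column with the Phred scale, Q = -10 * log10(error rate). The sketch below assumes that this is how the reported Qscore is defined; the function name `phred_qscore` is hypothetical. Because the table holds rounded means, the recomputed values agree with the Qscore column only to within a few hundredths.

```python
import math


def phred_qscore(error_percent):
    """Convert an error rate given in percent to a Phred-scaled Q-score."""
    return -10.0 * math.log10(error_percent / 100.0)


# Total error values (in percent) copied from the table above.
total_error = [("Raven", 0.160), ("Medaka", 0.037),
               ("HELEN", 0.066), ("Roko", 0.035)]

for model, err in total_error:
    print("{}: Q{:.2f}".format(model, phred_qscore(err)))
```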

Download

The model described in the Comparison section (R10, Guppy 2.3.8) can be downloaded here.

Contact information

This tool is still in an early development stage. Bug reports and questions can be sent to: dominik.stanojevic@fer.hr, mile.sikic@fer.hr or mile_sikic@gis.a-star.edu.sg.
