Caribou

Alignment-free bacterial identification and classification in metagenomics sequencing data using machine learning.

Proof of Concept

The Jupyter notebook workflow_example.ipynb shows the workflow and its output using example data. The steps are labelled in the notebook to make the workflow easier to follow.

Data used in the workflow_example.ipynb is located in the example_data/ folder.

This data was also used to test and debug the Caribou analysis pipeline.

Installation

The Caribou analysis pipeline was developed in Python 3 and can be easily installed through the Python wheel. The repository must be cloned first; the package can then be installed by running the following commands in the desired folder:

git clone https://github.com/bioinfoUQAM/Caribou.git
pip install path/to/Caribou/

Dependencies

The Caribou analysis pipeline is packaged with executables for all dependencies that cannot be installed through the Python wheel. These dependencies are:

KronaTools

[Recommended] GPU acceleration

Training machine learning models, especially neural networks, can be accelerated by using a GPU. Using one is strongly recommended should the user want to retrain a model.

To install GPU dependencies on your machine, refer to the following tutorials for installation:

[Recommended] Python virtual environment

It is recommended to run the analysis pipeline in a virtual environment so that no other installed package can interfere.
Here is an example of Linux shell commands that install Caribou in a new virtual environment; modify the paths to match your setup:

python3 -m venv /path/to/your/environment/folder

source /path/to/your/environment/folder/bin/activate

pip install --no-index --upgrade pip

pip install /path/to/downloaded/Caribou/folder

To access this virtual environment later on, the user only needs to run the following command:

source /path/to/your/environment/folder/bin/activate
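To confirm that the environment is really active before launching an analysis, a small check such as the following can be run. This is a minimal sketch that is not part of Caribou; the function name `in_virtual_environment` is hypothetical, and the check relies only on the standard behaviour of environments created with `python3 -m venv`.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a venv-created environment."""
    # Inside a venv, sys.prefix points at the environment folder while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtual_environment())
```

If this prints `False` after running the `source` command above, the shell is probably still using the system interpreter.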

Building database

Caribou was developed with the intent that the models be trained on the GTDB taxonomy database.
Theoretically, any database could be used to train and classify with Caribou, but the files fed to the program must follow a specific structure. The structure of the database files required for training is explained in more detail in the database section of the wiki.

Building GTDB database

Should the user want to build the training database from the GTDB taxonomy, this can be done using the template script, which merges the data into one large fasta file and extracts the classes into a csv file. The user must modify this template to insert file paths and to comment out the host section if no host is used.

The modified template can be submitted to an HPC cluster managed by Slurm (e.g. Compute Canada) using the following command:

sbatch Caribou/data/build_data_scripts/template_slurm_datagen.sh

The modified template can also be run in a Linux shell using the following command:

sh Caribou/data/build_data_scripts/template_slurm_datagen.sh

Finally, each script used by the template can be run on its own in a Linux shell using the following commands:

# Generate a list of all fastas to be merged
sh Caribou/data/build_data_scripts/generateFastaList.sh -d [directory] -o [outputFile]

# Extract classes for each bacterial genome fasta using the GTDB taxonomy
sh Caribou/data/build_data_scripts/fasta2class_bact.sh -d [directory] -i [inputFile] -c [classesFile] -o [outputDirectory]

# Extract classes for each host fasta
sh Caribou/data/build_data_scripts/fasta2class_host.sh -d [directory] -i [inputFile] -o [outputDirectory]
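For readers who want to see what the first of these steps amounts to, here is a minimal Python sketch of building a list of fasta files found under a directory. It is an illustration only and does not reproduce generateFastaList.sh: the function name `list_fasta_files`, the set of file extensions, and the one-path-per-line output format are all assumptions.

```python
from pathlib import Path

# Assumed file extensions; the real script may match a different set.
FASTA_SUFFIXES = (".fa", ".fna", ".fasta")


def list_fasta_files(directory, output_file):
    """Write the path of every fasta file under `directory` to `output_file`.

    Paths are written one per line (an assumed format) and also returned.
    """
    paths = sorted(
        str(p)
        for p in Path(directory).rglob("*")
        if p.is_file() and p.name.lower().endswith(FASTA_SUFFIXES)
    )
    Path(output_file).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

The search is recursive, so genomes stored in nested folders are picked up as well.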

Usage

The Caribou analysis pipeline requires only a configuration file to be executed.
All the information required by the program is located in this configuration file and is described in the wiki.
A template config file can be found at Caribou/configs/template_config.ini.

Once the installation is done and the configuration file is ready, the following command can be used to launch the pipeline:

Caribou_pipeline.py -c path/to/your/config.ini
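Before launching a long run, it can help to check that the configuration file exists and parses as an .ini file. The sketch below uses Python's standard `configparser` module and simply lists whatever sections and options it finds; the function name `summarize_config` is hypothetical, and no Caribou-specific section or option names are assumed (those are documented in the wiki).

```python
import configparser
from pathlib import Path


def summarize_config(path):
    """Return {section: [option names]} for an .ini file.

    Raises FileNotFoundError if the file does not exist.
    """
    config_path = Path(path)
    if not config_path.is_file():
        raise FileNotFoundError(f"configuration file not found: {config_path}")
    parser = configparser.ConfigParser()
    parser.read(config_path)
    # configparser lowercases option names by default.
    return {section: list(parser[section]) for section in parser.sections()}


if __name__ == "__main__":
    import sys

    for section, options in summarize_config(sys.argv[1]).items():
        print(f"[{section}] {', '.join(options)}")
```

This only validates the file syntax; it does not check that the values are ones the pipeline accepts.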

Partial analysis scripts

Scripts for individual steps of the pipeline are also available should the user want to run only part of the analysis.

Caribou_pipeline.py

This script runs the entire Caribou analysis pipeline.
usage: Caribou_pipeline.py [-c CONFIG_FILE]

Caribou_kmers.py

This script extracts K-mers from the given dataset using the resources available on the computer, then saves them to disk.
usage: Caribou_kmers.py [-h] [-s SEQ_FILE] [-c CLS_FILE] [-dt DATASET_NAME] [-sh SEQ_FILE_HOST] [-ch CLS_FILE_HOST] [-dh HOST_NAME] -k K_LENGTH [-l KMERS_LIST] -o OUTDIR
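As a reminder of what K-mer extraction means, the sketch below counts every overlapping substring of length k in a single nucleotide sequence. It is a generic illustration, not Caribou's implementation: the function name `count_kmers` is hypothetical, and skipping windows that contain characters other than A, C, G, and T is an assumption made for this example.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a nucleotide sequence."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Assumption: skip windows containing ambiguous bases such as N.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts
```

For example, `count_kmers("ACGTAC", 2)` counts AC twice and CG, GT, and TA once each. The K_LENGTH argument of the script plays the role of `k` here.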

Caribou_extraction.py

This script trains a model and extracts bacteria / host sequences.
usage: Caribou_extraction.py [-h] -db DATA_BACTERIA [-dh DATA_HOST] -mg DATA_METAGENOME -dt DATABASE_NAME [-ds HOST_NAME] -mn METAGENOME_NAME [-model {None,onesvm,linearsvm,attention,lstm,deeplstm}] [-bs BATCH_SIZE] [-e TRAINING_EPOCHS] [-v] -o OUTDIR [-wd WORKDIR]

Caribou_classification.py

This script trains a model and classifies bacteria sequences iteratively over known taxonomic levels.
usage: Caribou_classification.py [-h] -db DATA_BACTERIA -mg DATA_METAGENOME -dt DATABASE_NAME -mn METAGENOME_NAME [-model {sgd,mnb,lstm_attention,cnn,widecnn}] [-t TAXA] [-bs BATCH_SIZE] [-e TRAINING_EPOCHS] [-v] -o OUTDIR [-wd WORKDIR]

Caribou_outputs.py

This script produces outputs from the data classified by Caribou.
usage: Caribou_outputs.py [-h] -db DATA_BACTERIA -clf CLASSIFIED_DATA -model {sgd,mnb,lstm_attention,cnn,widecnn} -dt DATASET_NAME [-ds HOST_NAME] [-a] [-k] [-r] [-f]

Caribou_extraction_train_cv.py

This script trains and cross-validates a model for the bacteria extraction / host removal step.
usage: Caribou_extraction_train_cv.py [-h] -db DATA_BACTERIA [-dh DATA_HOST] -dt DATABASE_NAME [-ds HOST_NAME] [-model {None,onesvm,linearsvm,attention,lstm,deeplstm}] [-bs BATCH_SIZE] [-e TRAINING_EPOCHS] [-v] -o OUTDIR [-wd WORKDIR]

Caribou_classification_train_cv.py

This script trains and cross-validates a model for the bacteria classification step.
usage: Caribou_classification_train_cv.py [-h] -db DATA_BACTERIA -dt DATABASE_NAME [-model {sgd,mnb,lstm_attention,cnn,widecnn}] [-bs BATCH_SIZE] [-e TRAINING_EPOCHS] [-v] -o OUTDIR [-wd WORKDIR]
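For readers unfamiliar with cross-validation, the sketch below shows the plain k-fold scheme: the samples are split into folds, and each fold serves once as the test set while the rest are used for training. This is a generic illustration only; the function name `k_fold_indices` is hypothetical, and the README does not state which cross-validation scheme or number of folds the Caribou scripts use.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists for plain k-fold cross-validation."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop
```

In practice the samples are usually shuffled before splitting so that each fold contains a mix of classes.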
