k-taxatree is a classification workflow written in R that predicts the labels of the first four taxonomic levels (kingdom, phylum, class, order) of metagenomic data, using a multi-label Random Forest as the underlying model. The model takes 6-mer count vectors as input, so a method for determining the appropriate k-length was also implemented.
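To make the input representation concrete, the following is a minimal base-R sketch of turning one sequence into a k-mer count vector. The function names `all_kmers` and `count_kmers` are hypothetical and are not taken from the repository; the project's own helpers live in count_kmers_functions.R and may handle ambiguous bases or feature ordering differently.

```r
# Enumerate every possible k-mer over the DNA alphabet (4^k strings).
all_kmers <- function(k) {
  bases <- c("A", "C", "G", "T")
  grid <- expand.grid(rep(list(bases), k), stringsAsFactors = FALSE)
  sort(do.call(paste0, unname(as.list(grid))))
}

# Count overlapping k-mers in a single sequence and return a named
# integer vector with one entry per possible k-mer.
# Assumption: k-mers containing characters other than A, C, G, T are skipped.
count_kmers <- function(sequence, k = 6) {
  vocab <- all_kmers(k)
  sequence <- toupper(sequence)
  n <- nchar(sequence)
  if (n < k) {
    # Sequence too short to contain any k-mer: return all zeros
    return(setNames(integer(length(vocab)), vocab))
  }
  starts <- seq_len(n - k + 1)
  kmers <- substring(sequence, starts, starts + k - 1)
  # factor() maps k-mers outside the vocabulary to NA, which table() drops
  counts <- table(factor(kmers, levels = vocab))
  setNames(as.integer(counts), vocab)
}

counts <- count_kmers("ACGTACGTAC", k = 6)
print(counts[counts > 0])
```

Stacking one such vector per sequence gives a matrix with sequences as rows and k-mers as columns, which is the general shape of input a Random Forest expects.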
The project uses data from the Earth Microbiome Project, retrieved using the R package empdata. The dataset consists of 91k sequences of 150 bp length, targeting the V4 region of the 16S rRNA gene. It was split into two subsets, train-test and validation, consisting of 30% and 70% of the initial dataset, respectively. The subsets are available in the emp-data folder.
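As an illustration of a 30%/70% split, here is a minimal base-R sketch. It performs a plain random split on a toy data frame; whether the actual split in creating_train_test_validation.R is random or stratified is not stated here, so treat this as an assumption. The function name `split_dataset`, its arguments, and the toy columns are hypothetical.

```r
# Randomly split a data frame into a train-test part and a validation part.
# Assumption: a simple random split; the repository's script may stratify.
split_dataset <- function(data, train_test_fraction = 0.3, seed = 1) {
  set.seed(seed)
  n <- nrow(data)
  idx <- sample.int(n, size = round(train_test_fraction * n))
  in_train_test <- seq_len(n) %in% idx
  list(
    train_test = data[in_train_test, , drop = FALSE],
    validation = data[!in_train_test, , drop = FALSE]
  )
}

# Toy data standing in for the real sequence table
toy <- data.frame(id = 1:100, sequence = rep("ACGT", 100))
parts <- split_dataset(toy)
cat(nrow(parts$train_test), "train-test rows,",
    nrow(parts$validation), "validation rows\n")
```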
All data sets were collected from the ftp site of the Earth Microbiome Project.
Sample processing, sequencing, and core amplicon data analysis were performed by the Earth Microbiome Project (www.earthmicrobiome.org), and all amplicon sequence data and metadata have been made public through the EMP data portal (qiita.microbio.me/emp).
Please cite the following publication if you use any of them:
Thompson, L., Sanders, J., McDonald, D. et al. A communal catalogue reveals Earth’s multiscale microbial diversity. Nature 551, 457–463 (2017). https://doi.org/10.1038/nature24621
The following packages must be installed in order to run the project:
- from CRAN
install.packages(c("parallel", "data.table", "cluster", "Rfast", "plyr", "caret", "stats", "UBL", "splitstackshape", "mlr", "mldr", "dplyr", "hash", "stringr", "randomForestSRC"))
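If some of these packages are already present, a small check can avoid reinstalling them. The sketch below is one possible approach, not part of the repository; `missing_packages` is a hypothetical helper. Note that parallel and stats ship with base R, so they normally show up as already installed.

```r
# Packages listed in this README
required <- c("parallel", "data.table", "cluster", "Rfast", "plyr", "caret",
              "stats", "UBL", "splitstackshape", "mlr", "mldr", "dplyr",
              "hash", "stringr", "randomForestSRC")

# Return the subset of pkgs that is not installed in the current library paths
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```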
The project can be downloaded using git:
git clone https://github.com/BiodataAnalysisGroup/k-taxatree
The project consists of 9 main scripts in the folder R-scripts containing the code required to perform all the steps from selecting the appropriate k-length and constructing the kmer matrix to predicting the labels of the validation subset. In detail:
- 01_k_selection_tool.R: selection of the optimal k-length
- 02_kmer_matrix_creation.R: creation of the whole 6-mer matrix (4095 features)
- 03_feature_selection.R: steps to select the most informative features (340 features)
- 04_model_hp_optimization.R: stratified 10 times repeated holdout framework to determine the mtry, ntree, and predict.threshold values that achieve the highest macro F1-score
- 05_final_model.R: creation of the final model
- 06_unassigned_predictions.R: utilizing the final model to predict labels for the yet unassigned sequences of the input dataset
- 07_validation_predictions.R: utilizing the final model to make predictions for the validation subset
- count_kmers_functions.R: helper functions for creating the kmer matrix
- multilabel_functions.R: helper functions for the machine learning workflow.
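The hyperparameter search above is scored by macro F1. As a reference point, here is a minimal base-R sketch of one common definition: compute F1 separately for each label and take the unweighted mean. The function name `macro_f1` and the choice to skip labels with no positives in either truth or prediction are assumptions; the repository's multilabel_functions.R may define the metric differently.

```r
# Macro-averaged F1 for multi-label predictions.
# truth, predicted: logical matrices (rows = samples, columns = labels).
# Assumption: labels with no true and no predicted positives are excluded
# from the average.
macro_f1 <- function(truth, predicted) {
  stopifnot(all(dim(truth) == dim(predicted)))
  f1_per_label <- vapply(seq_len(ncol(truth)), function(j) {
    tp <- sum(truth[, j] & predicted[, j])   # true positives
    fp <- sum(!truth[, j] & predicted[, j])  # false positives
    fn <- sum(truth[, j] & !predicted[, j])  # false negatives
    if (2 * tp + fp + fn == 0) return(NA_real_)
    2 * tp / (2 * tp + fp + fn)
  }, numeric(1))
  mean(f1_per_label, na.rm = TRUE)
}

# Toy example with two labels and four samples
truth     <- matrix(c(TRUE, TRUE, FALSE, FALSE,
                      TRUE, FALSE, TRUE, FALSE), ncol = 2)
predicted <- matrix(c(TRUE, FALSE, FALSE, FALSE,
                      TRUE, FALSE, TRUE, TRUE), ncol = 2)
score <- macro_f1(truth, predicted)
print(score)
```

Because every label contributes equally to the mean, rare taxa influence the score as much as abundant ones, which is the usual reason for preferring macro over micro averaging on imbalanced label sets.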
The folder extra-scripts contains a few extra scripts used to:
- 0_data_for_local_usage.R: create a subset of the training-test set for local usage, used to test the rest of the workflow on a local machine
- creating_train_test_validation.R: perform the split on the initial dataset
- blast_results.R: assign taxonomies to the yet unassigned sequences of the dataset using BLAST
- compare_unassigned_blast.R: compare the k-taxatree predictions with the BLAST results on the unassigned sequences
- compare_unassigned_rdp.R: compare the RDP predictions with the BLAST results on the unassigned sequences.
The project provides the input datasets and the outputs generated in every step of the workflow. The folder emp-data includes the datasets retrieved from the Earth Microbiome Project Repository. The Output folder contains several subfolders with the outputs generated by the different scripts.
The workflow was run on an Ubuntu server with 141 GB of RAM and 32 cores and took approximately 20 days in total.
Given the optimal hyperparameters (mtry = 17, ntree = 300, predict.threshold = 0.2), the user can recreate the final model using the 05_final_model.R script. Alternatively, the model is provided in the model.rds file in the final-model folder and can be used to make predictions. The latter is fully implemented in the scripts 06_unassigned_predictions.R and 07_validation_predictions.R.
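For orientation, the sketch below shows how a saved model could be loaded and how a prediction threshold such as 0.2 might turn per-label probabilities into label assignments. This is only an illustration under stated assumptions: the relative path to model.rds, the helper name `apply_threshold`, the toy probabilities, and the label names are hypothetical, and the rule "assign a label when its probability is at least the threshold" is assumed rather than taken from the scripts. The actual prediction code is in 06_unassigned_predictions.R and 07_validation_predictions.R.

```r
# Load the provided model if it is present (path is an assumption;
# adjust it to wherever the final-model folder sits in your checkout).
model_path <- file.path("final-model", "model.rds")
if (file.exists(model_path)) {
  model <- readRDS(model_path)
}

# Convert a matrix of per-label probabilities into logical label assignments.
# Assumption: a label is assigned when its probability is >= threshold.
apply_threshold <- function(prob, threshold = 0.2) {
  prob >= threshold
}

# Toy probabilities for two sequences and two hypothetical labels
prob <- matrix(c(0.05, 0.60,
                 0.25, 0.19),
               nrow = 2,
               dimnames = list(NULL, c("Bacteria", "Archaea")))
labels <- apply_threshold(prob, threshold = 0.2)
print(labels)
```

A threshold below 0.5 makes the model more willing to assign a label, which generally trades some precision for higher recall.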
For more details of the individual scripts, please refer to the wiki.
This project is licensed under the MIT License - see the LICENSE file for details.