Perturbome modeling with Tidymodels

Here I attempt to reproduce the results of "A first perturbome of Pseudomonas aeruginosa: Identification of core genes related to multiple perturbations by a machine learning approach", using the Tidymodels framework instead of caret.

The original code for the paper can be found here: Molina-Mora et al., 2021

For this paper, three models were built to identify the top genes: a Random Forest, a Support Vector Machine (SVM) and a K-nearest neighbor (KNN) model.

About variable importance

Rather than caret::varImp, Tidymodels typically relies on the vip package to calculate variable importance. vip can calculate model-specific feature importance, which is what I used for the random forest model.
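As an illustration of the model-specific route, here is a minimal sketch on simulated data. The toy data frame, the gene names and the choice of the ranger engine with impurity importance are assumptions made for this example, not the settings used in this repository.

```r
library(tidymodels)

set.seed(123)

# Hypothetical toy data: 60 samples and 4 made-up "genes".
# Only gene_a carries signal about the class.
n <- 60
toy <- data.frame(
  gene_a = rnorm(n),
  gene_b = rnorm(n),
  gene_c = rnorm(n),
  gene_d = rnorm(n)
)
toy$class <- factor(
  ifelse(toy$gene_a + rnorm(n, sd = 0.3) > 0, "Perturbation", "Control")
)

# ranger only stores importance scores if asked to at fit time.
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_fit <- fit(rf_spec, class ~ ., data = toy)

# vip::vi() reads the importance stored in the underlying ranger object
# and returns a tibble with Variable and Importance columns.
rf_importance <- vip::vi(extract_fit_engine(rf_fit))
rf_importance
```

The importance type has to be set in set_engine() because ranger does not compute it by default.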

For models such as the SVM and KNN, vip also allows model-agnostic calculations. Here I used FIRM, based on Greenwell et al. (2018).

The large number of features in this dataset can make these calculations slow.
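To show the idea behind the model-agnostic score and why it gets expensive, here is a hand-rolled sketch: for each feature, compute a partial dependence curve and take its standard deviation as the importance. This is a simplified version on the probability scale, not vip's own implementation, and the toy data, the partial_dependence() helper and the grid size are assumptions made for the example.

```r
library(tidymodels)

set.seed(123)

# Hypothetical toy data: only gene_a carries signal about the class.
n <- 60
toy <- data.frame(gene_a = rnorm(n), gene_b = rnorm(n), gene_c = rnorm(n))
toy$class <- factor(
  ifelse(toy$gene_a + rnorm(n, sd = 0.3) > 0, "Perturbation", "Control")
)

knn_fit <- nearest_neighbor(neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification") %>%
  fit(class ~ ., data = toy)

features <- setdiff(names(toy), "class")
predictors <- toy[features]

# Hypothetical helper: average predicted P(Perturbation) while one feature
# is fixed at each value of a grid and the other features keep their values.
partial_dependence <- function(model, data, feature, grid_size = 10) {
  grid <- seq(min(data[[feature]]), max(data[[feature]]), length.out = grid_size)
  sapply(grid, function(value) {
    modified <- data
    modified[[feature]] <- value
    mean(predict(model, new_data = modified, type = "prob")$.pred_Perturbation)
  })
}

# A flat partial dependence curve (low standard deviation) means the
# feature has little effect on the predictions.
firm_scores <- sapply(features, function(f) {
  sd(partial_dependence(knn_fit, predictors, f))
})
sort(firm_scores, decreasing = TRUE)
```

Each feature needs grid_size rounds of prediction over the whole dataset, so the cost grows linearly with the number of features.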

Single partition:

Random Forest model

Support Vector Machine model

accuracy: 0.773
roc_auc: 0.906

The SVM seems more likely to predict "Perturbation".

K-nearest neighbor model

accuracy: 0.773
roc_auc: 0.739
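
Metrics like the accuracy and roc_auc values above can be computed on a single partition roughly as follows. This is a sketch on simulated data: the toy tibble, the 75/25 split and the number of neighbors are assumptions for the example, not the settings behind the reported numbers.

```r
library(tidymodels)

set.seed(123)

# Hypothetical toy data: only gene_a carries signal about the class.
n <- 100
toy <- tibble(gene_a = rnorm(n), gene_b = rnorm(n), gene_c = rnorm(n)) %>%
  mutate(
    class = factor(ifelse(gene_a + rnorm(n, sd = 0.5) > 0, "Perturbation", "Control"))
  )

# One stratified train/test partition.
data_split <- initial_split(toy, prop = 0.75, strata = class)
train_data <- training(data_split)
test_data <- testing(data_split)

knn_fit <- nearest_neighbor(neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification") %>%
  fit(class ~ ., data = train_data)

# Attach hard class predictions and class probabilities to the test set.
predictions <- bind_cols(
  test_data,
  predict(knn_fit, new_data = test_data),
  predict(knn_fit, new_data = test_data, type = "prob")
)

knn_accuracy <- accuracy(predictions, truth = class, estimate = .pred_class)

# "Perturbation" is the second factor level, so it is marked as the event.
knn_auc <- roc_auc(
  predictions,
  truth = class,
  .pred_Perturbation,
  event_level = "second"
)

bind_rows(knn_accuracy, knn_auc)
```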

Multiple partition Random Forest

The multiple partition method is executed with classid_ind_tidy.R. For now, it only uses the Random Forest model with 6 replicas (the paper uses 100 replicas).
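
One way to run repeated random partitions in Tidymodels is Monte Carlo cross-validation with rsample::mc_cv(). The sketch below assumes that each "replica" is an independent random train/test split. The toy data, split proportion and number of trees are illustrative, and this is not necessarily how classid_ind_tidy.R is written.

```r
library(tidymodels)

set.seed(123)

# Hypothetical toy data: only gene_a carries signal about the class.
n <- 100
toy <- tibble(gene_a = rnorm(n), gene_b = rnorm(n), gene_c = rnorm(n)) %>%
  mutate(
    class = factor(ifelse(gene_a + rnorm(n, sd = 0.5) > 0, "Perturbation", "Control"))
  )

# Six independent random 75/25 partitions (the "replicas").
partitions <- mc_cv(toy, prop = 0.75, times = 6, strata = class)

rf_spec <- rand_forest(trees = 200) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_workflow <- workflow() %>%
  add_formula(class ~ .) %>%
  add_model(rf_spec)

# Fit on each training portion and evaluate on the matching held-out portion.
rf_results <- fit_resamples(
  rf_workflow,
  resamples = partitions,
  metrics = metric_set(accuracy, roc_auc)
)

# Mean and standard error of each metric across the six partitions.
rf_summary <- collect_metrics(rf_results)
rf_summary
```

Raising times to 100 would match the number of replicas used in the paper, at a proportionally higher run time.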
