SimonAB / ViralHostPredictor

Access the corresponding paper as published in Science here.

Predicting Reservoir Hosts and Arthropod Vectors from Evolutionary Signatures in RNA Virus Genomes

Simon A. Babayan, Richard J. Orton and Daniel G. Streicker

Background

This repository contains the scripts and datasets described in Babayan et al. (2018), Science. They use gradient boosting machines to predict the reservoir hosts of RNA viruses, whether those viruses are transmitted by arthropod vectors, and the identity of those vectors.

File descriptions

Datasets:

BabayanEtAl_sequences.fasta contains coding sequences for all viruses used in the analyses

EbolaTimeSeriesData.csv contains epidemiological data and genomic features for Zaire ebolaviruses sampled during the 2014-2016 West African outbreak

BabayanEtAl_VirusData.csv contains reservoir host, arthropod-borne transmission status and vector taxa for all ssRNA viruses analyzed and features extracted from the genome of each virus
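As a rough illustration of how a genomic feature could be derived from the coding sequences, the sketch below parses FASTA records and computes dinucleotide frequencies. Dinucleotide frequency is used here only as an example feature type; this README does not list the features the authors extracted, and the record name and sequence are made up. The function names are hypothetical and not part of the repository.

```python
from collections import Counter
from itertools import product


def read_fasta(lines):
    """Parse FASTA-formatted lines into a {header: sequence} dict."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def dinucleotide_frequencies(seq):
    """Return the relative frequency of each of the 16 ACGT dinucleotides."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Ignore pairs containing ambiguous bases such as N.
    counts = Counter(p for p in pairs if set(p) <= set("ACGT"))
    total = sum(counts.values())
    return {
        "".join(d): (counts["".join(d)] / total if total else 0.0)
        for d in product("ACGT", repeat=2)
    }


# Toy input standing in for BabayanEtAl_sequences.fasta (made-up record).
example = [">virus_A", "ATGGCG", "ATGA"]
sequences = read_fasta(example)
features = dinucleotide_frequencies(sequences["virus_A"])
```

To try this on the real file, pass an open file handle to `read_fasta` instead of the toy list, for example `with open("BabayanEtAl_sequences.fasta") as fh: sequences = read_fasta(fh)`.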

R scripts:

arthropodBorne_featureSelection.R Uses gradient boosting machines in h2o to estimate average feature importances for predicting arthropod-borne transmission across different training sets

arthropodBorne_PN+selGen.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the arthropod-borne transmission status of each virus using phylogenetic neighborhoods and genomic features selected by arthropodBorne_featureSelection.R

arthropodBorne_PN.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the arthropod-borne transmission status of each virus using phylogenetic neighborhoods

arthropodBorne_selGen.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the arthropod-borne transmission status of each virus using genomic features selected by arthropodBorne_featureSelection.R

reservoir_featureSelection.R Uses gradient boosting machines in h2o to estimate average feature importances for predicting reservoir hosts across different training sets

reservoirPredict_PN+selGen.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the reservoir host of each virus using phylogenetic neighborhoods and genomic features selected by reservoir_featureSelection.R

reservoirPredict_PN.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the reservoir host of each virus using phylogenetic neighborhoods

reservoirPredict_selGen.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the reservoir host of each virus using genomic features selected by reservoir_featureSelection.R

vectorPredict_featureSelection.R Uses gradient boosting machines in h2o to estimate average feature importances for predicting arthropod vectors across different training sets

vectorPredict_PN+selGen.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the vector of each virus using phylogenetic neighborhoods and genomic features selected by vectorPredict_featureSelection.R

vectorPredict_PN.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the vector of each virus using phylogenetic neighborhoods

vectorPredict_selGen.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the vector of each virus using genomic features selected by vectorPredict_featureSelection.R
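The two recurring ideas in the R scripts above, averaging feature importances across training sets and bagging predictions, can be sketched in plain Python. This is not the authors' h2o code: it assumes that each training run yields a feature-to-importance mapping, and it reads "bagged" as averaging class probabilities over several models, which is one common interpretation. All function names, feature names, class labels, and numbers are made up for illustration.

```python
from collections import defaultdict


def average_importances(runs):
    """Average per-feature importances over several training-set runs."""
    totals = defaultdict(float)
    for run in runs:
        for feature, importance in run.items():
            totals[feature] += importance
    # Features missing from a run count as zero importance in that run.
    return {f: total / len(runs) for f, total in totals.items()}


def select_top_features(mean_importances, k):
    """Keep the k features with the highest mean importance."""
    ranked = sorted(mean_importances, key=mean_importances.get, reverse=True)
    return ranked[:k]


def bag_predictions(per_model_probs):
    """Average class probabilities across models and pick the top class."""
    summed = defaultdict(float)
    for probs in per_model_probs:
        for label, p in probs.items():
            summed[label] += p
    n = len(per_model_probs)
    averaged = {label: s / n for label, s in summed.items()}
    return max(averaged, key=averaged.get), averaged


# Made-up importances from three hypothetical training sets.
runs = [
    {"feat_a": 0.6, "feat_b": 0.3, "feat_c": 0.1},
    {"feat_a": 0.5, "feat_b": 0.3, "feat_c": 0.2},
    {"feat_a": 0.7, "feat_b": 0.2, "feat_c": 0.1},
]
selected = select_top_features(average_importances(runs), k=2)

# Made-up class probabilities for one virus from three hypothetical models.
model_outputs = [
    {"bat": 0.5, "rodent": 0.3, "primate": 0.2},
    {"bat": 0.2, "rodent": 0.6, "primate": 0.2},
    {"bat": 0.6, "rodent": 0.3, "primate": 0.1},
]
label, averaged = bag_predictions(model_outputs)
```

Averaging over runs dampens the influence of any single train/test split, which is presumably why the feature-selection scripts report average importances rather than those from one model.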

Python script:

algo_comparison.py Compares the predictive power of a variety of competing machine learning algorithms to predict reservoir hosts, arthropod-borne transmission and vector taxa from all possible genomic features
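The general shape of such a comparison can be sketched with k-fold cross-validation using only the standard library. This README does not say which algorithms or metrics algo_comparison.py uses, so the two toy classifiers below (a majority-class baseline and 1-nearest-neighbour), the accuracy metric, the helper names, and the data are all assumptions chosen for illustration.

```python
def k_fold_indices(n, k):
    """Split range(n) into k interleaved folds of near-equal size."""
    return [list(range(i, n, k)) for i in range(k)]


def majority_classifier(train_x, train_y):
    """Baseline: always predict the most frequent training label."""
    # sorted() makes tie-breaking deterministic (alphabetically first label).
    most_common = max(sorted(set(train_y)), key=train_y.count)
    return lambda x: most_common


def nearest_neighbour_classifier(train_x, train_y):
    """1-nearest-neighbour using squared Euclidean distance."""
    def predict(x):
        distances = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in train_x]
        return train_y[distances.index(min(distances))]
    return predict


def cross_validated_accuracy(make_classifier, xs, ys, k=3):
    """Mean held-out accuracy of a classifier factory over k folds."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predict = make_classifier(train_x, train_y)
        correct = sum(predict(xs[i]) == ys[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / len(scores)


# Made-up two-feature data with two well-separated classes.
xs = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9), (0.15, 0.25), (0.85, 0.75)]
ys = ["no", "yes", "no", "yes", "no", "yes"]

results = {
    "majority": cross_validated_accuracy(majority_classifier, xs, ys),
    "1-NN": cross_validated_accuracy(nearest_neighbour_classifier, xs, ys),
}
best = max(results, key=results.get)
```

Scoring every candidate on the same held-out folds keeps the comparison fair; the baseline shows how much of a score is explained by class frequencies alone.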

About

License: GNU General Public License v3.0


Languages

R 95.5%, Python 4.5%