RamanLab / COWAVE

COVID-19 Wave dataset, based on WHO dataset and labelled as waves based on a new definition.

COWAVE: A Labelled COVID-19 Wave Dataset for Building Predictive Models

Predicting COVID-19 waves has posed a major challenge to the world. We create a dataset in which regions of waves are labelled, so that it can be used to build supervised learning classifiers. We also train a simple XGBoost model to provide a minimum standard for future classifiers trained on this dataset.

Data/

WHO-COVID-19-global-data.csv

This dataset was obtained from the WHO website (https://covid19.who.int/WHO-COVID-19-global-data.csv). The new cases data up to 20th October 2021 was used to generate the images in the "Results/Images (Cases)" directory. The new cases data up to 27th May 2022 was used to train and test all models, and to generate all other datasets.

COVID19_Dataset_v1.csv

Contains only the new cases data (from the WHO dataset), along with the country code, date of reporting and whether the day is part of a wave (1) or not (0).

COVID19_Dataset_v2.csv

Contains a list of the new cases data for all days that are part of a "wave" as per the predicting algorithm, along with the country code, date of reporting and whether the stretch (i.e., list of new cases) is part of a wave (1) or not (0).
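
The relationship between per-day labels (as in "COVID19_Dataset_v1.csv") and labelled stretches (as in "COVID19_Dataset_v2.csv") can be illustrated with a minimal sketch. This is not the authors' labeller_2() function: the function name label_stretches, the tuple layout, the "Start_date" field and the toy numbers below are all assumptions made for illustration.

```python
from itertools import groupby


def label_stretches(rows):
    """Group consecutive days with the same wave label into stretches.

    `rows` is an iterable of (country_code, date, new_cases, wave) tuples,
    assumed to be sorted by country and then by date (hypothetical layout).
    """
    stretches = []
    # Consecutive rows that share (country, label) form one stretch.
    for (country, wave), group in groupby(rows, key=lambda r: (r[0], r[3])):
        group = list(group)
        stretches.append({
            "Country_code": country,
            "Start_date": group[0][1],          # assumed field, for illustration
            "New_cases": [r[2] for r in group],  # the stretch as a list
            "Wave": wave,
        })
    return stretches


# Made-up rows, not taken from the WHO data.
rows = [
    ("IN", "2021-03-01", 120, 0),
    ("IN", "2021-03-02", 150, 0),
    ("IN", "2021-03-03", 900, 1),
    ("IN", "2021-03-04", 1400, 1),
    ("IN", "2021-03-05", 200, 0),
    ("US", "2021-03-01", 500, 1),
]
stretches = label_stretches(rows)
for stretch in stretches:
    print(stretch)
```

Grouping on the (country, label) pair keeps a stretch from spanning two countries when the rows of several countries are stored in one file.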

COVID19_Dataset_v3.csv

The column "T21" contains the new cases data for the date given in the "Date_reported" column. The columns "T1" to "T20" contain the new cases data for the 20 days prior to that date. The columns "Residual", "Seasonal" and "Trend" contain the components of the decomposed new cases time series. The packages and code used for this can be found in the file "COWAVE.py", in the "Codes/" directory. The country code, date of reporting and wave label are also present in their corresponding columns.
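
The "T1" to "T21" layout can be sketched as a sliding window over a single country's daily series. The sketch below is an illustration only and is not taken from the project code: the function name lag_windows, the oldest-first ordering of "T1" to "T20", and the toy series are assumptions, and the "Residual", "Seasonal" and "Trend" columns are not reproduced here.

```python
def lag_windows(new_cases, dates, n_lags=20):
    """Return one row per day that has `n_lags` earlier days available.

    T{n_lags + 1} holds the value on the reporting date; T1..T{n_lags} hold
    the preceding days, oldest first (the ordering is an assumption).
    """
    rows = []
    for i in range(n_lags, len(new_cases)):
        window = new_cases[i - n_lags : i + 1]  # n_lags prior days + current day
        row = {"Date_reported": dates[i]}
        for k, value in enumerate(window, start=1):
            row[f"T{k}"] = value
        rows.append(row)
    return rows


# Made-up series of 25 days: 0, 10, 20, ... (not real case counts).
cases = [10 * d for d in range(25)]
dates = [f"day-{d}" for d in range(25)]
rows = lag_windows(cases, dates)
print(len(rows), rows[0]["Date_reported"], rows[0]["T1"], rows[0]["T21"])
```

With 20 lags, the first 20 days of each series have no complete window, so a series of N days yields N - 20 rows under this scheme.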

COWAVE.csv

Contains the complete dataset generated in this project. All features present as columns are explained in detail in the accompanying paper. This dataset was generated using the file "COWAVE_gen.py", present in the "Codes/" directory.

Codes/

All code used to generate the datasets and images can be found in this directory.

Dependencies

pandas 1.4.2
numpy 1.22.3
numpy-base 1.22.3
matplotlib 3.5.1
matplotlib-base 3.5.1
scipy 1.7.3
statsmodels 0.13.2

COWAVE_gen.py

Contains the code used to generate "COWAVE.csv". Three functions are present: labeller_1(), labeller_2() and feature_gen(). labeller_1() takes a .csv file in the format of the WHO dataset. This dataset can be obtained directly from the WHO website, or can be the dataset present in the "Data/" directory (if attempting to replicate the results). This function generates the "COVID19_Dataset_v1.csv" file.

labeller_2() and feature_gen() take the output of the labeller_1() function as their input. labeller_2() generates the "COVID19_Dataset_v2.csv" file. feature_gen() generates the "COWAVE.csv" file. The "COVID19_Dataset_v3.csv" file can be generated by modifying this function to stop and return after decomposing the time series (line 334).

GraphGen_Defx.py (x = 1, 2, 3)

These files are used to generate the graphs for their corresponding definitions, as outlined in the accompanying paper (i.e., GraphGen_Def1.py was used to generate the graphs for Definition 1, etc.). To change the countries for which the plots are made, set the "countrylist" list in "GraphGen_Def1.py" and "GraphGen_Def3.py", and the "x" variable in "GraphGen_Def2.py", to the required country code(s) in the WHO dataset format.

svm.py and svm.ipynb

Files used to generate the baseline classifier metrics in "Table 1" in the corresponding paper.

xgb_bayesopt.py and XGB_BayesOpt.ipynb

Files used to generate the improved and final classifiers, and their metrics in "Table 2", "Table 3", "Table 6" and "Table 8". The hyperparameter tuning can be performed using Bayesian Optimization (from the package at https://github.com/fmfn/BayesianOptimization) or using RandomizedSearchCV. All tuning was performed using Bayesian Optimization. The hyperparameter space searched in the paper is also given in this file.

Results/Images (Cases)/

Images "2.png", "3.png", "4.png" and "5.png" were generated using "GraphGen_Def1.py".
Images "7.png", "8.png", "9.png" and "10.png" were generated using "GraphGen_Def2.py".
Images "11.png", "12.png", "13.png" and "14.png" were generated using "GraphGen_Def3.py".
Images "1.png" and "6.png" were generated by modifying "GraphGen_Def1.py" and "GraphGen_Def2.py", respectively.
