Data preprocessing submodule for Udacity's Machine Learning Engineer Nanodegree program.
Generates the latest COVIDx dataset for modeling, based on the benchmark research model first presented in [1].
There are two ways to generate the COVIDx dataset:
- The data preprocessing notebook (In Jupyter or Colab)
- The command-line tool
The data preprocessing notebook covidnet_data_processing.ipynb in this repo includes additional steps for generating .csv labeling files for modeling.
- Linux-based system with Python 3.7+ installed
- virtualenv (installed via pip3 in the commands below)
- A Kaggle Authentication Key (kaggle.json file)
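Before running the tool, it can help to confirm that the key file is readable and well formed. The sketch below is only an illustration and is not part of this repo: `check_kaggle_key` is a hypothetical helper, and it assumes the standard Kaggle API token layout, a JSON object with `username` and `key` fields.

```python
import json
from pathlib import Path


def check_kaggle_key(path):
    """Return True if the file at `path` looks like a Kaggle API token.

    Hypothetical helper for illustration; not part of get_covidx.py.
    """
    key_file = Path(path).expanduser()
    if not key_file.is_file():
        return False
    try:
        creds = json.loads(key_file.read_text())
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False
    # Assumption: a Kaggle token is a JSON object with non-empty
    # "username" and "key" fields.
    return isinstance(creds, dict) and all(creds.get(k) for k in ("username", "key"))
```

Calling `check_kaggle_key("/path/to/your/kaggle.json")` returns `False` for a missing, unparsable, or incomplete file, which is cheaper to discover up front than partway through a download.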
In a terminal, clone the repo with git if you don't already have it on your system, change into the repo directory, create and activate a virtual environment, then run the Python script:
pip3 install virtualenv
git clone https://github.com/codeamt/mle-capstone-data.git
cd mle-capstone-data && virtualenv .
source bin/activate
python3 get_covidx.py --kaggle_file "/path/to/your/kaggle.json"
Be sure to upload and extract the output zip file of this pipeline phase to the environment/notebook you use for the modeling phase.
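In a notebook environment, the extraction step can be done with Python's standard library. This is a minimal sketch: `extract_dataset` is a hypothetical helper, and the archive name and destination folder are placeholders, since they depend on your own setup.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, dest_dir):
    """Extract the archive at `zip_path` into `dest_dir`.

    Hypothetical helper for illustration. Returns the list of member
    names found in the archive.
    """
    dest = Path(dest_dir)
    # Create the destination folder (and any parents) if it is missing.
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

For example, `extract_dataset("covidx.zip", "data")` would unpack a file named `covidx.zip` into a `data` folder; both names are placeholders, not the pipeline's actual output names.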
This set aggregates and deduplicates examples to construct COVIDxv3 from the following sources:
- https://github.com/ieee8023/covid-chestxray-dataset
- https://github.com/agchung/Figure1-COVID-chestxray-dataset
- https://github.com/agchung/Actualmed-COVID-chestxray-dataset
- https://www.kaggle.com/tawsifurrahman/covid19-radiography-database
- https://www.kaggle.com/c/rsna-pneumonia-detection-challenge (which came from: https://nihcc.app.box.com/v/ChestXray-NIHCC)
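The actual aggregation and deduplication logic lives in the script and notebook. As a generic illustration of the idea only, and not necessarily the method `get_covidx.py` uses, one way to drop byte-identical images when combining several source folders is to key each file on a content hash. The function name, folder arguments, and extension list below are assumptions.

```python
import hashlib
from pathlib import Path


def unique_images(source_dirs, extensions=(".png", ".jpg", ".jpeg")):
    """Collect image paths from `source_dirs`, one per distinct file content.

    Hypothetical helper for illustration; the real pipeline may
    deduplicate differently (for example, by patient or file name).
    """
    seen = {}
    for source in source_dirs:
        # Sort for a deterministic choice of which duplicate is kept.
        for path in sorted(Path(source).rglob("*")):
            if not path.is_file() or path.suffix.lower() not in extensions:
                continue
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            # Keep the first path seen for each digest; later copies are duplicates.
            seen.setdefault(digest, path)
    return list(seen.values())
```

Hashing file bytes only catches exact copies. Re-encoded or resized versions of the same radiograph would need a different comparison.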
For notes on previous versions of the dataset, refer to the more detailed documentation in the original COVID-Net repo.
[1] L. Wang and A. Wong, “COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID19 Cases from Chest Radiography Images,” arXiv:2003.09871 [cs, eess], Mar. 2020. [Online]. Available: http://arxiv.org/abs/2003.09871.