codeamt / mle-capstone-data

data preprocessing submodule for Udacity's mle nanodegree program.

Generating COVIDx Dataset

Data preprocessing submodule for Udacity's Machine Learning Engineer Nanodegree program.

Generates the latest COVIDx dataset for modeling. The dataset comes from the benchmark COVID-Net research model first presented in [1].

Repo Contents

1 directory, 6 files

Generating the COVIDx Training Set

There are 2 ways to generate the COVIDx Dataset:

The Data Pre-Processing Notebook:

The data preprocessing notebook covidnet_data_processing.ipynb in this repo includes additional steps for generating .csv labeling files for modeling.
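
As an illustration of what a .csv labeling file could look like, here is a minimal sketch. It is not the notebook's actual code: the directory layout (one subfolder per class), the column names `filename` and `label`, and the helper name `write_label_csv` are all assumptions made for this example.

```python
import csv
from pathlib import Path


def write_label_csv(image_root, csv_path):
    """Write one (filename, label) row per image and return the row count.

    Assumes a hypothetical layout: image_root/<class name>/<image file>.
    """
    root = Path(image_root)
    rows = []
    # Each subdirectory name is treated as the class label (an assumption).
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for image in sorted(class_dir.iterdir()):
            if image.is_file():
                rows.append((image.name, class_dir.name))
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["filename", "label"])  # hypothetical header
        writer.writerows(rows)
    return len(rows)
```

Calling `write_label_csv("data/train", "train_labels.csv")` (both paths are placeholders) would produce a file that a modeling notebook can load with any CSV reader.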

Setting up and Running data-cli-tool:

What you'll need:

  • Linux-based system with Python 3.7+ installed
  • And/or virtualenv installed
  • A Kaggle Authentication Key (kaggle.json file)

Running Locally (Linux):

In a terminal, clone the repo with git if you don't already have it on your system, change into the repo directory, create and activate a virtual environment, and run the Python script:

pip3 install virtualenv
git clone https://github.com/codeamt/mle-capstone-data.git
cd mle-capstone-data && virtualenv .
source bin/activate
python3 get_covidx.py --kaggle_file "/path/to/your/kaggle.json"
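
Before running the script, it can help to confirm that the file passed to --kaggle_file is readable and complete. The sketch below is a standalone check, not part of get_covidx.py. It relies only on the fact that a kaggle.json key file holds a `username` and a `key` field. The function name `load_kaggle_credentials` is hypothetical.

```python
import json
from pathlib import Path


def load_kaggle_credentials(path):
    """Read a kaggle.json file and return its (username, key) pair."""
    cred_path = Path(path).expanduser()
    if not cred_path.is_file():
        raise FileNotFoundError(f"No Kaggle credentials file at {cred_path}")
    with cred_path.open() as fh:
        creds = json.load(fh)
    # A Kaggle API token file stores both of these fields.
    missing = [field for field in ("username", "key") if not creds.get(field)]
    if missing:
        raise ValueError(f"kaggle.json is missing: {', '.join(missing)}")
    return creds["username"], creds["key"]
```

If the call raises, fix the path or download a fresh key from your Kaggle account settings before starting the pipeline.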

Be sure to upload and extract the output zip file of this pipeline phase to the environment/notebook you use for the modeling phase.
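
In the modeling environment, the extraction step can be done from Python with the standard library. This is a minimal sketch: the helper name `extract_dataset` and the archive and folder names in the usage note are placeholders, since the actual output file name depends on the pipeline.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the sorted member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # testzip() returns the first corrupt member name, or None if all are fine.
        bad = archive.testzip()
        if bad is not None:
            raise zipfile.BadZipFile(f"Corrupt member in archive: {bad}")
        archive.extractall(dest)
        return sorted(archive.namelist())
```

For example, `extract_dataset("output.zip", "data")` would unpack a hypothetical `output.zip` into a `data` folder next to the modeling notebook.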

About the Data

This set aggregates and deduplicates examples to construct COVIDxv3 from the following sources:

For notes on previous versions of the dataset, please refer to the original COVID-Net repo, which has more detailed documentation.
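
To make the idea of aggregating and deduplicating across several sources concrete, here is a small sketch. It swaps in a simple technique, content hashing, and is not the method used by this repo or by COVID-Net, whose deduplication rules are not described here. The function names and the idea of passing a list of source folders are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deduplicate(source_dirs):
    """Collect files from all source folders, keeping one path per unique content."""
    seen = {}
    for source in source_dirs:
        for path in sorted(Path(source).rglob("*")):
            if path.is_file():
                # The first file seen with a given digest wins; later copies are skipped.
                seen.setdefault(file_digest(path), path)
    return list(seen.values())
```

Content hashing only catches byte-identical copies. Images that were re-encoded or resized would need a different check, such as matching on patient or file identifiers.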

Chest Radiography Images Distribution

[1] L. Wang and A. Wong, “COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest Radiography Images,” arXiv:2003.09871 [cs, eess], Mar. 2020. [Online]. Available: http://arxiv.org/abs/2003.09871
