Using different dataset for negative class

Question

imanpalsingh opened this issue 4 years ago · comments

Imanpal Singh commented 4 years ago

Goal

I'm trying to train a classifier to label an X-ray as COVID-19 positive or negative.

What I have tried

Used the architecture defined in this research
Used X-ray images with AP, PA, and AP Supine views
Used images marked 'COVID-19' in the finding column as positive images
Used images marked 'No Finding' in the finding column as negative images
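
The selection described in the list above could be sketched roughly as follows. This is only a minimal sketch: it assumes the metadata is a CSV with `view` and `finding` columns whose values match the strings quoted above exactly, and the sample rows, the `filename` column, and the `select_images` helper are hypothetical.

```python
import csv
import io

# Hypothetical sample rows; a real metadata file would have more columns.
SAMPLE_CSV = """filename,view,finding
a.png,PA,COVID-19
b.png,AP,No Finding
c.png,L,COVID-19
d.png,AP Supine,COVID-19
e.png,PA,Pneumonia
"""

ALLOWED_VIEWS = {"AP", "PA", "AP Supine"}
LABELS = {"COVID-19": 1, "No Finding": 0}  # 1 = positive, 0 = negative


def select_images(csv_file):
    """Return (filename, label) pairs for allowed views with a usable finding."""
    selected = []
    for row in csv.DictReader(csv_file):
        if row["view"] in ALLOWED_VIEWS and row["finding"] in LABELS:
            selected.append((row["filename"], LABELS[row["finding"]]))
    return selected


pairs = select_images(io.StringIO(SAMPLE_CSV))
print(pairs)
```

Rows with any other view or finding are dropped rather than relabelled, so counting the selected pairs per label is a quick way to see how imbalanced the two classes are.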

Results

Because positive images far outnumber negative ones, the model reaches ~96% accuracy, possibly by predicting the same class every time (even after using augmentation).
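
One way to check whether a score like this comes from always predicting one class is to compare plain accuracy with balanced accuracy (the mean of per-class recall). The sketch below uses a hypothetical 96/4 split and a constant majority-class predictor; the counts and helper names are illustrative, not taken from the actual experiment.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; a constant predictor scores 1/num_classes."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced split: 96 positives, 4 negatives.
y_true = [1] * 96 + [0] * 4

# Constant predictor that always outputs the majority class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

print(accuracy(y_true, y_pred))           # 0.96
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

If the trained model's balanced accuracy is close to 0.5 while its plain accuracy is high, it is probably behaving like this constant predictor.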

Next steps

To get more healthy images, I have decided to use this Kaggle dataset.

My question is: is it okay to use datasets drawn from two different distributions for this classification task? Also, is my approach to the classification flawed in any way?

Joseph Paul Cohen · Answer 1 · Thu Nov 05 2020 22:00:01 GMT+0800 (China Standard Time)

Using that Kaggle dataset as the negative-example dataset is a very bad idea because all of its images are of children, while this dataset is mostly adults, so your model will likely learn to predict age rather than the pathology.
I would check out the RSNA Pneumonia challenge dataset.
There are two papers linked at the top of the repo that are related to this issue.
I also suggest you read our paper about this dataset which discusses the possible tasks and their clinical value: https://arxiv.org/abs/2006.11988
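
To illustrate the age confound described above, a rough sanity check is to see how well patient age alone separates the two classes. This is a minimal sketch that assumes an age value is available for each image; the ages, labels, and the `best_age_threshold` helper are hypothetical. If a single age cutoff already classifies almost perfectly, an image model can exploit the same shortcut.

```python
def best_age_threshold(ages, labels):
    """Return (threshold, accuracy) of the best rule 'age >= threshold -> positive'."""
    best = (None, 0.0)
    for threshold in sorted(set(ages)):
        preds = [1 if a >= threshold else 0 for a in ages]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best[1]:
            best = (threshold, acc)
    return best


# Hypothetical data: adult positives from one source, paediatric negatives from another.
ages = [45, 60, 72, 38, 55, 3, 5, 2, 7, 4]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

threshold, acc = best_age_threshold(ages, labels)
print(threshold, acc)  # 38 1.0
```

A cutoff that scores far above chance suggests the negative set differs from the positive set in more than the pathology, which is the situation to avoid when combining datasets.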