Using different dataset for negative class
imanpalsingh opened this issue · comments
Goal
I'm trying to train a classifier to label an X-ray covid-19 positive or negative.
What I have tried
- Used an architecture as defined in this research
- Used X-ray images with AP, PA, and AP Supine views
- Used images marked 'COVID-19' in the finding column as positive images
- Used images marked 'No Finding' in the finding column as negative images
Results
Because of the large number of positive images, the model gives ~96%, possibly by predicting the same class all the time (even after using augmentation).
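A quick way to check whether a score like this only reflects class imbalance is to compare it with the accuracy of a constant "always predict the majority class" baseline, and to look at balanced accuracy (the mean of per-class recall). The sketch below is a minimal illustration in plain Python; the label counts are hypothetical and chosen only to show the effect, not taken from the dataset.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical split: 96 positive (1) and 4 negative (0) images.
y_true = [1] * 96 + [0] * 4
# A degenerate model that predicts "positive" for every image.
y_pred = [1] * len(y_true)

baseline = majority_baseline_accuracy(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"majority baseline accuracy: {baseline:.2f}")
print(f"balanced accuracy of constant model: {bal_acc:.2f}")
```

If the trained model's accuracy is close to the majority baseline and its balanced accuracy is near 0.5, it is probably not separating the classes at all.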
Next steps
To get more healthy images I have decided to use this Kaggle dataset.
My question: is it okay to combine two datasets drawn from different distributions for this classification task? Also, is my approach to the classification flawed in any way?
Using that Kaggle dataset as the negative-example dataset is a very bad idea: all of its images are of children, while this dataset is mostly adults, so your model will likely learn to predict age rather than pathology.
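One rough way to see how exposed a combined dataset is to this kind of shortcut is to measure how well the label can be guessed from the image's source alone. The sketch below is only an illustration: the source names (`covid_repo`, `kaggle_pediatric`) and the counts are hypothetical, and the function is a helper made up for this example rather than part of any library.

```python
from collections import Counter, defaultdict


def label_predictability_from_source(records):
    """Fraction of labels recovered by guessing the majority label per source.

    `records` is a list of (source, label) pairs. A value near 1.0 means the
    source alone almost determines the label, so anything that differs between
    sources (patient age, scanner, image format) can act as a shortcut.
    """
    by_source = defaultdict(Counter)
    for source, label in records:
        by_source[source][label] += 1
    correct = sum(max(counts.values()) for counts in by_source.values())
    return correct / len(records)


# Hypothetical combined dataset: positives and a few negatives from one
# source, and all remaining negatives from a second, pediatric source.
records = (
    [("covid_repo", "positive")] * 80
    + [("covid_repo", "negative")] * 5
    + [("kaggle_pediatric", "negative")] * 100
)

score = label_predictability_from_source(records)
print(f"label predictable from source alone: {score:.3f}")
```

A high score does not prove the model uses the shortcut, but it means a held-out test set drawn from the same mixture cannot rule it out.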
I would check out the RSNA Pneumonia challenge dataset.
There are two papers linked at the top of the repo that are related to this issue.
I also suggest you read our paper about this dataset which discusses the possible tasks and their clinical value: https://arxiv.org/abs/2006.11988