Azure / AzureChestXRay

Intelligent disease prediction system that can help radiologists review Chest X-rays more efficiently.

Home Page: https://blogs.technet.microsoft.com/machinelearning/2018/03/07/using-microsoft-ai-to-build-a-lung-disease-prediction-model-using-chest-x-ray-images/


lower AUROCs than the author got

yoheimatt opened this issue · comments

Thank you for sharing the code with the community.
I ran the Keras version of the code.
It seems I was unable to get per-class AUROC values close to yours:

0 Atelectasis 0.689804
1 Cardiomegaly 0.699429
2 Effusion 0.769636
3 Infiltration 0.655084
4 Mass 0.601279
5 Nodule 0.571633
6 Pneumonia 0.634000
7 Pneumothorax 0.677171
8 Consolidation 0.725847
9 Edema 0.817075
10 Emphysema 0.603675
11 Fibrosis 0.660121
12 Pleural_Thickening 0.650140
13 Hernia 0.647572

How many epochs do I need to run?
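For anyone comparing against the numbers above, here is a minimal sketch of how per-class AUROC can be computed. This is the standard pairwise (Mann-Whitney) formula in pure Python, equivalent to scikit-learn's `roc_auc_score`; it is illustrative, not the repo's actual evaluation code.

```python
def auroc(labels, scores):
    """Pairwise (Mann-Whitney) AUROC: the fraction of positive/negative
    pairs where the positive example is scored higher (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")  # undefined unless both classes are present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# toy check: two positives, two negatives
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # -> 0.75
```

Running this per disease column of the label matrix reproduces a table like the one above.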

First of all, thank you for trying the code. Please feel free to add pain points; I am sure there were a few. We are also working on a streamlined version that will drop the deprecated Workbench and leverage the much more useful, recently released AML SDK.
About the classification performance issue, you should try around 200 epochs. The value used in the repo (1) is just for demo purposes. How many epochs did you use? If you are using an Azure DLVM for training, you could scale its size up to reduce time. I think on an NC12 (2 GPUs) it will take days (about 20 to 30 minutes per epoch).
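Putting the figures from that reply together gives a back-of-envelope wall-clock estimate (a rough sketch based on the quoted 20-30 minutes per epoch, not a measured benchmark):

```python
def train_hours(epochs, minutes_per_epoch):
    """Rough wall-clock estimate for a full training run."""
    return epochs * minutes_per_epoch / 60

low, high = train_hours(200, 20), train_hours(200, 30)
print(f"{low:.0f}-{high:.0f} hours")  # roughly 67-100 hours, i.e. about 3-4 days
```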

Thank you for your quick reply. I did find a small potential issue in https://github.com/Azure/AzureChestXRay/blob/master/AzureChestXRay_AMLWB/Code/src/azure_chestxray_utils.py
I think there is an underscore missing in 'Pleural Thickening'. Without it, the processing creates zero positive cases of Pleural_Thickening.
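To illustrate why the missing underscore matters: the NIH ChestX-ray14 metadata CSV writes multi-word findings with underscores and separates multiple findings with `|`, so a disease name spelled with a space never matches and that class ends up with zero positives. This is a hypothetical sketch of the label-encoding step, not the repo's exact code:

```python
def encode_labels(finding_field, disease_list):
    """One-hot encode an NIH 'Finding Labels' field against a disease list."""
    findings = finding_field.split("|")  # NIH CSV separates findings with '|'
    return [1 if d in findings else 0 for d in disease_list]

row = "Effusion|Pleural_Thickening"
print(encode_labels(row, ["Effusion", "Pleural Thickening"]))   # -> [1, 0]: the space never matches
print(encode_labels(row, ["Effusion", "Pleural_Thickening"]))   # -> [1, 1]
```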

I will follow your suggestion of running 200 epochs. To be honest, I ran out of patience and stopped the training at the 50th epoch after I didn't see much improvement. And you are right, it takes less than 30 minutes per epoch.

Hello @georgeAccnt-GH, and thank you very much for your implementation of the study! Our team has also tried to replicate your results, and while we got better results than the original poster, we still didn't reach your AUC (you report a mean of 0.84 and we get a mean of 0.81).

What happens is that around epoch 30-35, the algorithm starts overfitting, so further training becomes useless as performance on the validation/test sets just drops. We have followed the exact same steps that you implemented.

Do you think the data splits have an impact and the difference might come from there? Or is there anything else you did specifically to make the network not overfit so fast (we have also tried random crops along with the augmentation techniques used in your implementation, but that didn't help much either)?
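Given the overfitting around epoch 30-35 described above, one option worth trying (my suggestion, not something the repo is known to do) is early stopping on the validation AUROC. A minimal pure-Python sketch of the logic behind `keras.callbacks.EarlyStopping`:

```python
class EarlyStopper:
    """Stop training once the monitored validation metric has not improved
    by at least min_delta for `patience` consecutive epochs."""
    def __init__(self, patience=5, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.wait = float("-inf"), 0

    def should_stop(self, val_metric):
        if val_metric > self.best + self.min_delta:
            self.best, self.wait = val_metric, 0  # improvement: reset counter
        else:
            self.wait += 1  # no improvement this epoch
        return self.wait >= self.patience

stopper = EarlyStopper(patience=3)
for epoch, val_auroc in enumerate([0.70, 0.75, 0.78, 0.78, 0.77, 0.76, 0.75]):
    if stopper.should_stop(val_auroc):
        print(f"stopping at epoch {epoch}")  # -> stopping at epoch 5
        break
```

In Keras this corresponds to passing `EarlyStopping(monitor=..., patience=...)` in the `callbacks` list to `model.fit`, ideally together with a checkpoint that restores the best weights.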