zyang-ur / onestage_grounding

A Fast and Accurate One-Stage Approach to Visual Grounding, ICCV 2019 (Oral)


Looks like it is quickly overfitting

abhinavkaul95 opened this issue

Hi @zyang-ur,

I have been trying to replicate the results of your paper. Testing works fine, and I get the same accuracy as reported in the paper.

However, at training time the training accuracy jumps to around 96-98% by the second epoch, and even after 100 epochs the validation accuracy is only around 20%.

The logs are attached for reference. Can you help me understand what is going on and how it can be resolved?

referit_train_nograd_100.txt

Hi abhinavkaul95,

There seem to be two aspects of the problem.

  1. Instead of overfitting, the difference between the training and inference accuracy comes from two different definitions of "Accu." In inference, "Accu" is the final Acc@0.5. In training, however, we assume the correct center and use "Accu" only to gauge the box-regression quality. As you can imagine, regressing a box with IoU > 0.5 is easy when the center is given, which is why the training "Accu" is so high (a small sketch contrasting the two definitions follows this list).

  2. However, that doesn't explain why the model fails to train. Could you provide/verify the related settings, e.g., the training command and other configurations? Thanks.
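To make the distinction concrete, here is a minimal illustrative sketch in Python (not the repository's actual evaluation code; the `(cx, cy, w, h)` box format and the helper names are assumptions for illustration):

```python
import torch

def xywh_to_xyxy(box):
    # Convert (cx, cy, w, h) to corner format (x1, y1, x2, y2).
    cx, cy, w, h = box.unbind(-1)
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

def iou(pred_xywh, gt_xywh):
    # Standard IoU between axis-aligned boxes.
    p, g = xywh_to_xyxy(pred_xywh), xywh_to_xyxy(gt_xywh)
    lt = torch.max(p[..., :2], g[..., :2])
    rb = torch.min(p[..., 2:], g[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_p = (p[..., 2] - p[..., 0]) * (p[..., 3] - p[..., 1])
    area_g = (g[..., 2] - g[..., 0]) * (g[..., 3] - g[..., 1])
    return inter / (area_p + area_g - inter)

def acc_at_05(pred_xywh, gt_xywh):
    # Inference "Accu": the full Acc@0.5 on the predicted box.
    return (iou(pred_xywh, gt_xywh) > 0.5).float().mean()

def training_accu(pred_wh, gt_xywh):
    # Training "Accu": assume the correct center, so the scored box shares
    # its center with the ground truth and only the regressed w/h matter.
    centered_pred = torch.cat([gt_xywh[..., :2], pred_wh], dim=-1)
    return (iou(centered_pred, gt_xywh) > 0.5).float().mean()
```

With the center given, `training_accu` only penalizes the width/height regression, so values in the 90%+ range are expected and do not indicate overfitting.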

Hi @zyang-ur,

Thanks for the speedy response, and that is a valid point regarding the high training accuracy.

For the second point, the settings and configurations are all the same as given in the code; no changes have been made. The code is being run on two GPUs on the ReferItGame dataset. The same problem occurs on the Flickr30k Entities dataset as well.

Training command: python3 train_yolo.py --data_root ./ln_data/ --dataset referit --gpu 0,1

Please let me know if you need anything else (logs, machine or GPU architecture, or specific settings) to look into why the results are not being replicated. What I need for now is to reproduce the training so that the validation accuracy becomes comparable to the testing accuracy.

Thanks a lot again!

Hi,

I would recommend first trying to reproduce with the identical setting and then locating the problem step by step. For example, see whether a single GPU with the default batch size and learning rate works. I hadn't tried multi-GPU at that time. Thanks.
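For context, one generic PyTorch consideration (not something verified against this repository's code) is that `nn.DataParallel` splits each batch across the visible GPUs, so a batch of 32 becomes 16 examples per replica and BatchNorm statistics are computed per GPU; that alone can change training dynamics relative to the single-GPU defaults. A minimal sketch with a toy model:

```python
import torch
import torch.nn as nn

# Toy model standing in for the grounding network built in train_yolo.py.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())

if torch.cuda.device_count() > 1:
    # DataParallel slices each input batch along dim 0, so a batch of 32
    # becomes 16 examples per GPU; BatchNorm then normalizes over 16, not 32.
    model = nn.DataParallel(model, device_ids=[0, 1])
model = model.cuda()

x = torch.randn(32, 3, 64, 64).cuda()  # toy batch of 32 (assumes two visible GPUs)
out = model(x)                          # each replica sees 16 examples
```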

All the settings are identical, as I mentioned earlier, including the batch size and learning rate.
Also, I am hitting a runtime CUDA out-of-memory error with a single GPU:

RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 7.80 GiB total capacity; 6.71 GiB already allocated; 8.44 MiB free; 6.77 GiB reserved in total by PyTorch)

So the only way I can run the code on my system is on two GPUs.
Also, the paper states:

We observe about 1% improvement when we use larger batch sizes on a workstation with eight P100 GPUs, but we opt to report the results of the small batch size (32) so that one can easily reproduce our results on a desktop with two GPUs.

It looks like the experiments were run on multiple GPUs, since you report those numbers. Could you please clarify? Thanks!

Sorry about the delay. Could you please try with a smaller batch size (e.g., 16 or 24) on a single GPU? Thanks.
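If memory is the blocker on a single GPU, a generic workaround (a sketch of the standard PyTorch gradient-accumulation pattern, not the actual loop in train_yolo.py; the `(inputs, targets)` batch structure and argument names are assumptions) is to run a smaller physical batch and step the optimizer every few mini-batches so the effective batch size stays at 32:

```python
import torch

def train_one_epoch_accum(model, optimizer, train_loader, loss_fn, accum_steps=2):
    # Emulate an effective batch of (physical batch * accum_steps) on one GPU,
    # e.g. physical batch 16 with accum_steps=2 behaves like batch 32.
    model.train()
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(train_loader):
        loss = loss_fn(model(inputs.cuda()), targets.cuda())
        # Scale so the accumulated gradient matches a single large batch.
        (loss / accum_steps).backward()
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```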

Hi @zyang-ur, as mentioned in my previous comment, I get a CUDA out-of-memory error when using a single GPU. I will still give a smaller batch size a shot, but on two GPUs.

If it would help in diagnosing the issue, please let me know whether we could get on a call at a time that suits you. Thanks!

Hi @zyang-ur, please let me know if there are any updates.