cannot reproduce the results on CIFAR100
JungHunOh opened this issue
Dear authors,
Firstly, I would like to express my appreciation for your interesting and motivating work. Thank you for your contributions to the field.
I am writing to inquire about the code you have provided. I have attempted to reproduce the results on CIFAR-100 using it, but unfortunately I have encountered some issues. (I am unsure whether these issues also occur on other datasets.)
Specifically, there seems to be a problem during the incremental sessions.
I observed a loss explosion after session 4.
I should mention that the experiments were conducted in a Docker environment.
I am wondering if there are any problems with the current version of the code that may have caused these issues.
I attached the log files.
20230410_150211.log
20230410_160540.log
Thank you in advance for your time and assistance. I look forward to hearing back from you soon.
Hi @JungHunOh ,
Thanks for your interest in our work.
I had a brief look at your log. I noticed that you are using two GPUs:
GPU 0,1: NVIDIA GeForce RTX 2080 Ti
However, you did not change the per-GPU batch size (samples_per_gpu):
```python
data = dict(
    samples_per_gpu=64,
    workers_per_gpu=8,
    train_dataloader=dict(persistent_workers=True),
    val_dataloader=dict(persistent_workers=True),
    test_dataloader=dict(persistent_workers=True),
    ...
```
So, the total batch size will be very different, which may lead to very different results.
Please consider running the code on an 8-GPU machine to reproduce the results.
If you insist on running on a 2-GPU machine, please consider changing `samples_per_gpu=64` to `samples_per_gpu=256`.
But I want to note that this may still produce slightly different results due to subtle differences inside the PyTorch implementation.
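The arithmetic behind the suggestion can be sketched as follows. This is a minimal illustration, assuming the reference setup is 8 GPUs with `samples_per_gpu=64` (a total batch size of 512); `samples_per_gpu_for` is a hypothetical helper, not part of the repository.

```python
# Sketch: keep the total batch size constant when changing the GPU count.
# Assumption: the reference config uses 8 GPUs * samples_per_gpu=64 = 512 total.

def samples_per_gpu_for(num_gpus, total_batch_size=512):
    """Return the per-GPU batch size that preserves the total batch size."""
    if total_batch_size % num_gpus != 0:
        raise ValueError("total batch size must be divisible by num_gpus")
    return total_batch_size // num_gpus

print(samples_per_gpu_for(8))  # 64, matching the provided config
print(samples_per_gpu_for(2))  # 256, the value suggested for a 2-GPU machine
```

With the unchanged `samples_per_gpu=64` on 2 GPUs, the total batch size drops from 512 to 128, which is why the training dynamics (and losses) can diverge from the reported runs.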
Regards,
Haobo Yuan
Hi @JungHunOh ,
I would like to close the issue for now; feel free to re-open it or raise a new one if you have any other questions.
Thanks again for your interest.
Best,
Haobo Yuan
Thank you very much for your detailed answers.
My concerns are resolved.