Frankluox / LightningFSL

LightningFSL: PyTorch Lightning implementations of few-shot learning models.

Problems occurred when reimplementing COSOC

Taylorfire opened this issue · comments

commented

Hi, when I tried to reimplement COSOC, I ran into two problems:

  1. Multi-GPU training: I followed the guidance in "4. Training COSOC" and finished training the exemplar and running the COS algorithm. However, when I tried to use two TITAN V (12 GB) GPUs to run the FSL algorithm with COS, it failed with "CUDA out of memory". More precisely, training itself ran normally, but the error appeared as soon as validation started.
    I didn't modify any hyperparameters; the batch size at this stage is still 128.

  2. Training with a single GPU and a smaller batch size: given the problem above, I also tried training on a single GPU with batch size 32 (the maximum that fits). But after 36/60 epochs the validation results looked as if nothing had been learned.
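One possible reason the OOM appears only during validation (a guess, not confirmed from the repo): few-shot validation runs in episodes, and an episode with many query images can put more images through the network per forward pass than a training batch does. A back-of-the-envelope estimate of the input-tensor sizes, with all shapes and counts purely hypothetical:

```python
# Rough input-tensor memory estimate comparing a training batch with a
# few-shot validation episode batch. All numbers here are hypothetical
# illustrations, not values read from LightningFSL.

def tensor_mb(n_images, channels=3, height=84, width=84, bytes_per_elem=4):
    """Memory of a float32 image batch in MiB."""
    return n_images * channels * height * width * bytes_per_elem / 1024**2

# Training: a plain batch of 128 images.
train_batch_mb = tensor_mb(128)

# Validation: e.g. 4 episodes per batch, each 5-way with 5 support and
# 15 query images per class -> 5 * (5 + 15) = 100 images per episode.
n_way, n_shot, n_query, episodes = 5, 5, 15, 4
val_images = episodes * n_way * (n_shot + n_query)
val_batch_mb = tensor_mb(val_images)

print(f"train batch: {train_batch_mb:.1f} MiB for 128 images")
print(f"val batch:   {val_batch_mb:.1f} MiB for {val_images} images")
```

If something like this is the cause, lowering the number of validation episodes per batch or the query/shot counts shrinks validation memory without touching the training batch size.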

I would be very thankful for your reply!
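On the second problem: when a batch size tuned at 128 is cut to 32 and the learning rate is left unchanged, training often stalls. A common heuristic (the linear scaling rule; a suggestion, not something this repo prescribes, and the base value below is hypothetical) is to scale the learning rate by the same factor as the batch size:

```python
def scale_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: learning rate proportional to batch size."""
    return base_lr * new_batch_size / base_batch_size

# Hypothetical numbers: a base learning rate of 0.1 tuned for batch size 128.
new_lr = scale_lr(0.1, base_batch_size=128, new_batch_size=32)
print(new_lr)  # 0.025
```

This is only a starting point; a short sweep around the scaled value is usually still needed.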

Hi, thanks for reporting the bug! It is now fixed, please try again. (Modifications to set_config_COSOC.py and SOC.py: I changed the val shot, the learning rate, and the number of epochs, added a plugin that suppresses the warning, and fixed a bug in SOC.py.)

commented


Thanks for your helpful reply and solution! Training now runs successfully with multiple GPUs. I am waiting for a reasonable result, which may take some time; I hope you don't mind my keeping this issue open for a while.