UserWarning: resource_tracker: There appear to be 14 leaked semaphore objects to clean up at shutdown

Question

yj373 opened this issue 2 years ago

Hello,

I am trying to train an MDEQ on the image classification task. Here is the command I used to train the image classifier:

python tools/cls_train.py --cfg experiments/cifar/cls_mdeq_TINY.yaml

Everything works fine during the pretraining stage, but when the actual training starts, I get the following message and the training terminates:

UserWarning: resource_tracker: There appear to be 14 leaked semaphore objects to clean up at shutdown

I have tried decreasing BATCH_SIZE_PER_GPU to 16, but that did not solve the issue. Can anyone help me with this problem? Thanks!
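Editor's sketch: the resource_tracker message is printed by Python's multiprocessing machinery at shutdown, so it may be a symptom of the process dying rather than the cause. One way to check, assuming a POSIX system, is to launch the training command from a small wrapper and inspect the exit status. The helper names describe_exit and run_and_report below are hypothetical and are not part of the MDEQ repository.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable summary."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


def run_and_report(cmd):
    """Run cmd (a list of arguments) and describe how it ended."""
    proc = subprocess.run(cmd)
    return describe_exit(proc.returncode)


# Hypothetical usage with the command from this post:
# print(run_and_report([sys.executable, "tools/cls_train.py",
#                       "--cfg", "experiments/cifar/cls_mdeq_TINY.yaml"]))
```

If the summary reports a signal, the warning is likely a side effect of the crash, and the traceback or system log around loss.backward() is the more useful place to look.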

yj373 · Answer 1 · Tue Feb 28 2023 01:03:08 GMT+0800 (China Standard Time)

I found that this error happens at loss.backward(), right after pretraining finishes (factor = 0.0 and deq_steps = 0). The backward_hook is called once, and the returned result['result'] is a tensor of shape [64, 30720, 1]. The model is trained on a machine with 2 GeForce RTX 2080 Ti GPUs. The configuration file I used is cls_mdeq_TINY.yaml.

Zhengyang Geng · Answer 2 · Tue Feb 28 2023 02:28:10 GMT+0800 (China Standard Time)

Hi,

Thank you for your feedback! We will release a library and a model zoo for DEQs later (with systematically designed code and verified implementations). Hopefully, this can help solve the training issues.

Before that, you might refer to DEQ-Flow's code to implement your model, or use phantom grad's code to train your MDEQ.

Please wait for our release!

Thanks!

Zhengyang

Shaojie Bai · Answer 3 · Tue Mar 07 2023 03:55:44 GMT+0800 (China Standard Time)

Hi @yj373 ,

What version of PyTorch are you using?
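Editor's sketch: a minimal way to collect the version information asked for here without importing the packages themselves. It assumes Python 3.8 or newer for importlib.metadata. The function name environment_report and the default package list are illustrative choices.

```python
import platform
from importlib import metadata


def environment_report(packages=("torch", "torchvision")):
    """Return the Python version and the installed version of each package.

    A package that is not installed is reported as None.
    """
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


# Hypothetical usage:
# for name, version in environment_report().items():
#     print(f"{name}: {version}")
```

Pasting this output into a bug report makes it easier to tell whether a problem is tied to a specific release.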

It seems that the backward hook is causing the problem. If the issue persists, I suggest reverting to the custom backward pass approach for implicit differentiation. An example is here: https://github.com/locuslab/mdeq/blob/master/lib/models/mdeq_forward_backward.py#L32
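Editor's sketch: to illustrate what a backward pass based on implicit differentiation computes, here is a toy scalar example in plain Python. It is not the MDEQ implementation and does not use PyTorch. The layer f, the weight w, and all function names are assumptions made for illustration. The forward pass finds z* with z* = f(z*, x, w). The backward pass solves the linear fixed-point equation u = (df/dz) * u + grad_out and then multiplies by df/dw.

```python
import math


def f(z, x, w):
    """Toy scalar layer whose fixed point plays the role of the DEQ output."""
    return math.tanh(w * z + x)


def solve_fixed_point(g, z0, tol=1e-13, max_iter=500):
    """Plain fixed-point iteration z <- g(z); assumes g is a contraction."""
    z = z0
    for _ in range(max_iter):
        z_next = g(z)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z


def forward(x, w):
    """Find z* such that z* = f(z*, x, w)."""
    return solve_fixed_point(lambda z: f(z, x, w), 0.0)


def backward(w, z_star, grad_out):
    """Gradient of the loss w.r.t. w, given grad_out = dLoss/dz*."""
    # At the fixed point tanh(w * z* + x) equals z*, so tanh' = 1 - z*^2.
    slope = 1.0 - z_star ** 2
    df_dz = slope * w
    df_dw = slope * z_star
    # Solve u = df_dz * u + grad_out, i.e. u = grad_out / (1 - df_dz),
    # by iteration instead of backpropagating through the forward solver.
    u = solve_fixed_point(lambda v: df_dz * v + grad_out, 0.0)
    return u * df_dw
```

Calling backward(w, forward(x, w), 1.0) returns dz*/dw, which can be checked against a finite difference of forward. In a real model the scalar products become vector-Jacobian products and the iteration is usually replaced by a faster solver.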

yj373 · Answer 4 · Tue Mar 07 2023 04:43:26 GMT+0800 (China Standard Time)

Thank you for the reply! I am using torch 1.8.1+cu101. Actually, I followed the suggestion from @Gsunshine and trained the MDEQ model using phantom grad's code, and it works fine. Thanks again!