RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

Question


Bananavision opened this issue 2 years ago

Training on a custom dataset with torch 1.5 gives the following error:

Traceback (most recent call last):
  File "/home/usr/project/solov2/SOLO/tools/train.py", line 125, in <module>
    main()
  File "/home/usr/project/solov2/SOLO/tools/train.py", line 115, in main
    train_detector(
  File "/home/usr/project/solov2/SOLO/mmdet/apis/train.py", line 107, in train_detector
    _non_dist_train(
  File "/home/usr/project/solov2/SOLO/mmdet/apis/train.py", line 299, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/usr/project/solov2/lib/python3.9/site-packages/mmcv-0.2.16-py3.9-linux-x86_64.egg/mmcv/runner/runner.py", line 364, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/usr/project/solov2/lib/python3.9/site-packages/mmcv-0.2.16-py3.9-linux-x86_64.egg/mmcv/runner/runner.py", line 275, in train
    self.call_hook('after_train_iter')
  File "/home/usr/project/solov2/lib/python3.9/site-packages/mmcv-0.2.16-py3.9-linux-x86_64.egg/mmcv/runner/runner.py", line 231, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/usr/project/solov2/lib/python3.9/site-packages/mmcv-0.2.16-py3.9-linux-x86_64.egg/mmcv/runner/hooks/optimizer.py", line 19, in after_train_iter
    runner.outputs['loss'].backward()
  File "/home/usr/project/solov2/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/usr/project/solov2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2, 128, 336, 200]], which is output 0 of ReluBackward0, is at version 3; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
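A minimal CPU sketch that triggers the same kind of error, assuming only PyTorch (the shapes and the `reproduce_error` helper are made up for illustration). `F.relu` saves its output for the backward pass, and every in-place op on that output bumps its `_version` counter. Three `+=` updates therefore leave it at version 3 while autograd still expects version 0, which is the pattern the message above reports.

```python
import torch
import torch.nn.functional as F


def reproduce_error():
    """Return the autograd error message, or None if backward succeeds."""
    x = torch.randn(2, 4, 8, 8, requires_grad=True)
    feat = F.relu(x)  # ReluBackward0 saves `feat` itself for the backward pass
    print("version after relu:", feat._version)  # 0
    for _ in range(3):
        # Each in-place add bumps feat's version counter by one.
        feat += torch.ones_like(feat)
    print("version after 3 in-place adds:", feat._version)  # 3
    try:
        feat.sum().backward()  # ReluBackward0 finds its saved output changed
    except RuntimeError as err:
        return str(err)
    return None


print(reproduce_error())
```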

Is this a specific error for torch >= 1.5? What is the current workaround?
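Before picking a workaround it helps to locate the offending op. As far as I know, the "Hint: the backtrace further above..." wording is what PyTorch appends when anomaly detection is already enabled (otherwise the hint suggests `torch.autograd.set_detect_anomaly(True)`). In that mode, a warning with the forward-pass traceback of the op whose backward failed (here, the ReLU) is printed above the error, and the in-place write happened at that line or later. A small sketch; `locate_inplace_culprit` is a hypothetical helper:

```python
import torch
import torch.nn.functional as F


def locate_inplace_culprit():
    """Return the error message raised under anomaly detection, or None."""
    x = torch.randn(2, 4, requires_grad=True)
    # detect_anomaly() records the forward stack trace of each op, so when
    # backward fails PyTorch also warns with the trace of the failing op.
    with torch.autograd.detect_anomaly():
        feat = F.relu(x)  # the forward trace will point at this line
        feat += 1.0       # the later in-place write that invalidates it
        try:
            feat.sum().backward()
        except RuntimeError as err:
            return str(err)
    return None


print(locate_inplace_culprit())
```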

Thanks

haqishen · Answer 1 · Thu Nov 11 2021 11:58:52 GMT+0800 (China Standard Time)

Hi, I've run into the same issue.

My environment:
CUDA 11.3
PyTorch 1.10

I fixed the bug with this modification, please give it a try ;)
#204

iYold · Answer 2 · Wed Aug 31 2022 20:13:33 GMT+0800 (China Standard Time)

I think this error is specific to torch >= 1.5. I got it too; I was already using CUDA 10.1, and I fixed it by downgrading PyTorch to 1.4.
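One pattern that is often reported to start failing at PyTorch 1.5 is calling `optimizer.step()` between two `backward()` calls that share a graph. My assumption is that optimizers since then update parameters in place without going through `.data`, so autograd's version check now sees the change. The sketch below uses a hypothetical toy network (`run` is an illustrative helper) to show the failing order and the reordered one. The tensor named in the traceback above is a ReLU activation rather than a parameter, so this is probably not the cause in SOLO itself.

```python
import torch
import torch.nn as nn


def run(step_between_backwards: bool):
    """Return the error message if the second backward fails, else None."""
    torch.manual_seed(0)
    net = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1))
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    out = net(torch.randn(8, 3))
    loss1, loss2 = out.mean(), (out ** 2).mean()  # both share one graph

    opt.zero_grad()
    loss1.backward(retain_graph=True)
    if step_between_backwards:
        opt.step()  # updates weights in place, bumping their version counters
    try:
        loss2.backward()  # still needs the weights saved in the shared graph
    except RuntimeError as err:
        return str(err)
    if not step_between_backwards:
        opt.step()  # safe here: every backward through this graph is done
    return None
```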
PS: Before downgrading PyTorch, I tried many things (basically everything I could find). For instance, I changed nn.ReLU(inplace=True) to inplace=False, and I added .step after the ReLU backward, etc.
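Switching `nn.ReLU(inplace=True)` to `inplace=False` only helps when the ReLU itself overwrites a tensor that an earlier op saved for backward. The sketch below assumes a hypothetical `sigmoid -> ReLU` chain, where sigmoid saves its output. In the traceback above, however, the modified tensor is the ReLU's own output (`output 0 of ReluBackward0`), so the culprit is a later in-place write to that output, which may explain why this change alone did not help.

```python
import torch
import torch.nn as nn


def backward_ok(relu_inplace: bool) -> bool:
    """Return True if backward succeeds for the given ReLU inplace flag."""
    x = torch.randn(4, 5, requires_grad=True)
    s = torch.sigmoid(x)                  # SigmoidBackward0 saves its output `s`
    y = nn.ReLU(inplace=relu_inplace)(s)  # inplace=True overwrites `s` itself
    try:
        y.sum().backward()
        return True
    except RuntimeError:
        return False


print(backward_ok(relu_inplace=False), backward_ok(relu_inplace=True))
```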

Abhishek Agrawal · Answer 3 · Tue Feb 07 2023 23:11:55 GMT+0800 (China Standard Time)

How can I handle this without downgrading my PyTorch version?
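Without downgrading, the usual fix is to find the in-place op that writes to the ReLU output and make it out-of-place (`out = out + y` instead of `out += y`), or to `.clone()` the activation before modifying it. The `version 3` in the message is consistent with three in-place accumulations into one ReLU output, for example summing four feature levels with `+=`. The `SumOfLevels` module and `try_backward` helper below are hypothetical and only illustrate that pattern; they are not SOLO's code.

```python
import torch
import torch.nn as nn


class SumOfLevels(nn.Module):
    """Hypothetical head that sums per-level features (illustration only)."""

    def __init__(self, channels: int = 4, levels: int = 4, inplace_sum: bool = False):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
             for _ in range(levels)]
        )
        self.inplace_sum = inplace_sum

    def forward(self, feats):
        out = self.convs[0](feats[0])  # ReLU output, saved by ReluBackward0
        for conv, f in zip(self.convs[1:], feats[1:]):
            if self.inplace_sum:
                out += conv(f)        # in-place: bumps out._version, breaks backward
            else:
                out = out + conv(f)   # out-of-place: allocates a new tensor
        return out


def try_backward(inplace_sum: bool):
    """Return the error message if backward fails, else None."""
    torch.manual_seed(0)
    head = SumOfLevels(inplace_sum=inplace_sum)
    feats = [torch.randn(2, 4, 8, 8) for _ in range(4)]
    try:
        head(feats).mean().backward()
    except RuntimeError as err:
        return str(err)
    return None
```

Calling `.clone()` on the first level's output before the loop would also let `+=` work, at the cost of one extra copy of the feature map.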