RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
Zeqiang-Lai opened this issue
PyTorch: 1.11
CUDA Toolkit: 11.3.1
The error occurs when running this command:
python tools/train.py configs/raft/raft_8x2_50k_kitti2015_288x960.py
Complete stack trace:
/home/laizeqiang/miniconda3/envs/openmmlab/lib/python3.7/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/native/TensorShape.cpp:2228.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
/home/laizeqiang/miniconda3/envs/openmmlab/lib/python3.7/site-packages/torch/autograd/__init__.py:175: UserWarning: Error detected in ReluBackward0. Traceback of forward call that caused the error:
File "tools/train.py", line 209, in <module>
main()
File "tools/train.py", line 205, in main
meta=meta)
File "/media/exthdd/laizeqiang/lzq/projects/misc/mmflow/mmflow/apis/train.py", line 238, in train_model
runner.run(data_loaders, cfg.workflow)
File "/home/laizeqiang/miniconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
iter_runner(iter_loaders[i], **kwargs)
File "/home/laizeqiang/miniconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 61, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/home/laizeqiang/miniconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 75, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/media/exthdd/laizeqiang/lzq/projects/misc/mmflow/mmflow/models/flow_estimators/base.py", line 90, in train_step
losses = self(**data, test_mode=False)
File "/home/laizeqiang/miniconda3/envs/openmmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/media/exthdd/laizeqiang/lzq/projects/misc/mmflow/mmflow/models/flow_estimators/base.py", line 59, in forward
return self.forward_train(*args, **kwargs)
File "/media/exthdd/laizeqiang/lzq/projects/misc/mmflow/mmflow/models/flow_estimators/raft.py", line 107, in forward_train
feat1, feat2, h_feat, cxt_feat = self.extract_feat(imgs)
File "/media/exthdd/laizeqiang/lzq/projects/misc/mmflow/mmflow/models/flow_estimators/raft.py", line 74, in extract_feat
cxt_feat = self.context(img1)
File "/home/laizeqiang/miniconda3/envs/openmmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/media/exthdd/laizeqiang/lzq/projects/misc/mmflow/mmflow/models/encoders/raft_encoder.py", line 296, in forward
x = res_layer(x)
File "/home/laizeqiang/miniconda3/envs/openmmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/laizeqiang/miniconda3/envs/openmmlab/lib/python3.7/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/home/laizeqiang/miniconda3/envs/openmmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/media/exthdd/laizeqiang/lzq/projects/misc/mmflow/mmflow/models/utils/res_layer.py", line 88, in forward
out = _inner_forward(x)
File "/media/exthdd/laizeqiang/lzq/projects/misc/mmflow/mmflow/models/utils/res_layer.py", line 76, in _inner_forward
out = self.relu(out)
File "/home/laizeqiang/miniconda3/envs/openmmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/laizeqiang/miniconda3/envs/openmmlab/lib/python3.7/site-packages/torch/nn/modules/activation.py", line 98, in forward
return F.relu(input, inplace=self.inplace)
File "/home/laizeqiang/miniconda3/envs/openmmlab/lib/python3.7/site-packages/torch/nn/functional.py", line 1442, in relu
result = torch.relu(input)
(Triggered internally at /opt/conda/conda-bld/pytorch_1646755953518/work/torch/csrc/autograd/python_anomaly_mode.cpp:104.)
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "tools/train.py", line 209, in <module>
main()
File "tools/train.py", line 205, in main
meta=meta)
File "/media/exthdd/laizeqiang/lzq/projects/misc/mmflow/mmflow/apis/train.py", line 238, in train_model
runner.run(data_loaders, cfg.workflow)
File "/home/laizeqiang/miniconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
iter_runner(iter_loaders[i], **kwargs)
File "/home/laizeqiang/miniconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 67, in train
self.call_hook('after_train_iter')
File "/home/laizeqiang/miniconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
getattr(hook, fn_name)(self)
File "/home/laizeqiang/miniconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 56, in after_train_iter
runner.outputs['loss'].backward()
File "/home/laizeqiang/miniconda3/envs/openmmlab/lib/python3.7/site-packages/torch/_tensor.py", line 363, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/laizeqiang/miniconda3/envs/openmmlab/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2, 128, 36, 120]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
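The error means that autograd saved a tensor during the forward pass (here the output of a ReLU, which `ReluBackward0` needs to compute its gradient) and that tensor's version counter was bumped by an in-place operation before `backward()` ran. The following is a minimal, self-contained sketch that triggers the same class of error on toy tensors; it is not mmflow's code, and `reproduce_inplace_error` is a hypothetical helper. The optional anomaly mode is the mechanism behind the "Error detected in ReluBackward0. Traceback of forward call" warning seen in the log.

```python
import contextlib

import torch


def reproduce_inplace_error(use_anomaly_mode: bool = False):
    """Return the RuntimeError message raised when a saved ReLU output is
    modified in place before backward(), or None if backward succeeds."""
    # detect_anomaly() records the forward traceback of each op, so a failing
    # backward node also reports where it was created in the forward pass.
    ctx = torch.autograd.detect_anomaly() if use_anomaly_mode else contextlib.nullcontext()
    with ctx:
        x = torch.randn(2, 3, requires_grad=True)
        y = torch.relu(x)        # ReluBackward0 saves its output y for backward
        assert y._version == 0   # version counter right after the forward op
        y += 1                   # in-place add bumps y's version counter to 1
        assert y._version == 1
        try:
            y.sum().backward()   # ReluBackward0 finds y at version 1, expected 0
        except RuntimeError as err:
            return str(err)
    return None
```

Calling `reproduce_inplace_error()` should return a message of the same shape as the one above ("... modified by an inplace operation ... is at version 1; expected version 0 instead."), just with a `[2, 3]` CPU tensor instead of the `[2, 128, 36, 120]` CUDA one.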
Hi, I am struggling with exactly the same issue. How did you solve it?
I haven't solved it yet. The error seems to be caused by an in-place ReLU. I tried replacing every in-place ReLU with a normal (non-in-place) ReLU, but the same error is still reported.
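That outcome is consistent with the log: the anomaly traceback ends in `result = torch.relu(input)`, the non-in-place branch of `F.relu`, so the ReLU itself was already out-of-place. What the error complains about is a later in-place write to that ReLU's *output* ("changed in there or anywhere later"). A hedged sketch with a hypothetical `ToyBlock` (not mmflow's `res_layer.py`) shows that a non-in-place ReLU still fails when a subsequent op modifies its output in place, and that making that later op out-of-place fixes it:

```python
import torch
from torch import nn


class ToyBlock(nn.Module):
    """Hypothetical block: conv -> non-in-place ReLU -> scaling step."""

    def __init__(self, late_inplace: bool):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=False)  # its output is saved for backward
        self.late_inplace = late_inplace

    def forward(self, x):
        out = self.relu(self.conv(x))
        if self.late_inplace:
            out.mul_(2)    # in-place write to the saved ReLU output: backward fails
        else:
            out = out * 2  # out-of-place: new tensor, saved ReLU output untouched
        return out


def backward_ok(late_inplace: bool) -> bool:
    """Run one forward/backward pass and report whether backward succeeded."""
    torch.manual_seed(0)
    block = ToyBlock(late_inplace)
    loss = block(torch.randn(1, 3, 8, 8)).sum()
    try:
        loss.backward()
        return True
    except RuntimeError:
        return False
```

Here `backward_ok(True)` returns `False` and `backward_ok(False)` returns `True`, even though the ReLU is configured identically in both cases.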
Have you tried setting inplace=False in the ReLU layers? In addition, I have verified that RAFT runs with PyTorch 1.8 + CUDA 11.1; would you like to try that environment?
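To apply the `inplace=False` suggestion across a whole model instead of editing each constructor, one option is a small helper that flips the flag on every submodule exposing it. This is a sketch under stated assumptions: `disable_inplace_activations` is a hypothetical helper, not part of mmflow or mmcv, and it cannot reach in-place ops written directly in a `forward()` (such as `F.relu(x, inplace=True)`, `x.relu_()` or `out += identity`).

```python
import torch
from torch import nn


def disable_inplace_activations(model: nn.Module) -> int:
    """Set ``inplace=False`` on every submodule whose ``inplace`` flag is True.

    Covers modules such as nn.ReLU, nn.LeakyReLU, nn.ELU, nn.ReLU6 and nn.Dropout,
    which read ``self.inplace`` at call time. Returns how many modules changed.
    """
    changed = 0
    for module in model.modules():
        if getattr(module, "inplace", False) is True:
            module.inplace = False
            changed += 1
    return changed


# Usage on a toy model (hypothetical, for illustration only).
model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.ReLU(inplace=True),
    nn.Conv2d(8, 8, 3), nn.ReLU(inplace=True),
)
num_changed = disable_inplace_activations(model)
```

After the call, `num_changed` is 2 and both ReLUs run out-of-place. If the error persists afterwards, the offending in-place write is most likely in functional code inside some `forward()`, which this helper does not modify.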
It works fine on PyTorch 1.7.1.
So I guess the issue may be caused by a bug in PyTorch 1.10.