HsinYingLee / DRIT

Learning diverse image-to-image translation from unpaired data

RuntimeError when training

zycliao opened this issue · comments

Hi, thanks for this amazing work.
Without changing any code, I encountered an error when I ran python train.py. Any idea why this happened?

File "E:/project/DRIT/src/train.py", line 78, in
main()
File "E:/project/DRIT/src/train.py", line 52, in main
model.update_EG()
File "E:\project\DRIT\src\model.py", line 301, in update_EG
self.backward_G_alone()
File "E:\project\DRIT\src\model.py", line 401, in backward_G_alone
loss_z_L1.backward()
File "xxxxxx\anaconda3\lib\site-packages\torch\tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "xxxxxx\anaconda3\lib\site-packages\torch\autograd_init_.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256, 8]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Thanks for raising the problem! What is your PyTorch version?

I met the same problem, and my PyTorch version is 1.6.0.

I have the same issue.

Same issue on PyTorch 1.8.

My PyTorch version is 0.4.0, but I have the same issue!

I had the same problem, and I solved it with this configuration: pytorch==1.3.1, torchvision==0.4.2. Hope it can help you.

Good advice! I solved it following your recommendation.

Same problem here with pytorch==1.7.1, torchvision==0.8.2. I can't downgrade to pytorch==1.3.1 because I'm not an admin on the box I'm using, and I would need to downgrade CUDA to run the earlier PyTorch version.

Looking at some other discussions of this issue, apparently the older versions of PyTorch work because they don't properly check for this kind of in-place operation. See https://discuss.pytorch.org/t/solved-pytorch1-5-runtimeerror-one-of-the-variables-needed-for-gradient-computation-has-been-modified-by-an-inplace-operation/90256. In that same post, they had problems with a GAN, and changing the order of updates to the discriminator and generator ended up fixing it.
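
For anyone who wants to see the failure in isolation, here is a minimal repro sketch (illustration only, not DRIT code; all names are made up). On PyTorch >= 1.5 the last line raises the same "modified by an inplace operation" error, because step() updates the weights in place after a graph that saved them was built:

  import torch

  lin = torch.nn.Linear(2, 2)
  opt = torch.optim.SGD(lin.parameters(), lr=0.1)

  x = torch.randn(4, 2, requires_grad=True)
  out = lin(x)                # the graph saves lin.weight for backward
  loss1 = out.sum()
  loss2 = (out ** 2).sum()    # second loss reuses the same graph

  loss1.backward(retain_graph=True)
  opt.step()                  # in-place update bumps lin.weight's version counter
  loss2.backward()            # RuntimeError: saved tensor was modified in place

This is exactly the pattern in update_EG(): backward_EG() backpropagates through a graph, the step() calls modify the weights in place, and backward_G_alone() then backpropagates through a graph that still references the old weights.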

Long story short, I can get it to run by modifying model.py:update_EG() to run the forward() step once more before backward_G_alone().

Original:

  def update_EG(self):
    # update G, Ec, Ea
    self.enc_c_opt.zero_grad()
    self.enc_a_opt.zero_grad()
    self.gen_opt.zero_grad()
    self.backward_EG()
    self.enc_c_opt.step()
    self.enc_a_opt.step()
    self.gen_opt.step()

    # update G, Ec
    self.enc_c_opt.zero_grad()
    self.gen_opt.zero_grad()
    self.backward_G_alone()
    self.enc_c_opt.step()
    self.gen_opt.step()

Modified:

  def update_EG(self):
    # update G, Ec, Ea
    self.enc_c_opt.zero_grad()
    self.enc_a_opt.zero_grad()
    self.gen_opt.zero_grad()
    self.backward_EG()
    self.enc_c_opt.step()
    self.enc_a_opt.step()
    self.gen_opt.step()

    # update G, Ec
    self.enc_c_opt.zero_grad()
    self.gen_opt.zero_grad()
    self.forward()
    self.backward_G_alone()
    self.enc_c_opt.step()
    self.gen_opt.step()

No clue if this is correct, though, or why it would need another forward() step. I'll post later on whether cat2dog works after this change.

I did attempt to track things down with autograd.set_detect_anomaly(True) at the top of model.py:forward(). I wasn't able to find an in-place operation related to the line in networks.py it flagged, though. I also tried using torchinfo.summary() to print tensor shapes through some of the networks, but never found anything with size [8, 256, 1, 1]. Maybe someone with more familiarity with this project can figure it out.
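
For reference, enabling it is a single call (a sketch; the switch is global, so it can go at the top of train.py just as well as inside forward()):

  import torch

  # Debugging aid only: backward errors will additionally print the
  # forward-pass traceback of the op that produced the failing tensor.
  # It slows training noticeably, so remove it once the culprit is found.
  torch.autograd.set_detect_anomaly(True)

With that enabled, here's what it spit out: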

[W python_anomaly_mode.cpp:104] Warning: Error detected in CudnnConvolutionBackward. Traceback of forward call that caused the error:
  File "train.py", line 78, in <module>
    main()
  File "train.py", line 51, in main
    model.update_D(images_a, images_b)
  File "git/DRIT/src/model.py", line 224, in update_D
    self.forward()
  File "git/DRIT/src/model.py", line 201, in forward
    self.z_attr_random_a, self.z_attr_random_b = self.enc_a.forward(self.fake_A_random, self.fake_B_random)
  File "git/DRIT/src/networks.py", line 185, in forward
    xb = self.model_b(xb)
  File "pytorch-g3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "pytorch-g3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "pytorch-g3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "pytorch-g3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "pytorch-g3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
    self.padding, self.dilation, self.groups)
 (function _print_stack)
Traceback (most recent call last):
  File "train.py", line 78, in <module>
    main()
  File "train.py", line 52, in main
    model.update_EG()
  File "git/DRIT/src/model.py", line 309, in update_EG
    self.backward_G_alone()
  File "git/DRIT/src/model.py", line 409, in backward_G_alone
    loss_z_L1.backward()
  File "/pytorch-g3/lib/python3.6/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "pytorch-g3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [8, 256, 1, 1]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

@wilsonjwcsu
Same problem with pytorch==1.8.1.
Coincidentally, I saw the same thread: https://discuss.pytorch.org/t/solved-pytorch1-5-runtimeerror-one-of-the-variables-needed-for-gradient-computation-has-been-modified-by-an-inplace-operation/90256
To tackle this problem, I simply changed the order of the backward() and step() calls.

From:

  def update_EG(self):
    # update G, Ec, Ea
    self.enc_c_opt.zero_grad()
    self.enc_a_opt.zero_grad()
    self.gen_opt.zero_grad()
    self.backward_EG()
    self.enc_c_opt.step()
    self.enc_a_opt.step()
    self.gen_opt.step()
    # update G, Ec
    self.enc_c_opt.zero_grad()
    self.gen_opt.zero_grad()
    self.backward_G_alone()
    self.enc_c_opt.step()
    self.gen_opt.step()

To:

  def update_EG(self):
    # update G, Ec, Ea
    self.enc_c_opt.zero_grad()
    self.enc_a_opt.zero_grad()
    self.gen_opt.zero_grad()
    # do backward()
    self.backward_EG()
    # update G, Ec
    self.backward_G_alone()
    # do optimisation
    self.enc_c_opt.step()
    self.enc_a_opt.step()
    self.gen_opt.step()

Note that inside backward_EG() you need to keep loss_G.backward(retain_graph=True) (same as the author's code), which may need more memory.
In this situation there is only a single forward() followed by a single step(), so the parameters are not modified in place between building the graph and calling backward() (when step() runs, each leaf tensor is updated in place, roughly x = x - lr * x.grad).
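
To make the version counting concrete, here is a small sketch (illustration only, not DRIT code; _version is an internal attribute, used here just to show the mechanism the error message refers to):

  import torch

  lin = torch.nn.Linear(2, 2)
  opt = torch.optim.SGD(lin.parameters(), lr=0.1)
  lin(torch.randn(4, 2)).sum().backward()
  print(lin.weight._version)  # some version N
  opt.step()                  # in-place: weight <- weight - lr * weight.grad
  print(lin.weight._version)  # N + 1; any graph that saved the old weight
                              # now fails its backward() version check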

Thanks @LiaoFJ. Does it still behave the same after your changes? I'm curious about @HsinYingLee's reasons for the sequence backward_EG() -> optimizer steps -> backward_G_alone() -> optimizer steps instead of backward_EG() -> backward_G_alone() -> optimizer steps.

Thanks @wilsonjwcsu. In my view they should be almost the same; the difference is whether the parameters are updated twice separately or once together.
Sorry, I didn't run a controlled experiment, so I am not sure. Thanks for the reminder; I will test whether it behaves the same after the changes and post the results later.

@LiaoFJ Thank you for your advice. I'm also curious about the question that @wilsonjwcsu mentioned.
I think the error is caused by the in-place modification of self.enc_c and self.gen during step(). When self.backward_G_alone() then backpropagates through the old graph to update self.enc_c again, it raises the RuntimeError.
The solution is to call self.forward() again so that the variables used in self.backward_G_alone() are computed from the updated self.enc_c and self.gen.
The new code looks like this:

  def update_EG(self):
    # update G, Ec, Ea
    self.enc_c_opt.zero_grad()
    self.enc_a_opt.zero_grad()
    self.gen_opt.zero_grad()
    self.backward_EG()
    self.enc_c_opt.step()
    self.enc_a_opt.step()
    self.gen_opt.step()
    # update G, Ec
    self.enc_c_opt.zero_grad()
    self.gen_opt.zero_grad()
    self.forward()  # call forward() to compute variables with the updated network parameters
    self.backward_G_alone()
    self.enc_c_opt.step()
    self.gen_opt.step()