HsinYingLee / DRIT

Learning diverse image-to-image translation from unpaired data

RuntimeError when training

zycliao opened this issue · comments

Hi, thanks for this amazing work.
Without changing any code, I encountered an error when I ran python train.py. Any idea why this happened?

File "E:/project/DRIT/src/train.py", line 78, in
main()
File "E:/project/DRIT/src/train.py", line 52, in main
model.update_EG()
File "E:\project\DRIT\src\model.py", line 301, in update_EG
self.backward_G_alone()
File "E:\project\DRIT\src\model.py", line 401, in backward_G_alone
loss_z_L1.backward()
File "xxxxxx\anaconda3\lib\site-packages\torch\tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "xxxxxx\anaconda3\lib\site-packages\torch\autograd_init_.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256, 8]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Thanks for raising the problem! What is your PyTorch version?

I met the same problem, and my PyTorch version is 1.6.0.

I have the same issue.

Same issue on PyTorch 1.8.

My PyTorch version is 0.4.0, but I have the same issue!

I had the same problem, and I solved it with this configuration: pytorch==1.3.1, torchvision==0.4.2. Hope it can help you.

Good advice! I solved it following your recommendation.

Same problem here with pytorch==1.7.1, torchvision==0.8.2. I can't downgrade to pytorch==1.3.1 because I'm not an admin on the box I'm using, and I would need to downgrade CUDA to run the earlier PyTorch version.

Looking at some other discussions of this issue, apparently the older versions of PyTorch work because they don't properly check for this kind of in-place operation. See https://discuss.pytorch.org/t/solved-pytorch1-5-runtimeerror-one-of-the-variables-needed-for-gradient-computation-has-been-modified-by-an-inplace-operation/90256. In that same post, they had problems with a GAN, and changing the order of updates to the discriminator and generator ended up fixing it.
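
For anyone who wants to see the failure in isolation, here is a minimal repro sketch (illustration only, not DRIT code; all names are made up). On PyTorch >= 1.5 the last line raises the same "modified by an inplace operation" error, because step() updates the weights in place after a graph that saved them was built:

  import torch

  lin = torch.nn.Linear(2, 2)
  opt = torch.optim.SGD(lin.parameters(), lr=0.1)

  x = torch.randn(4, 2, requires_grad=True)
  out = lin(x)                # the graph saves lin.weight for backward
  loss1 = out.sum()
  loss2 = (out ** 2).sum()    # second loss reuses the same graph

  loss1.backward(retain_graph=True)
  opt.step()                  # in-place update bumps lin.weight's version counter
  loss2.backward()            # RuntimeError: saved tensor was modified in place

This is exactly the pattern in update_EG(): backward_EG() backpropagates through a graph, the step() calls modify the weights in place, and backward_G_alone() then backpropagates through a graph that still references the old weights.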

Long story short, I can get it to run by modifying model.py:update_EG() to run the forward() step once more before backward_G_alone().

Original:

  def update_EG(self):
    # update G, Ec, Ea
    self.enc_c_opt.zero_grad()
    self.enc_a_opt.zero_grad()
    self.gen_opt.zero_grad()
    self.backward_EG()
    self.enc_c_opt.step()
    self.enc_a_opt.step()
    self.gen_opt.step()

    # update G, Ec
    self.enc_c_opt.zero_grad()
    self.gen_opt.zero_grad()
    self.backward_G_alone()
    self.enc_c_opt.step()
    self.gen_opt.step()

Modified:

  def update_EG(self):
    # update G, Ec, Ea
    self.enc_c_opt.zero_grad()
    self.enc_a_opt.zero_grad()
    self.gen_opt.zero_grad()
    self.backward_EG()
    self.enc_c_opt.step()
    self.enc_a_opt.step()
    self.gen_opt.step()

    # update G, Ec
    self.enc_c_opt.zero_grad()
    self.gen_opt.zero_grad()
    self.forward()
    self.backward_G_alone()
    self.enc_c_opt.step()
    self.gen_opt.step()

No clue if this is correct, though, or why it would need another forward() step. I'll post later on whether cat2dog works after this change.

I did attempt to track things down with autograd.set_detect_anomaly(True) at the top of model.py:forward(). I wasn't able to find an in-place operation related to the line in networks.py it flagged, though. I also tried using torchinfo.summary() to print tensor shapes through some of the networks, but never found anything with size [8, 256, 1, 1]. Maybe someone with more familiarity with this project can figure it out.
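
For reference, enabling it is a single call (a sketch; the switch is global, so it can go at the top of train.py just as well as inside forward()):

  import torch

  # Debugging aid only: backward errors will additionally print the
  # forward-pass traceback of the op that produced the failing tensor.
  # It slows training noticeably, so remove it once the culprit is found.
  torch.autograd.set_detect_anomaly(True)

With that enabled, here's what it spit out: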

[W python_anomaly_mode.cpp:104] Warning: Error detected in CudnnConvolutionBackward. Traceback of forward call that caused the error:
  File "train.py", line 78, in <module>
    main()
  File "train.py", line 51, in main
    model.update_D(images_a, images_b)
  File "git/DRIT/src/model.py", line 224, in update_D
    self.forward()
  File "git/DRIT/src/model.py", line 201, in forward
    self.z_attr_random_a, self.z_attr_random_b = self.enc_a.forward(self.fake_A_random, self.fake_B_random)
  File "git/DRIT/src/networks.py", line 185, in forward
    xb = self.model_b(xb)
  File "pytorch-g3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "pytorch-g3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "pytorch-g3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "pytorch-g3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "pytorch-g3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
    self.padding, self.dilation, self.groups)
 (function _print_stack)
Traceback (most recent call last):
  File "train.py", line 78, in <module>
    main()
  File "train.py", line 52, in main
    model.update_EG()
  File "git/DRIT/src/model.py", line 309, in update_EG
    self.backward_G_alone()
  File "git/DRIT/src/model.py", line 409, in backward_G_alone
    loss_z_L1.backward()
  File "/pytorch-g3/lib/python3.6/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "pytorch-g3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [8, 256, 1, 1]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

@wilsonjwcsu
Same problem with pytorch==1.8.1.
Coincidentally, I saw the same thread: https://discuss.pytorch.org/t/solved-pytorch1-5-runtimeerror-one-of-the-variables-needed-for-gradient-computation-has-been-modified-by-an-inplace-operation/90256
To tackle this problem, I simply changed the order of the backward() and step() calls.

From:

  def update_EG(self):
    # update G, Ec, Ea
    self.enc_c_opt.zero_grad()
    self.enc_a_opt.zero_grad()
    self.gen_opt.zero_grad()
    self.backward_EG()
    self.enc_c_opt.step()
    self.enc_a_opt.step()
    self.gen_opt.step()
    # update G, Ec
    self.enc_c_opt.zero_grad()
    self.gen_opt.zero_grad()
    self.backward_G_alone()
    self.enc_c_opt.step()
    self.gen_opt.step()

To:

  def update_EG(self):
    # update G, Ec, Ea
    self.enc_c_opt.zero_grad()
    self.enc_a_opt.zero_grad()
    self.gen_opt.zero_grad()
    # do backward()
    self.backward_EG()
    # update G, Ec
    self.backward_G_alone()
    # do optimisation
    self.enc_c_opt.step()
    self.enc_a_opt.step()
    self.gen_opt.step()

Note that inside backward_EG() you need to keep loss_G.backward(retain_graph=True) (same as the author's code), which may need more memory.
In this situation there is only a single forward() followed by a single step(), so the parameters are not modified in place between building the graph and calling backward() (when step() runs, each leaf tensor is updated in place, roughly x = x - lr * x.grad).
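
To make the version counting concrete, here is a small sketch (illustration only, not DRIT code; _version is an internal attribute, used here just to show the mechanism the error message refers to):

  import torch

  lin = torch.nn.Linear(2, 2)
  opt = torch.optim.SGD(lin.parameters(), lr=0.1)
  lin(torch.randn(4, 2)).sum().backward()
  print(lin.weight._version)  # some version N
  opt.step()                  # in-place: weight <- weight - lr * weight.grad
  print(lin.weight._version)  # N + 1; any graph that saved the old weight
                              # now fails its backward() version check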

Thanks @LiaoFJ. Does it still behave the same after your changes? I'm curious about @HsinYingLee's reasons for the sequence backward_EG() -> optimizer steps -> backward_G_alone() -> optimizer steps instead of backward_EG() -> backward_G_alone() -> optimizer steps.

Thanks @wilsonjwcsu. In my view they should be almost the same; the difference is whether the parameters are updated twice separately or once together.
Sorry, I didn't run a controlled experiment, so I am not sure. Thanks for the reminder; I will test whether it behaves the same after the changes and post the results later.

@LiaoFJ Thank you for your advice. I'm also curious about the question that @wilsonjwcsu mentioned.
I think the error is caused by the in-place modification of self.enc_c and self.gen during step(). When self.backward_G_alone() then backpropagates through the old graph to update self.enc_c again, it raises the RuntimeError.
The solution is to call self.forward() again so that the variables used in self.backward_G_alone() are computed from the updated self.enc_c and self.gen.
The new code looks like this:

  def update_EG(self):
    # update G, Ec, Ea
    self.enc_c_opt.zero_grad()
    self.enc_a_opt.zero_grad()
    self.gen_opt.zero_grad()
    self.backward_EG()
    self.enc_c_opt.step()
    self.enc_a_opt.step()
    self.gen_opt.step()
    # update G, Ec
    self.enc_c_opt.zero_grad()
    self.gen_opt.zero_grad()
    self.forward()  # call forward() to compute variables with the updated network parameters
    self.backward_G_alone()
    self.enc_c_opt.step()
    self.gen_opt.step()