IyatomiLab / LeafGAN


Error when changing batch size value

Marco-Nguyen opened this issue · comments

Excuse me, I am planning to try your LeafGAN model, and I have a question about the batch size.

  • I have read the FAQ and Training Tips of CycleGAN.
  • I have read the Issues of CycleGAN.

but I have not found the same problem there.

I tried increasing the batch size to 32, 16, ..., and every time it throws an error like the one below (this trace is from a run with a batch size of 4):

File "train.py", line 51, in <module>
    model.optimize_parameters()   # calculate loss functions, get gradients, update network weights
  File "/content/drive/My Drive/LeafGAN/LeafGAN/models/leaf_gan_model.py", line 285, in optimize_parameters
    self.forward()      # compute fake images and reconstruction images.
  File "/content/drive/My Drive/LeafGAN/LeafGAN/models/leaf_gan_model.py", line 176, in forward
    self.background_real_A, self.foreground_real_A = self.get_masking(self.real_A, self.opt.threshold)
  File "/content/drive/My Drive/LeafGAN/LeafGAN/models/leaf_gan_model.py", line 121, in get_masking
    self.netLFLSeg.backward(idx=0) # 0 for getting heatmap for "fully_leaf" class
  File "/content/drive/My Drive/LeafGAN/LeafGAN/models/grad_cam.py", line 38, in backward
    self.preds.backward(gradient=one_hot, retain_graph=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 150, in backward
    grad_tensors_ = _make_grads(tensors, grad_tensors_)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 38, in _make_grads
    + str(out.shape) + ".")
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([1, 3]) and output[0] has a shape of torch.Size([4, 3]).

This means I can only train with the default batch size (1), and one epoch takes quite a long time (2477 seconds on my dataset). That's quite time-consuming. Could you please answer my questions?

  1. Did you use different batch sizes in your experiments?
  2. How can I fix this problem?

I will continue to search on the Internet for answers. Thank you in advance.

Hi @Marco-Nguyen

Thanks for your question!

LeafGAN is based on CycleGAN and uses the default batch size of 1.
The reason is (probably) to allow training with larger input images.

I haven't tried training my model with a bigger batch size yet, but I will consider modifying the code to support it.

One possible way to fix this problem is to modify the Image Pooling (image buffer) mechanism in CycleGAN (see the details in the CycleGAN paper, Section 4. Implementation - Training details).
Specifically, CycleGAN/LeafGAN randomly selects 1 image stored in the buffer (by default, this buffer stores 50 images) and feeds it to the discriminator. If we modify the code to randomly select N images (the batch size) from the image buffer, we should be able to train with a bigger batch size; a rough sketch of this idea is below.
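A minimal sketch of that idea, assuming a stand-alone buffer class (the class and names here are hypothetical; this is not the actual CycleGAN util/image_pool.py code, which may differ):

    import random
    import torch

    class BatchImagePool:
        # Sketch of a batch-aware image buffer: store previously generated
        # images and, for each of the N incoming images, either return it
        # directly or swap it with a randomly chosen buffered image.
        def __init__(self, pool_size=50):
            self.pool_size = pool_size
            self.images = []  # list of 1 x C x H x W tensors

        def query(self, batch):
            returned = []
            for image in batch:  # iterate over the N images in the batch
                image = image.detach().unsqueeze(0)
                if len(self.images) < self.pool_size:
                    self.images.append(image)  # fill the buffer first
                    returned.append(image)
                elif random.random() < 0.5:
                    idx = random.randrange(self.pool_size)
                    returned.append(self.images[idx].clone())  # reuse an old image
                    self.images[idx] = image                   # store the new one
                else:
                    returned.append(image)  # keep the freshly generated image
            return torch.cat(returned, 0)  # N x C x H x W

The discriminator would then be fed the N images returned by query() instead of a single buffered image.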

I have another question, @huuquan1994, and I don't want to open another issue for it.
I have read some of the issues in CycleGAN, but I am still confused about how to continue training.
Let's assume I trained the first 2 epochs by setting --niter 2 --niter_decay 0 (at the end of the process the log shows: saving the model at the end of epoch 2, iters 2800, End of epoch 2 / 2 Time Taken: 5338 sec). When I come back to continue training, which values should I set for:

  • --epoch_count
  • --niter

provided that I save my model by setting --save_epoch_freq 1 --save_latest_freq 100.
Currently, I set --epoch_count 3 --niter 4 to train the next 2 epochs. Is this correct?

Also, at the end of training, I get the message learning rate = 0.0000000, even though I set --niter_decay 0. I understand that parameter to mean it takes 0 epochs to linearly decay the learning rate to 0, so the learning rate should not have decayed to 0 by the end of training. Please correct me if I am wrong.

Thanks in advance, have a nice weekend!

@Marco-Nguyen
Hey, sorry for my late reply!

As I understand it, the total number of training epochs equals --niter + --niter_decay.
That means after --niter epochs, the learning rate is linearly decayed over --niter_decay epochs until it reaches 0.

The way CycleGAN updates the learning rate is shown in this line of code.
So even if you set --niter_decay to 0, the learning rate will still be 0 by the end of training. (It depends on the learning rate policy you set; by default, it decays linearly.) A simplified sketch of the decay rule is below.
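This is roughly what the linear decay rule looks like (a simplified sketch; the exact epoch offsets in CycleGAN's get_scheduler may differ slightly between versions):

    # Simplified sketch of CycleGAN's linear learning-rate multiplier.
    # It stays at 1.0 for the first `niter` epochs and then decays
    # linearly over `niter_decay` epochs.
    def lambda_rule(epoch, epoch_count=1, niter=2, niter_decay=0):
        return 1.0 - max(0, epoch + epoch_count - niter) / float(niter_decay + 1)

    # With --niter 2 --niter_decay 0 the denominator is 1, so the multiplier
    # drops from 1.0 to 0.0 as soon as the last epoch finishes, which is why
    # the log prints "learning rate = 0.0000000" at the end of training.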

When I resume training, I normally add --continue_train --epoch_count XX to the command.
The --epoch_count XX tells the code to resume training from epoch XX; by default, the latest saved model is loaded.
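For example, continuing the 2-epoch run above for 2 more epochs could look something like this (an illustrative command only; the flag names follow the CycleGAN-style options, so adjust the dataset path, experiment name, and model flag to your setup):

    python train.py --dataroot ./datasets/your_dataset --name your_experiment --model leaf_gan --continue_train --epoch_count 3 --niter 4 --niter_decay 0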

Hope this helps!

I just noticed that I used an old version of the original CycleGAN repo (where the code is not very clear). I will update the code when I have time, but for now there is no problem using the current code.