sshh12 / terrain-diffusion

RuntimeError: stack expects each tensor to be equal size at train_dataloader

fatemehtd opened this issue

I was trying to run the training script, but once the train dataloader is iterated in the training loop it raises the RuntimeError below. I am training on 8 GPUs, and the mismatched tensor sizes in the error message are different on each GPU; different attempts also produce different dimensions. Could you please let me know how to fix this error?

Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12365/12365 [00:00<00:00, 19870.97it/s]
02/06/2024 23:19:07 - INFO - main - ***** Running training *****
02/06/2024 23:19:07 - INFO - main - Num examples = 12302
02/06/2024 23:19:07 - INFO - main - Num Epochs = 100
02/06/2024 23:19:07 - INFO - main - Instantaneous batch size per device = 4
02/06/2024 23:19:07 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 128
02/06/2024 23:19:07 - INFO - main - Gradient Accumulation steps = 4
02/06/2024 23:19:07 - INFO - main - Total optimization steps = 9700
Steps: 0%| | 0/9700 [00:00<?, ?it/s]Traceback (most recent call last):
File "../train_text_to_image_lora_sd2_inpaint.py", line 1320, in
main()
File "../train_text_to_image_lora_sd2_inpaint.py", line 1047, in main
for step, batch in enumerate(train_dataloader):
File "../accelerate/data_loader.py", line 448, in iter
current_batch = next(dataloader_iter)

File ".../train_text_to_image_lora_sd2_inpaint.py", line 934, in collate_fn
pixel_values = _collate_imgs([example["pixel_values"] for example in examples])

File "..train_text_to_image_lora_sd2_inpaint.py", line 930, in _collate_imgs
vals = torch.stack(vals)

[each of the 8 GPUs reports different tensor sizes in its error message]
RuntimeError: stack expects each tensor to be equal size, but got [3, 512, 770] at entry 0 and [3, 512, 768] at entry 1
RuntimeError: stack expects each tensor to be equal size, but got [3, 512, 682] at entry 0 and [3, 512, 768] at entry 1
RuntimeError: stack expects each tensor to be equal size, but got [3, 512, 768] at entry 0 and [3, 512, 771] at entry 1
RuntimeError: stack expects each tensor to be equal size, but got [3, 512, 768] at entry 0 and [3, 725, 512] at entry 1
RuntimeError: stack expects each tensor to be equal size, but got [3, 512, 780] at entry 0 and [3, 512, 663] at entry 1
RuntimeError: stack expects each tensor to be equal size, but got [3, 512, 768] at entry 0 and [3, 512, 767] at entry 1
RuntimeError: stack expects each tensor to be equal size, but got [3, 512, 771] at entry 0 and [3, 512, 773] at entry 1
RuntimeError: stack expects each tensor to be equal size, but got [3, 512, 910] at entry 0 and [3, 512, 767] at entry 1

The most likely culprit is that your images have different sizes. In theory the script should automatically resize them to 512x512, but I would try preprocessing them to 512x512 ahead of time to see if that fixes it.
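If it helps, here's a minimal preprocessing sketch, assuming your images sit in a flat directory. The SRC_DIR / DST_DIR paths and the PIL-based resize are just illustrative, not part of this repo's scripts:

```python
import os
from PIL import Image

SRC_DIR = "data/images"      # hypothetical input directory
DST_DIR = "data/images_512"  # hypothetical output directory
os.makedirs(DST_DIR, exist_ok=True)

for name in os.listdir(SRC_DIR):
    if not name.lower().endswith((".png", ".jpg", ".jpeg")):
        continue
    img = Image.open(os.path.join(SRC_DIR, name)).convert("RGB")
    # Force every image to exactly 512x512 so torch.stack later sees uniform shapes.
    img = img.resize((512, 512), Image.BICUBIC)
    img.save(os.path.join(DST_DIR, name))
```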

I will also add that I've never tested this on a multi-GPU setup before, so other parts of the script may not be supported either.
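If preprocessing alone doesn't do it, you could also make the collate step defensive. This is only a sketch against the name in your traceback (_collate_imgs), not the script's actual code, and it assumes bilinearly resizing the pixel_values tensors to 512x512 is acceptable for your data:

```python
import torch
import torch.nn.functional as F

TARGET_HW = (512, 512)  # assumed target size, matching the intended 512x512 resolution

def _collate_imgs(vals):
    # torch.stack requires every tensor to have the same shape, so resize any
    # tensor whose spatial dims don't match before stacking.
    fixed = []
    for v in vals:  # v has shape [C, H, W]
        if v.shape[-2:] != TARGET_HW:
            v = F.interpolate(v.unsqueeze(0), size=TARGET_HW,
                              mode="bilinear", align_corners=False).squeeze(0)
        fixed.append(v)
    return torch.stack(fixed)
```

Center-cropping instead of interpolating would avoid distorting aspect ratios, but either way the shapes become stackable.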