training error on colab

Question

hb0313 opened this issue 2 years ago · comments

Harshad Bhandwaldar commented 2 years ago

My setup for training on Colab completed successfully. However, when I run

!python tools/train.py --cfg configs/CONFIG_FILE.yaml

I get this error:

Found 20210 training images.
Found 2000 validation images.
Epoch: [1/500] Iter: [0/2526] LR: 0.00100000 Loss: 0.00000000: 0% 0/2526 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "tools/train.py", line 128, in <module>
    main(cfg, gpu, save_dir)
  File "tools/train.py", line 69, in main
    for iter, (img, lbl) in pbar:
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 461, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/content/semantic-segmentation/semseg/datasets/ade20k.py", line 73, in __getitem__
    image, label = self.transform(image, label)
  File "/content/semantic-segmentation/semseg/augmentations.py", line 20, in __call__
    img, mask = transform(img, mask)
  File "/content/semantic-segmentation/semseg/augmentations.py", line 329, in __call__
    mask = TF.pad(mask, padding, fill=self.seg_fill)
  File "/usr/local/lib/python3.7/dist-packages/torchvision/transforms/functional.py", line 481, in pad
    return F_t.pad(img, padding=padding, fill=fill, padding_mode=padding_mode)
  File "/usr/local/lib/python3.7/dist-packages/torchvision/transforms/functional_tensor.py", line 418, in pad
    img = torch_pad(img, p, mode=padding_mode, value=float(fill))
RuntimeError: value cannot be converted to type uint8_t without overflow

sithu3 · Answer 1 · Wed Sep 14 2022 13:25:09 GMT+0800 (China Standard Time)

I think this is a PyTorch version mismatch error. Please try a different PyTorch version.

scl666 · Answer 2 · Tue Oct 11 2022 15:40:45 GMT+0800 (China Standard Time)

Hello, I am training on CamVid and get this error:
min_value = pred[min(self.min_kept, pred.numel() - 1)]
IndexError: index -1 is out of bounds for dimension 0 with size 0
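
The message "dimension 0 with size 0" indicates that pred has no elements, so pred.numel() - 1 evaluates to -1 and the index lookup fails. The thread does not say why pred is empty; one possibility (an assumption, not confirmed here) is that every pixel in the batch was filtered out as ignored. A minimal sketch of a guard, using a plain Python list in place of the tensor and a hypothetical helper name:

```python
def kth_value_or_default(sorted_pred, min_kept, default=None):
    """Hypothetical helper mirroring pred[min(min_kept, pred.numel() - 1)].

    Returns `default` instead of raising when `sorted_pred` is empty.
    """
    if len(sorted_pred) == 0:
        # Nothing to index: the original expression would compute index -1 here.
        return default
    # Clamp the index so it never runs past the last element.
    return sorted_pred[min(min_kept, len(sorted_pred) - 1)]


print(kth_value_or_default([], 10))               # empty input, falls back to default
print(kth_value_or_default([0.1, 0.2, 0.3], 10))  # index clamped to the last element
```

What the caller should do when the fallback is returned (skip the batch, return a zero loss, and so on) depends on the training code and is not covered by this sketch.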

Ilana karimov · Answer 3 · Sat Jun 17 2023 00:03:54 GMT+0800 (China Standard Time)

Hello,

I encountered the same error, and updating torch and torchvision did not resolve it. The issue appears to arise when seg_fill receives a value of -1, as defined in the ade20k config file (IGNORE_LABEL: -1). Changing the value of IGNORE_LABEL resolved the problem. Could you please advise on the appropriate value that IGNORE_LABEL should be set to?
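
For reference, the overflow follows from the mask dtype: the traceback shows the mask being padded as a uint8 tensor, and uint8 can only hold values from 0 to 255, so a fill of -1 cannot be represented. A minimal sketch of a range check, using hypothetical helper names and assuming 255 as a fallback (a commonly used ignore index in segmentation code, but not a value confirmed for this repository):

```python
# Range of values representable by an unsigned 8-bit integer (uint8).
UINT8_MIN, UINT8_MAX = 0, 255


def fits_uint8(value: int) -> bool:
    """Return True if `value` can be stored in a uint8 without overflow."""
    return UINT8_MIN <= value <= UINT8_MAX


def choose_seg_fill(ignore_label: int, fallback: int = 255) -> int:
    """Hypothetical helper: pick a padding fill that is safe for a uint8 mask.

    `fallback=255` is an assumption, not a value taken from the repository.
    """
    return ignore_label if fits_uint8(ignore_label) else fallback


print(fits_uint8(-1))        # -1 is outside the uint8 range
print(choose_seg_fill(-1))   # falls back to the assumed value
print(choose_seg_fill(0))    # already representable, kept as-is
```

Whichever value is chosen, it must not collide with a real class index, and the loss function's ignore index has to be set to the same value so that padded pixels are actually skipped.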