sithu31296 / semantic-segmentation

SOTA Semantic Segmentation Models in PyTorch

RuntimeError: CUDA error: an illegal memory access was encountered

StuLiu opened this issue

The error occurred after changing batch_size from 8 to 4 in cityscapes.yaml.

Found 2975 train images.
Found 500 val images.
Epoch: [1/500] Iter: [1/185] LR: 0.00010049 Loss: 7.71337414: 1%|▌ | 1/185 [00:03<11:41, 3.81s/it]
Traceback (most recent call last):
File "tools/train.py", line 153, in
main(cfg, gpu, save_dir)
File "tools/train.py", line 97, in main
scaler.scale(loss).backward()
File "/home/liuwang/Softwares/anaconda3/envs/semseg/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/liuwang/Softwares/anaconda3/envs/semseg/lib/python3.8/site-packages/torch/autograd/init.py", line 147, in backward
Variable._execution_engine.run_backward(
RuntimeError: CUDA error: an illegal memory access was encountered
(semseg) liuwang@liuwang-OMEN-30L:~/Documents/projects/semantic-segmentation$ CUDA_LAUNCH_BLOCKING=1 python tools/train.py --cfg configs/DDRNet/cityscapes.yaml
Found 2975 train images.
Found 500 val images.
Epoch: [1/500] Iter: [1/743] LR: 0.00010012 Loss: 7.53493118: 0%|▏ | 1/743 [00:03<37:21, 3.02s/it]
Traceback (most recent call last):
File "tools/train.py", line 153, in
main(cfg, gpu, save_dir)
File "tools/train.py", line 95, in main
loss = loss_fn(logits, lbl)
File "/home/liuwang/Softwares/anaconda3/envs/semseg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/liuwang/Documents/projects/semantic-segmentation/semseg/losses.py", line 43, in forward
return sum([w * self._forward(pred, labels) for (pred, w) in zip(preds, self.aux_weights)])
File "/home/liuwang/Documents/projects/semantic-segmentation/semseg/losses.py", line 43, in
return sum([w * self._forward(pred, labels) for (pred, w) in zip(preds, self.aux_weights)])
File "/home/liuwang/Documents/projects/semantic-segmentation/semseg/losses.py", line 33, in _forward
loss = self.criterion(preds, labels).view(-1)
File "/home/liuwang/Softwares/anaconda3/envs/semseg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/liuwang/Softwares/anaconda3/envs/semseg/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1120, in forward
return F.cross_entropy(input, target, weight=self.weight,
File "/home/liuwang/Softwares/anaconda3/envs/semseg/lib/python3.8/site-packages/torch/nn/functional.py", line 2824, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: CUDA error: an illegal memory access was encountered
Exception in thread Thread-2:
Traceback (most recent call last):
File "/home/liuwang/Softwares/anaconda3/envs/semseg/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/home/liuwang/Softwares/anaconda3/envs/semseg/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/liuwang/Softwares/anaconda3/envs/semseg/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
File "/home/liuwang/Softwares/anaconda3/envs/semseg/lib/python3.8/multiprocessing/queues.py", line 116, in get
return _ForkingPickler.loads(res)
File "/home/liuwang/Softwares/anaconda3/envs/semseg/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 289, in rebuild_storage_fd
fd = df.detach()
File "/home/liuwang/Softwares/anaconda3/envs/semseg/lib/python3.8/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/home/liuwang/Softwares/anaconda3/envs/semseg/lib/python3.8/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/home/liuwang/Softwares/anaconda3/envs/semseg/lib/python3.8/multiprocessing/connection.py", line 508, in Client
answer_challenge(c, authkey)
File "/home/liuwang/Softwares/anaconda3/envs/semseg/lib/python3.8/multiprocessing/connection.py", line 752, in answer_challenge
message = connection.recv_bytes(256) # reject large message
File "/home/liuwang/Softwares/anaconda3/envs/semseg/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/liuwang/Softwares/anaconda3/envs/semseg/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/home/liuwang/Softwares/anaconda3/envs/semseg/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

How large is the RAM on your GPU?

How large is the RAM on your GPU?

RTX 3090 (24 GB)

I just enabled Automatic Mixed Precision (AMP) in my config file and am now able to train MiT-B5/SegFormer on a Colab V100.
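For reference, this is just flipping the AMP flag in the TRAIN block of the config (the same field that appears in the config quoted later in this thread); a minimal sketch:

```yaml
TRAIN:
  AMP : true   # use automatic mixed precision in training (the default configs ship with false here)
```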

@StuLiu Many people have also faced this illegal memory access error, but I still haven't. What is your PyTorch version? If you happen to fix it, please leave a solution here.

@StuLiu Many people have also faced this illegal memory access error, but I still haven't. What is your PyTorch version? If you happen to fix it, please leave a solution here.

torch==1.9.0+cu111.
This error didn't occur when I trained SFNet on the Cityscapes dataset; however, it occurred when training DDRNet.

The error disappeared when I changed the DDRNet config file as follows:
```yaml
DEVICE   : cuda      # device used for training and evaluation (cpu, cuda, cuda0, cuda1, ...)
SAVE_DIR : 'output'  # output folder name used for saving the model, logs and inference results

MODEL:
  NAME       : DDRNet          # name of the model you are using
  BACKBONE   : DDRNet-23slim   # model variant
  PRETRAINED : 'checkpoints/backbones/ddrnet/DDRNet23s_imagenet.pth'  # backbone model's weight

DATASET:
  NAME         : CityScapes         # dataset name to be trained with (camvid, cityscapes, ade20k)
  ROOT         : 'data/cityscapes'  # dataset root path
  IGNORE_LABEL : 255

TRAIN:
  IMAGE_SIZE    : [1024, 1024]  # training image size in (h, w)
  BATCH_SIZE    : 16            # batch size used to train
  EPOCHS        : 500           # number of epochs to train
  EVAL_INTERVAL : 10            # evaluation interval during training
  AMP           : false         # use AMP in training
  DDP           : false         # use DDP training

LOSS:
  NAME        : OhemCrossEntropy  # loss function name (ohemce, ce, dice)
  CLS_WEIGHTS : true              # use class weights in loss calculation

OPTIMIZER:
  NAME         : adamw  # optimizer name
  LR           : 0.001  # initial learning rate used in optimizer
  WEIGHT_DECAY : 0.01   # decay rate used in optimizer

SCHEDULER:
  NAME         : warmuppolylr  # scheduler name
  POWER        : 0.9           # scheduler power
  WARMUP       : 10            # warmup epochs used in scheduler
  WARMUP_RATIO : 0.1           # warmup ratio

EVAL:
  MODEL_PATH : 'output/DDRNet23_slim_citycapes.pth'  # trained model file path
  IMAGE_SIZE : [1024, 1024]                          # evaluation image size in (h, w)
  MSF:
    ENABLE : false                                   # multi-scale and flip evaluation
    FLIP   : true                                    # use flip in evaluation
    SCALES : [0.5, 0.75, 1.0, 1.25, 1.5, 1.75]       # scales used in MSF evaluation

TEST:
  MODEL_PATH : 'checkpoints/pretrained/ddrnet/ddrnet_23_city.pth'  # trained model file path
  FILE       : 'assests/cityscapes'                                # filename or foldername
  IMAGE_SIZE : [1024, 1024]                                        # inference image size in (h, w)
  OVERLAY    : true                                                # save the overlay result (image_alpha+label_alpha)
```

Is your training on a single GPU? I used your same config on a single GPU (RTX 8000, 48 GB) but faced the same problem. However, when I changed IMAGE_SIZE to [512, 512], training worked fine.
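In the config that is just the TRAIN image size (the same field as in the config quoted above):

```yaml
TRAIN:
  IMAGE_SIZE : [512, 512]   # training image size in (h, w); training ran fine at this size on the setup above
```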

I met this problem too. I think the error is caused by the labels. I use torch 1.8.1+cu111, and it throws this error during the loss calculation. After checking the loss, I found that the label id is wrong here: the label id -1 is invalid when calculating the CE loss. Am I right?

I met this problem too. I think the error is caused by the labels. I use torch 1.8.1+cu111, and it throws this error during the loss calculation. After checking the loss, I found that the label id is wrong here: the label id -1 is invalid when calculating the CE loss. Am I right?

I removed `-1 : -1` in ID2TRAINID and it works.
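For anyone hitting this: `F.cross_entropy` only accepts target values in `[0, num_classes)` or equal to `ignore_index`; a stray `-1` typically trips a device-side assert on CUDA, which then surfaces as an opaque illegal memory access like the one above. A minimal sketch (not the repo's code; the helper name is made up) for checking the labels a dataloader produces before they reach the loss:

```python
import torch
import torch.nn.functional as F

num_classes, ignore_index = 19, 255  # Cityscapes train ids plus the usual ignore label

def check_labels(labels: torch.Tensor) -> None:
    """Assert that every label is either a valid class id or the ignore index."""
    valid = (labels == ignore_index) | ((labels >= 0) & (labels < num_classes))
    if not bool(valid.all()):
        bad = torch.unique(labels[~valid])
        raise ValueError(f"invalid label ids for cross_entropy: {bad.tolist()}")

# example: a label map containing a stray -1 (as produced by an id->trainId table with -1 in it)
labels = torch.tensor([[0, 5, 255], [18, -1, 3]])
logits = torch.randn(1, num_classes, 2, 3)

check_labels(labels)  # raises ValueError here instead of crashing inside the CUDA kernel
loss = F.cross_entropy(logits, labels.unsqueeze(0), ignore_index=ignore_index)
```

With a check like this in place you get a readable Python exception instead of the CUDA error.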

I met this problem too. I think the error is caused by the labels. I use torch 1.8.1+cu111, and it throws this error during the loss calculation. After checking the loss, I found that the label id is wrong here: the label id -1 is invalid when calculating the CE loss. Am I right?

I removed `-1 : -1` in ID2TRAINID and it works.

Thank you, it works for me too.

The same error here.
For custom datasets, the ignore_label should be set carefully.
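For example, a common safeguard (just a sketch; `IGNORE_LABEL` and the mapping below are made-up placeholders for a custom dataset) is to map every raw id without a train id, including -1, to the ignore label during label encoding, so nothing outside `[0, num_classes)` ever reaches the loss:

```python
import numpy as np

IGNORE_LABEL = 255                 # must match the IGNORE_LABEL / ignore_index used by the loss
ID2TRAINID = {7: 0, 8: 1, 11: 2}   # hypothetical raw-id -> train-id table; only valid classes listed

def encode_label(label_map: np.ndarray) -> np.ndarray:
    """Map raw ids to train ids; any unmapped id (including -1) becomes the ignore label."""
    encoded = np.full_like(label_map, IGNORE_LABEL)
    for raw_id, train_id in ID2TRAINID.items():
        encoded[label_map == raw_id] = train_id
    return encoded
```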