CUDA error during KIE training: an illegal memory access was encountered.

Question

CUDA error during KIE training: an illegal memory access was encountered.

wangpf09 opened this issue 2 years ago · comments

Please provide the following information to quickly locate the problem

Training fails both with wildreceipt and with our own dataset. The error persists even after reducing batch_size to 1.

System Environment: Windows 11, GPU: 3060
Version: Paddle 2.3.0.post112, PaddleOCR 2.5.0.3
Related components:
Command Code: python ./train.py -c ../configs/kie/kie_unet_sdmgr.yml -o Global.pretrained_model=../pretrained_model/kie_vgg16/best_accuracy.pdparams
Complete Error Message:

W0625 09:48:01.554232 15140 gpu_context.cc:306] device: 0, cuDNN Version: 8.2.
[2022/06/25 09:48:04] ppocr INFO: load pretrain successful from ../pretrained_model/kie_vgg16/best_accuracy
[2022/06/25 09:48:04] ppocr INFO: train dataloader has 1267 iters
[2022/06/25 09:48:04] ppocr INFO: valid dataloader has 472 iters
[2022/06/25 09:48:04] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 80 iterations
Traceback (most recent call last):
  File "E:\project-space\ocr-tools\PaddleOCR\tools\train.py", line 191, in <module>
    main(config, device, logger, vdl_writer)
  File "E:\project-space\ocr-tools\PaddleOCR\tools\train.py", line 164, in main
    program.train(config, train_dataloader, valid_dataloader, device, model,
  File "E:\project-space\ocr-tools\PaddleOCR\tools\program.py", line 264, in train
    preds = model(batch)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "E:\project-space\ocr-tools\PaddleOCR\ppocr\modeling\architectures\base_model.py", line 85, in forward
    x = self.head(x, targets=data)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "E:\project-space\ocr-tools\PaddleOCR\ppocr\modeling\heads\kie_sdmgr_head.py", line 90, in forward
    nodes = self.fusion([x, nodes])
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "E:\project-space\ocr-tools\PaddleOCR\ppocr\modeling\heads\kie_sdmgr_head.py", line 189, in forward
    z = F.normalize(z)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\nn\functional\norm.py", line 88, in normalize
    eps = fluid.dygraph.base.to_variable([epsilon], dtype=x.dtype)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\wrapped_decorator.py", line 25, in __impl__
    return wrapped_func(*args, **kwargs)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\framework.py", line 434, in __impl__
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\base.py", line 763, in to_variable
    py_var = core.VarBase(
OSError: (External) CUDA error(700), an illegal memory access was encountered. 
  [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:258)


Process finished with exit code 1
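
A note on reading this traceback: CUDA reports kernel errors asynchronously, so the Python frame that raises (here `to_variable` inside `F.normalize`) is not necessarily the operation that performed the invalid access. One common way to narrow it down is to rerun with the standard CUDA environment variable `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous. The sketch below is only an illustration of that idea: the helper names `build_debug_env` and `run_with_blocking_launches` are hypothetical, and the command is the one quoted in the report.

```python
import os
import subprocess
import sys

# Training command copied from the report; the relative paths are the reporter's.
TRAIN_COMMAND = [
    sys.executable, "./train.py",
    "-c", "../configs/kie/kie_unet_sdmgr.yml",
    "-o", "Global.pretrained_model=../pretrained_model/kie_vgg16/best_accuracy.pdparams",
]


def build_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled.

    Hypothetical helper: it never modifies the mapping passed in.
    """
    env = dict(os.environ if base_env is None else base_env)
    # With CUDA_LAUNCH_BLOCKING=1 each kernel launch blocks until it completes,
    # so an error is raised close to the call that actually caused it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking_launches(command=None):
    """Run the training command in a child process and return its exit code."""
    if command is None:
        command = TRAIN_COMMAND
    return subprocess.run(command, env=build_debug_env(), check=False).returncode
```

Calling `run_with_blocking_launches()` from the `tools` directory would rerun the training with synchronous launches. Training will be slower in this mode, so it is only meant for locating the failing operation.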

Yifei Chen · Answer 1 · Tue Jun 28 2022 19:24:27 GMT+0800 (China Standard Time)

I ran into this problem on Windows too, and it went away after I switched to WSL. It looks like an issue between the Windows driver and paddlepaddle-gpu.

#6533

wangpf · Answer 2 · Tue Jun 28 2022 19:33:59 GMT+0800 (China Standard Time)

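CPU training works fine for me, while GPU training always hits this error, and I can train normally with the wildreceipt dataset too. I'll give WSL a try.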

Yifei Chen · Answer 3 · Tue Jun 28 2022 20:31:11 GMT+0800 (China Standard Time)

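My situation is exactly the same as yours: the error occurs at the same place, and CPU training works fine for me as well. In any case, everything worked normally once I switched to WSL.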

wangpf · Answer 4 · Tue Jun 28 2022 21:45:05 GMT+0800 (China Standard Time)

That's frustrating. The maintainers aren't doing anything about this issue either.

gjj123 · Answer 5 · Mon Feb 20 2023 13:08:46 GMT+0800 (China Standard Time)

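I hit the same error. It seems to come from this statement around line 60 of kie_sdmgr_head.py:
char_nums.append(paddle.sum((text > -1).astype(int), axis=-1))
When Paddle casts bool to int here, True does not become 1 but a very large number. As a result, the following call
paddle.concat([text, paddle.zeros((text.shape[0], max_num - text.shape[1]))], -1)
tries to create a huge array and runs out of memory.
The problem occurs with paddlepaddle-gpu==2.3.2, CUDA 11.0 on an A100.
After switching to a V100 with CUDA 10.2, the error no longer appears.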
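
To test the hypothesis above, the GPU result of the counting step can be compared against a reference computed on the host. The following is a minimal sketch in plain Python, not code from PaddleOCR: the function names `count_valid_chars` and `check_char_nums` are hypothetical, and it assumes, as the quoted line does, that entries greater than -1 are the ones to count.

```python
def count_valid_chars(text_rows):
    """Count the entries greater than -1 in each row of a 2-D list."""
    return [sum(1 for value in row if value > -1) for row in text_rows]


def check_char_nums(text_rows, reported_counts):
    """Compare reported per-row counts with a host-side reference.

    Raises ValueError if a reported count differs from the reference or
    falls outside the range 0..row width. Returns the reference counts.
    """
    expected = count_valid_chars(text_rows)
    rows = zip(text_rows, expected, reported_counts)
    for index, (row, want, got) in enumerate(rows):
        if got != want or not 0 <= got <= len(row):
            raise ValueError(
                f"row {index}: reported count {got}, "
                f"expected {want} (row width {len(row)})"
            )
    return expected
```

In the head's `forward`, each `text` tensor and the matching `char_nums` entry could be converted with `.numpy().tolist()` and passed to `check_char_nums` before `max_num` is computed. A count far larger than the row width would confirm that the cast, rather than the data, is producing the oversized `paddle.zeros` shape.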

github-actions · Answer 6 · Sun Jul 02 2023 10:14:25 GMT+0800 (China Standard Time)

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.