PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)


CUDA error during KIE training: an illegal memory access was encountered.

wangpf09 opened this issue

Please provide the following information to quickly locate the problem

Training fails with both the wildreceipt dataset and our own dataset. The error persists even after reducing batch_size to 1.

  • System Environment: Windows 11, GPU: 3060
  • Version: Paddle: 2.3.0.post112, PaddleOCR: 2.5.0.3. Related components:
  • Command Code: python ./train.py -c ../configs/kie/kie_unet_sdmgr.yml -o Global.pretrained_model=../pretrained_model/kie_vgg16/best_accuracy.pdparams
  • Complete Error Message:
W0625 09:48:01.554232 15140 gpu_context.cc:306] device: 0, cuDNN Version: 8.2.
[2022/06/25 09:48:04] ppocr INFO: load pretrain successful from ../pretrained_model/kie_vgg16/best_accuracy
[2022/06/25 09:48:04] ppocr INFO: train dataloader has 1267 iters
[2022/06/25 09:48:04] ppocr INFO: valid dataloader has 472 iters
[2022/06/25 09:48:04] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 80 iterations
Traceback (most recent call last):
  File "E:\project-space\ocr-tools\PaddleOCR\tools\train.py", line 191, in <module>
    main(config, device, logger, vdl_writer)
  File "E:\project-space\ocr-tools\PaddleOCR\tools\train.py", line 164, in main
    program.train(config, train_dataloader, valid_dataloader, device, model,
  File "E:\project-space\ocr-tools\PaddleOCR\tools\program.py", line 264, in train
    preds = model(batch)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "E:\project-space\ocr-tools\PaddleOCR\ppocr\modeling\architectures\base_model.py", line 85, in forward
    x = self.head(x, targets=data)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "E:\project-space\ocr-tools\PaddleOCR\ppocr\modeling\heads\kie_sdmgr_head.py", line 90, in forward
    nodes = self.fusion([x, nodes])
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "E:\project-space\ocr-tools\PaddleOCR\ppocr\modeling\heads\kie_sdmgr_head.py", line 189, in forward
    z = F.normalize(z)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\nn\functional\norm.py", line 88, in normalize
    eps = fluid.dygraph.base.to_variable([epsilon], dtype=x.dtype)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\wrapped_decorator.py", line 25, in __impl__
    return wrapped_func(*args, **kwargs)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\framework.py", line 434, in __impl__
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\base.py", line 763, in to_variable
    py_var = core.VarBase(
OSError: (External) CUDA error(700), an illegal memory access was encountered. 
  [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:258)


Process finished with exit code 1
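
The traceback above ends inside F.normalize, but CUDA reports kernel faults asynchronously, so the Python frame shown may not be the operation that actually faulted. Below is a minimal sketch, not part of the original report, that prepares a diagnostic re-run of the same command with CUDA_LAUNCH_BLOCKING=1 (a standard CUDA runtime variable that makes kernel launches synchronous). It can optionally add Global.use_gpu=False for a CPU comparison run. The helper name build_debug_run is hypothetical, and the paths are the reporter's and will differ on other machines.

```python
import os
import sys

# Arguments copied from the command in the report; adjust the paths locally.
TRAIN_ARGS = [
    "./train.py",
    "-c", "../configs/kie/kie_unet_sdmgr.yml",
    "-o", "Global.pretrained_model=../pretrained_model/kie_vgg16/best_accuracy.pdparams",
]


def build_debug_run(use_gpu=True, base_env=None):
    """Return (command, env) for a diagnostic re-run of the training command.

    build_debug_run is a hypothetical helper written for this thread; it only
    assembles the command and environment and does not start training.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Synchronous kernel launches make the error surface at the faulting call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"

    command = [sys.executable] + TRAIN_ARGS
    if not use_gpu:
        # Extra key=value overrides are appended after the existing -o flag.
        command.append("Global.use_gpu=False")
    return command, env


command, env = build_debug_run(use_gpu=True)
print(" ".join(command))
```

To actually launch the run, the returned pair can be passed to subprocess.run(command, env=env). Expect training to be noticeably slower with synchronous launches, so this is only worth enabling while hunting for the faulting operation.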

I ran into this problem on Windows too. Switching to WSL fixed it. It looks like an issue between the Windows driver and paddlepaddle-gpu.

#6533

Training on the CPU works fine for me, even with the wildreceipt dataset, but training on the GPU always fails. I'll give WSL a try.

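I'm in exactly the same situation: the error occurs at the same place, and CPU training works for me too. In any case, everything worked normally once I switched to WSL.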

This is frustrating. The maintainers haven't done anything about this issue either.

commented

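I hit the same error. It seems to come from this statement around line 60 of kie_sdmgr_head.py:
char_nums.append(paddle.sum((text > -1).astype(int), axis=-1))
When Paddle casts bool to int here, True becomes a very large number instead of 1.
As a result, the call that follows,
paddle.concat(
[text, paddle.zeros(
(text.shape[0], max_num - text.shape[1]))], -1)
tries to create a huge array and runs out of memory.
The problem occurs with paddlepaddle-gpu==2.3.2 and CUDA 11.0 on an A100.
After switching to a V100 with CUDA 10.2, the error no longer appears.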
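
One way to test the hypothesis above is to compare the GPU-computed counts against a plain-Python reference before the padding step runs. The following is a minimal sketch under that assumption, not the PaddleOCR implementation. The helper names count_valid_chars and check_char_nums are hypothetical, and the inputs are ordinary nested lists (tensor values would first have to be copied to the host and converted to lists).

```python
def count_valid_chars(text_rows):
    """Count entries greater than -1 in each row.

    Pure-Python reference for the quantity that
    paddle.sum((text > -1).astype(int), axis=-1) is meant to compute.
    """
    return [sum(1 for token in row if token > -1) for row in text_rows]


def check_char_nums(char_nums, text_rows):
    """Validate per-row counts and return the largest one.

    A correct bool-to-int cast can never yield a count outside
    [0, row length], so anything else points at the cast.
    """
    for index, (count, row) in enumerate(zip(char_nums, text_rows)):
        if not 0 <= count <= len(row):
            raise ValueError(
                f"row {index}: char count {count} outside [0, {len(row)}]"
            )
    return max(char_nums, default=0)


# Two padded rows, with -1 marking padding positions.
rows = [[5, 12, -1, -1], [7, -1, -1, -1]]
counts = count_valid_chars(rows)
max_num = check_char_nums(counts, rows)
print(counts, max_num)  # [2, 1] 2
```

If the counts coming from the GPU fail this check while the reference counts pass, that would support the explanation that the cast, rather than the data, produces the oversized allocation.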

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.