mindspore-lab / mindocr

A toolbox of OCR models, algorithms, and pipelines based on MindSpore

Home Page: https://mindspore-lab.github.io/mindocr/

NPU error during training

meiyubin opened this issue · comments

  • C++ Call Stack: (For framework developers)

mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_data_queue.cc:269 QueryQueueSize

Line of code : 81
File : mindspore/ccsrc/minddata/dataset/util/task.cc

[ERROR] MD(41,ffe9f2ffd1f0,python):2023-08-21-06:23:16.639.874 [mindspore/ccsrc/minddata/dataset/util/task_manager.cc:223] InterruptMaster] Task is terminated with err msg (more details are in info level logs): Exception thrown from dataset pipeline. Refer to 'Dataset Pipeline Error Message'. Unable to query real-time size of Mbuf channel: eea1e8a8-3fea-11ee-b4c5-6ce874fff76f, error code: 507899


  • Ascend Error Message:

EL9999: Inner Error!
EL9999 [drv api] halQueueQueryInfo failed: deviceId=0, qid=1, drvRetCode=7.[FUNC:MemQueueQueryInfo][FILE:npu_driver.cc][LINE:3238]
TraceBack (most recent call last):
rtMemQueueQueryInfo execute failed, reason=[driver error:internal error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
[Call][Rts]call rtMemQueueQueryInfo failed, device is 0, qid is 1[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)

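Running the command python tools/train.py --config configs/det/dbnet/db++_r50_icdar15.yaml fails with the error above. I am using the ReCTS dataset and have already generated det_gt.txt with the command from rects.md.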

root@manas-train-node01:/data/mindocr# python -c "import mindspore;mindspore.set_context(device_target='Ascend');mindspore.run_check()"
MindSpore version: 2.0.0rc1
The result of multiplication calculation is correct, MindSpore has been installed on platform [Ascend] successfully!

Start training... (The first epoch takes longer, please wait...)

[WARNING] MD(45501,fff200ff91f0,python):2023-08-21-08:14:34.658.871 [mindspore/ccsrc/minddata/dataset/engine/datasetops/data_queue_op.cc:903] DetectPerBatchTime] Bad performance attention, it takes more than 25 seconds to fetch a batch of data from dataset pipeline, which might result GetNext timeout problem. You may test dataset processing performance(with creating dataset iterator) and optimize it.
[ERROR] MD(45501,fff200ff91f0,python):2023-08-21-08:14:40.677.158 [mindspore/ccsrc/minddata/dataset/util/task_manager.cc:223] InterruptMaster] Task is terminated with err msg (more details are in info level logs): Exception thrown from dataset pipeline. Refer to 'Dataset Pipeline Error Message'. Unable to query real-time size of Mbuf channel: 8068338c-3ffa-11ee-8db2-6ce874fff76f, error code: 507899


  • Ascend Error Message:

EL9999: Inner Error!
EL9999 [drv api] halQueueQueryInfo failed: deviceId=0, qid=1, drvRetCode=7.[FUNC:MemQueueQueryInfo][FILE:npu_driver.cc][LINE:3238]
TraceBack (most recent call last):
rtMemQueueQueryInfo execute failed, reason=[driver error:internal error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
[Call][Rts]call rtMemQueueQueryInfo failed, device is 0, qid is 1[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)


  • C++ Call Stack: (For framework developers)

mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_data_queue.cc:269 QueryQueueSize

Line of code : 81
File : mindspore/ccsrc/minddata/dataset/util/task.cc

[WARNING] DEVICE(45501,fff2ab7fe1f0,python):2023-08-21-08:15:14.174.718 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:742] GetDumpPath] The environment variable 'MS_OM_PATH' is not set, the files of node dump will save to the process local path, as ./rank_id/node_dump/...
[WARNING] DEVICE(45501,fff2ab7fe1f0,python):2023-08-21-08:15:14.204.436 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:1390] DeleteDumpFile] Delete dir /data/mindocr/rank_0/node_dump/GetNext.GetNext-op866.0.1.1692604994242540.output.0.DefaultFormat.npy failed!
[WARNING] DEVICE(45501,fff2ab7fe1f0,python):2023-08-21-08:15:14.214.124 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:1390] DeleteDumpFile] Delete dir /data/mindocr/rank_0/node_dump/GetNext.GetNext-op866.0.1.1692604994580026.output.3.DefaultFormat.npy failed!
[WARNING] DEVICE(45501,fff2ab7fe1f0,python):2023-08-21-08:15:14.224.038 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:1390] DeleteDumpFile] Delete dir /data/mindocr/rank_0/node_dump/GetNext.GetNext-op866.0.1.1692604994511768.output.2.DefaultFormat.npy failed!
[WARNING] DEVICE(45501,fff2ab7fe1f0,python):2023-08-21-08:15:14.233.914 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:1390] DeleteDumpFile] Delete dir /data/mindocr/rank_0/node_dump/GetNext.GetNext-op866.0.1.1692604994443519.output.1.DefaultFormat.npy failed!
[WARNING] DEVICE(45501,fff2ab7fe1f0,python):2023-08-21-08:15:14.243.362 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:1390] DeleteDumpFile] Delete dir /data/mindocr/rank_0/node_dump/GetNext.GetNext-op866.0.1.1692604994650171.output.4.DefaultFormat.npy failed!
[ERROR] DEVICE(45501,fff2ab7fe1f0,python):2023-08-21-08:15:14.243.473 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:760] DumpTaskExceptionInfo] Task fail infos task_id: 2, stream_id: 3, tid: 45501, device_id: 0, retcode: 507011 ( model execute failed)
[WARNING] DEVICE(45501,fff2ab7fe1f0,python):2023-08-21-08:15:14.246.409 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:769] DumpTaskExceptionInfo] Dump task error infos (input/output's value) for node:[Default/GetNext-op866], save path: ./rank_0/node_dump,
The function call stack:
In file /usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/dataset_helper.py:94/ outputs = self.get_next()/

[WARNING] DEVICE(45501,fff2ab7fe1f0,python):2023-08-21-08:15:14.246.438 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:776] DumpTaskExceptionInfo] GetNext error may be caused by slow data processing (bigger than 20s / batch) or transfer data to device error.
[WARNING] DEVICE(45501,fff2ab7fe1f0,python):2023-08-21-08:15:14.246.453 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:778] DumpTaskExceptionInfo] Suggestion:
[WARNING] DEVICE(45501,fff2ab7fe1f0,python):2023-08-21-08:15:14.246.467 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:779] DumpTaskExceptionInfo] 1) Set the parameter dataset_sink_mode=False of model.train(...) or model.eval(...) and try again.
[WARNING] DEVICE(45501,fff2ab7fe1f0,python):2023-08-21-08:15:14.246.480 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:781] DumpTaskExceptionInfo] 2) Reduce the batch_size in data processing and try again.
[WARNING] DEVICE(45501,fff2ab7fe1f0,python):2023-08-21-08:15:14.246.493 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:782] DumpTaskExceptionInfo] 3) You can create iterator by interface create_dict_iterator() of dataset class to independently verify the performance of data processing without training. Refer to the link for data processing optimization suggestions: https://mindspore.cn/tutorials/experts/zh-CN/master/dataset/optimize.html
[WARNING] DEVICE(45501,fff2ab7fe1f0,python):2023-08-21-08:15:14.246.506 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:786] DumpTaskExceptionInfo] 4) If it is a dynamic dataset, please set the input to dynamic through set_inputs, or set sink_size to 1. It is recommended to use the former, because the latter has poor performance.
[WARNING] DEVICE(45501,fff2ab7fe1f0,python):2023-08-21-08:15:14.718.905 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:1116] RunTask] Destroy tdt channel failed.
Traceback (most recent call last):
  File "tools/train.py", line 312, in <module>
    main(config)
  File "tools/train.py", line 248, in main
    initial_epoch=start_epoch,
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 1049, in train
    initial_epoch=initial_epoch)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 100, in wrapper
    func(self, *args, **kwargs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 598, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 681, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 620, in __call__
    out = self.compile_and_run(*args, **kwargs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 942, in compile_and_run
    return _cell_graph_executor(self, *new_args, phase=self.phase)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 1439, in __call__
    return self.run(obj, *args, phase=phase)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 1478, in run
    return self._exec_pip(obj, *args, phase=phase_real)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 102, in wrapper
    results = fn(*arg, **kwargs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 1458, in _exec_pip
    return self._graph_executor(args, phase)
RuntimeError: Run task for graph:kernel_graph_1 error! The details refer to 'Ascend Error Message'.


  • Ascend Error Message:

EL9999: Inner Error!
EL9999 [drv api] halQueueDestroy failed: deviceId=0, qid=1, drvRetCode=7.[FUNC:MemQueueDestroy][FILE:npu_driver.cc][LINE:2996]
TraceBack (most recent call last):
rtMemQueueDestroy execute failed, reason=[driver error:internal error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
[Call][Rts]call rts api [rtMemQueueDestroy] failed, retCode is 507899[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
E39999: Inner Error!
E39999 Aicpu kernel execute failed, device_id=0, stream_id=3, task_id=2, errorCode=91.[FUNC:PrintAicpuErrorInfo][FILE:task.cc][LINE:862]
TraceBack (most recent call last):
Aicpu kernel execute failed, device_id=0, stream_id=3, task_id=2, fault op_name=[FUNC:GetError][FILE:stream.cc][LINE:1133]
rtStreamSynchronize execute failed, reason=[the model stream execute failed][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]

(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
EL9999: Inner Error!
EL9999 [drv api] halQueueQueryInfo failed: deviceId=0, qid=1, drvRetCode=7.[FUNC:MemQueueQueryInfo][FILE:npu_driver.cc][LINE:3238]
TraceBack (most recent call last):
rtMemQueueQueryInfo execute failed, reason=[driver error:internal error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
[Call][Rts]call rtMemQueueQueryInfo failed, device is 0, qid is 1[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)


  • C++ Call Stack: (For framework developers)

mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_graph_executor.cc:256 RunGraph

[WARNING] MD(45501,ffffa20461a0,python):2023-08-21-08:15:21.402.701 [mindspore/ccsrc/minddata/dataset/engine/datasetops/data_queue_op.cc:115] ~DataQueueOp] preprocess_batch: 11; batch_queue: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0; push_start_time: 2023-08-21-08:13:04.696.711, 2023-08-21-08:13:06.260.506, 2023-08-21-08:13:10.447.333, 2023-08-21-08:13:13.348.705, 2023-08-21-08:14:34.658.753, 2023-08-21-08:14:34.897.697, 2023-08-21-08:14:35.244.371, 2023-08-21-08:14:36.841.525, 2023-08-21-08:14:37.574.301, 2023-08-21-08:14:39.307.348; push_end_time: 2023-08-21-08:13:04.732.580, 2023-08-21-08:13:06.297.322, 2023-08-21-08:13:10.485.841, 2023-08-21-08:13:13.386.329, 2023-08-21-08:14:34.714.893, 2023-08-21-08:14:34.937.644, 2023-08-21-08:14:35.290.069, 2023-08-21-08:14:36.880.453, 2023-08-21-08:14:37.611.451, 2023-08-21-08:14:39.345.333.
terminate called after throwing an instance of 'std::runtime_error'
what(): Failed to destroy channel for tdt queue. The details refer to 'Ascend Error Message'.


  • Ascend Error Message:

EL9999: Inner Error!
EL9999 [drv api] halQueueDestroy failed: deviceId=0, qid=1, drvRetCode=7.[FUNC:MemQueueDestroy][FILE:npu_driver.cc][LINE:2996]
TraceBack (most recent call last):
rtMemQueueDestroy execute failed, reason=[driver error:internal error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
[Call][Rts]call rts api [rtMemQueueDestroy] failed, retCode is 507899[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
EL9999: Inner Error!
EL9999 [drv api] halQueueDestroy failed: deviceId=0, qid=1, drvRetCode=7.[FUNC:MemQueueDestroy][FILE:npu_driver.cc][LINE:2996]
TraceBack (most recent call last):
rtMemQueueDestroy execute failed, reason=[driver error:internal error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
[Call][Rts]call rts api [rtMemQueueDestroy] failed, retCode is 507899[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
E39999: Inner Error!
E39999 Aicpu kernel execute failed, device_id=0, stream_id=3, task_id=2, errorCode=91.[FUNC:PrintAicpuErrorInfo][FILE:task.cc][LINE:862]
TraceBack (most recent call last):
Aicpu kernel execute failed, device_id=0, stream_id=3, task_id=2, fault op_name=[FUNC:GetError][FILE:stream.cc][LINE:1133]
rtStreamSynchronize execute failed, reason=[the model stream execute failed][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]

(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
EL9999: Inner Error!
EL9999 [drv api] halQueueQueryInfo failed: deviceId=0, qid=1, drvRetCode=7.[FUNC:MemQueueQueryInfo][FILE:npu_driver.cc][LINE:3238]
TraceBack (most recent call last):
rtMemQueueQueryInfo execute failed, reason=[driver error:internal error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
[Call][Rts]call rtMemQueueQueryInfo failed, device is 0, qid is 1[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)


  • C++ Call Stack: (For framework developers)

mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_data_queue.cc:250 ~AscendTdtQueue
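
The log's third suggestion is to check data-processing speed on its own, without training, by iterating the dataset through create_dict_iterator(). A minimal sketch of such a check follows. The helper names time_batches and slow_batches are hypothetical, and the generator is only a stand-in for a real pipeline. The 20-second threshold is the figure quoted in the GetNext warning above.

```python
import time
from typing import Any, Iterable, List


def time_batches(batches: Iterable[Any], max_batches: int = 20) -> List[float]:
    """Return the seconds spent waiting for each of the first max_batches items."""
    durations = []
    iterator = iter(batches)
    for _ in range(max_batches):
        start = time.perf_counter()
        try:
            next(iterator)  # blocks until the pipeline yields a batch
        except StopIteration:
            break  # pipeline exhausted before max_batches was reached
        durations.append(time.perf_counter() - start)
    return durations


def slow_batches(durations: List[float], threshold_s: float = 20.0) -> List[int]:
    """Indices of batches that took longer than threshold_s to fetch."""
    return [i for i, d in enumerate(durations) if d > threshold_s]


# Stand-in for a real pipeline (assumption). With MindSpore one would
# instead pass something like dataset.create_dict_iterator(num_epochs=1).
fake_pipeline = ({"image": i} for i in range(5))
durations = time_batches(fake_pipeline, max_batches=3)
print(len(durations), "batches timed; slow ones:", slow_batches(durations))
```

If some batches take far longer than others, as the push_start_time gap between 08:13:13 and 08:14:34 in the log suggests, the data pipeline rather than the model is the likelier place to look.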

Hello, since the ReCTS dataset has no validation set, you need to set val_while_train to False.
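
The change can be made by editing the YAML config by hand. As an illustration only, the sketch below applies the same change to a config that has already been loaded into a Python dict. The helper name disable_val_while_train and the layout of example_config are assumptions, not taken from the actual db++_r50_icdar15.yaml. The helper searches the whole nested structure, so it does not depend on where the key sits.

```python
import copy
from typing import Any, Dict


def disable_val_while_train(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of config with every val_while_train key set to False."""
    patched = copy.deepcopy(config)  # leave the caller's config untouched

    def walk(node: Any) -> None:
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "val_while_train":
                    node[key] = False  # overwriting a value keeps dict size stable
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(patched)
    return patched


# Hypothetical config layout, for illustration only.
example_config = {
    "system": {"mode": 0, "val_while_train": True},
    "train": {"dataset_sink_mode": True},
}
patched = disable_val_while_train(example_config)
print(patched["system"]["val_while_train"])
```

With PyYAML available, the dict would typically come from yaml.safe_load on the config file and be written back with yaml.safe_dump. Note that a round trip through the loader drops any comments in the file.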