mindspore-lab / mindocr

A toolbox of OCR models and algorithms based on MindSpore

Home Page: https://mindspore-lab.github.io/mindocr/


DBNet trains very slowly on Ascend 910B3

zx214 opened this issue

1. Environment: Ascend 910B3, MindSpore 2.2.0, CANN 7.0.RC1; training set of 1012 images, test set of 253 images, batch size 8.
2. Training log linked below:
dbnet_log.log
3. YAML config file:
system:
  mode: 0 # 0 for graph mode, 1 for pynative mode in MindSpore
  distribute: False
  amp_level: 'O0'
  seed: 42
  log_interval: 100
  val_while_train: True
  drop_overflow_update: False
  ckpt_max_keep: 0

common:
  num_epochs: &num_epochs 1
  batch_size: &batch_size 1
  num_workers: &num_workers 8
  ckpt_save_dir: &ckpt_save_dir './tmp_det'
  ckpt_load_path: &ckpt_load_path ./tmp_det/best.ckpt
  dataset_root: &dataset_root ./dataset_ic15/det
  resume: &resume True
  lr: &lr 0.007
  resume_epochs: 0
  data_shape: (736,1280)

model:
  type: det
  transform: null
  backbone:
    name: det_resnet50
    pretrained: True
  neck:
    name: DBFPN
    out_channels: 256
    bias: False
  head:
    name: DBHead
    k: 50
    bias: False
    adaptive: True
  resume: True
  pretrained: False

postprocess:
  name: DBPostprocess
  box_type: quad # whether to output a polygon or a box
  binary_thresh: 0.3 # binarization threshold
  box_thresh: 0.6 # box score threshold
  max_candidates: 1000
  expand_ratio: 1.5 # coefficient for expanding predictions

metric:
  name: DetMetric
  main_indicator: f-score

loss:
  name: DBLoss
  eps: 1.0e-6
  l1_scale: 10
  bce_scale: 5
  bce_replace: bceloss

scheduler:
  scheduler: polynomial_decay
  lr: *lr
  num_epochs: *num_epochs
  decay_rate: 0.9
  warmup_epochs: 3

optimizer:
  opt: SGD
  filter_bias_and_bn: false
  momentum: 0.9
  weight_decay: 1.0e-4

# only used for mixed precision training
loss_scaler:
  type: dynamic
  loss_scale: 512
  scale_factor: 2
  scale_window: 1000

train:
  ckpt_save_dir: *ckpt_save_dir
  dataset_sink_mode: False
  gradient_accumulation_steps: 1
  dataset:
    type: DetDataset
    dataset_root: *dataset_root
    data_dir: train/images
    label_file: train_gt.txt
    sample_ratio: 1.0
    transform_pipeline:
      - DecodeImage:
          img_mode: RGB
          to_float32: False
      - DetLabelEncode:
      - RandomColorAdjust:
          brightness: 0.1255 # 32.0 / 255
          saturation: 0.5
      - RandomHorizontalFlip:
          p: 0.5
      - RandomRotate:
          degrees: [ -10, 10 ]
          expand_canvas: False
          p: 1.0
      - RandomScale:
          scale_range: [ 0.5, 3.0 ]
          p: 1.0
      - RandomCropWithBBox:
          max_tries: 10
          min_crop_ratio: 0.1
          crop_size: [ 640, 640 ]
          p: 1.0
      - ValidatePolygons:
      - ShrinkBinaryMap:
          min_text_size: 8
          shrink_ratio: 0.4
      - BorderMap:
          shrink_ratio: 0.4
          thresh_min: 0.3
          thresh_max: 0.7
      - NormalizeImage:
          bgr_to_rgb: False
          is_hwc: True
          mean: imagenet
          std: imagenet
      - ToCHWImage:
    # the order of the dataloader list, matching the network input and the input labels for the loss function, and optional data for debug/visualize
    output_columns: [ 'image', 'binary_map', 'mask', 'thresh_map', 'thresh_mask' ] #'img_path']
    #output_columns: ['image'] # for debug op performance
    net_input_column_index: [0] # input indices for network forward func in output_columns
    label_column_index: [1, 2, 3, 4] # input indices marked as label

  loader:
    shuffle: True
    batch_size: *batch_size
    drop_remainder: True
    num_workers: *num_workers

eval:
  ckpt_load_path: tmp_det/best.ckpt
  dataset_sink_mode: False
  dataset:
    type: DetDataset
    dataset_root: *dataset_root
    data_dir: val/images
    label_file: test_gt.txt
    sample_ratio: 1.0
    transform_pipeline:
      - DecodeImage:
          img_mode: RGB
          to_float32: False
      - DetLabelEncode:
      - DetResize: # GridResize 32
          target_size: [ 736, 1280 ]
          keep_ratio: False
          limit_type: none
          divisor: 32
      - NormalizeImage:
          bgr_to_rgb: False
          is_hwc: True
          mean: imagenet
          std: imagenet
      - ToCHWImage:
    # the order of the dataloader list, matching the network input and the labels for evaluation
    output_columns: [ 'image', 'polys', 'ignore_tags', 'shape_list' ]
    net_input_column_index: [0] # input indices for network forward func in output_columns
    label_column_index: [1, 2] # input indices marked as label

  loader:
    shuffle: False
    batch_size: 1 # TODO: due to dynamic shape of polygons (num of boxes varies), BS has to be 1
    drop_remainder: False
    num_workers: 2

4. I have narrowed the slowness down to two parts: (1) compiling the kernel_meta information takes about an hour or more; (2) training itself is slow. How can these be resolved?

@zx214 Hello, and thanks for the feedback.

  1. A network compilation time of over an hour is not expected.
    We recommend upgrading MindSpore to r2.2.12 or r2.2.11 and installing the matching Ascend CANN package. Compared with r2.2.0, r2.2.12 fixes several bugs and includes performance optimizations. On the same hardware, training DBNet ResNet-50 with MindSpore r2.2.11, compilation takes roughly 1-2 minutes.

  2. On slow training:
    First, note that your num_epochs is 1; the first epoch is significantly slower than later ones because of graph compilation and loading.
    Second, set train.dataset_sink_mode to True; data sink mode can significantly improve training throughput.
    Third, if your CPU has weak single-core performance, try setting train.dataset.use_minddata to True, or moderately increase train.loader.num_workers (setting it too high backfires), to speed up data preprocessing.
    Finally, you can try raising system.amp_level to O2 or O3, which moves part of the computation to fp16, a mode with better affinity for Ascend hardware. Some networks then become harder to converge (or fail to converge entirely), so hyperparameters and options such as train.clip_grad and train.clip_norm need to be tuned accordingly. The relevant config keys are sketched below.
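A minimal sketch of the config keys mentioned above, merged into your YAML (values are illustrative starting points, not tuned recommendations; use_minddata is assumed to be supported by your MindOCR version):

```yaml
system:
  amp_level: 'O2'            # or 'O3'; watch convergence, tune clip_grad/clip_norm if needed

train:
  dataset_sink_mode: True    # data sink mode: feed batches to the device without per-step host sync
  dataset:
    use_minddata: True       # assumption: offloads part of the transform pipeline to MindData
  loader:
    num_workers: 8           # increase moderately; too many workers degrades throughput
```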

For reference, here is our log from training DBNet ResNet-50 with MindSpore r2.2.11 on the same hardware:

(MindSpore) [ma-user mindocr-main]$DEVICE_ID=2 python tools/train.py -c configs/det/dbnet/db_r50_icdar15.yaml
[2024-03-21 16:42:06] mindocr.train INFO - Standalone training. Device id: 2, specified by environment variable 'DEVICE_ID'.
[2024-03-21 16:42:11] mindocr.data.builder INFO - Creating dataloader (training=True) for device 0. Number of data samples: 1000 per device (1000 total).
[2024-03-21 16:42:39] mindocr.models.backbones.mindcv_models.utils INFO - Finish loading model checkpoint from: /home/ma-user/.mindspore/models/resnet50-e0733ab8.ckpt
[2024-03-21 16:42:39] mindocr.utils.loss_scaler INFO - gradient_accumulation_steps > 1, scale_factor and scale_window are adjusted accordingly for dynamic loss scaler
[2024-03-21 16:42:39] mindocr.optim.param_grouping INFO - no parameter grouping is applied.
[2024-03-21 16:42:48] mindocr.train INFO -
========================================
Distribute: False
Model: det_resnet50-DBFPN-DBHead
Total number of parameters: 25463362
Total number of trainable parameters: 25409730
Data root: /home/ma-user/work/evaluation-sh/MindOCR/mindocr-main/dataset/
Optimizer: SGD
Weight decay: 0.0001
Batch size: 10
Num devices: 1
Gradient accumulation steps: 2
Global batch size: 10x1x2=20
LR: 0.007
Scheduler: polynomial_decay
Steps per epoch: 100
Num epochs: 1200
Clip gradient: False
EMA: False
AMP level: O0
Loss scaler: {'type': 'dynamic', 'loss_scale': 512, 'scale_factor': 2, 'scale_window': 1000}
Drop overflow update: False
========================================

Start training... (The first epoch takes longer, please wait...)

[2024-03-21 16:44:03] mindocr.utils.callbacks INFO - epoch: [1/1200], loss: 5.056544, epoch time: 74.750 s, per step time: 747.501 ms, fps per card: 13.38 img/s
[2024-03-21 16:44:03] mindocr.utils.callbacks INFO - => Best train_loss: 5.056544303894043, checkpoint saved.
[2024-03-21 16:44:22] mindocr.utils.callbacks INFO - epoch: [2/1200], loss: 4.930840, epoch time: 15.967 s, per step time: 159.667 ms, fps per card: 62.63 img/s
[2024-03-21 16:44:22] mindocr.utils.callbacks INFO - => Best train_loss: 4.930840015411377, checkpoint saved.
[2024-03-21 16:44:41] mindocr.utils.callbacks INFO - epoch: [3/1200], loss: 4.721155, epoch time: 15.963 s, per step time: 159.631 ms, fps per card: 62.64 img/s
[2024-03-21 16:44:42] mindocr.utils.callbacks INFO - => Best train_loss: 4.721155166625977, checkpoint saved.
[2024-03-21 16:45:00] mindocr.utils.callbacks INFO - epoch: [4/1200], loss: 4.198533, epoch time: 15.969 s, per step time: 159.689 ms, fps per card: 62.62 img/s
[2024-03-21 16:45:01] mindocr.utils.callbacks INFO - => Best train_loss: 4.198533058166504, checkpoint saved.
[2024-03-21 16:45:19] mindocr.utils.callbacks INFO - epoch: [5/1200], loss: 4.210309, epoch time: 15.959 s, per step time: 159.593 ms, fps per card: 62.66 img/s
[2024-03-21 16:45:38] mindocr.utils.callbacks INFO - epoch: [6/1200], loss: 3.534603, epoch time: 15.959 s, per step time: 159.588 ms, fps per card: 62.66 img/s
[2024-03-21 16:45:38] mindocr.utils.callbacks INFO - => Best train_loss: 3.5346031188964844, checkpoint saved.
[2024-03-21 16:45:57] mindocr.utils.callbacks INFO - epoch: [7/1200], loss: 3.849508, epoch time: 15.965 s, per step time: 159.652 ms, fps per card: 62.64 img/s
[2024-03-21 16:46:15] mindocr.utils.callbacks INFO - epoch: [8/1200], loss: 3.630200, epoch time: 15.958 s, per step time: 159.576 ms, fps per card: 62.67 img/s
[2024-03-21 16:46:34] mindocr.utils.callbacks INFO - epoch: [9/1200], loss: 3.526414, epoch time: 15.963 s, per step time: 159.625 ms, fps per card: 62.65 img/s
[2024-03-21 16:46:34] mindocr.utils.callbacks INFO - => Best train_loss: 3.526414394378662, checkpoint saved.
[2024-03-21 16:46:53] mindocr.utils.callbacks INFO - epoch: [10/1200], loss: 3.429994, epoch time: 15.957 s, per step time: 159.572 ms, fps per card: 62.67 img/s
[2024-03-21 16:46:53] mindocr.utils.callbacks INFO - => Best train_loss: 3.4299941062927246, checkpoint saved.
[2024-03-21 16:47:12] mindocr.utils.callbacks INFO - epoch: [11/1200], loss: 2.924218, epoch time: 15.966 s, per step time: 159.657 ms, fps per card: 62.63 img/s
[2024-03-21 16:47:12] mindocr.utils.callbacks INFO - => Best train_loss: 2.92421817779541, checkpoint saved.
[2024-03-21 16:47:31] mindocr.utils.callbacks INFO - epoch: [12/1200], loss: 3.281451, epoch time: 15.956 s, per step time: 159.565 ms, fps per card: 62.67 img/s
[2024-03-21 16:47:49] mindocr.utils.callbacks INFO - epoch: [13/1200], loss: 3.589340, epoch time: 15.958 s, per step time: 159.577 ms, fps per card: 62.67 img/s
[2024-03-21 16:48:07] mindocr.utils.callbacks INFO - epoch: [14/1200], loss: 3.343097, epoch time: 15.958 s, per step time: 159.584 ms, fps per card: 62.66 img/s
[2024-03-21 16:48:26] mindocr.utils.callbacks INFO - epoch: [15/1200], loss: 3.465981, epoch time: 15.956 s, per step time: 159.561 ms, fps per card: 62.67 img/s
[2024-03-21 16:48:44] mindocr.utils.callbacks INFO - epoch: [16/1200], loss: 3.380215, epoch time: 15.960 s, per step time: 159.596 ms, fps per card: 62.66 img/s
[2024-03-21 16:49:03] mindocr.utils.callbacks INFO - epoch: [17/1200], loss: 3.600936, epoch time: 15.959 s, per step time: 159.589 ms, fps per card: 62.66 img/s
[2024-03-21 16:49:21] mindocr.utils.callbacks INFO - epoch: [18/1200], loss: 3.199931, epoch time: 15.958 s, per step time: 159.581 ms, fps per card: 62.66 img/s
[2024-03-21 16:49:40] mindocr.utils.callbacks INFO - epoch: [19/1200], loss: 3.428349, epoch time: 15.957 s, per step time: 159.572 ms, fps per card: 62.67 img/s
[2024-03-21 16:49:59] mindocr.utils.callbacks INFO - epoch: [20/1200], loss: 3.225470, epoch time: 15.958 s, per step time: 159.576 ms, fps per card: 62.67 img/s

Thanks, I'll give it a try.

Hello, I installed MindSpore 2.2.12 and CANN 7.0.0.beta1 following the official instructions, and with the unmodified source code the model compilation time is fine. For my task I added summary recording and modified the data-loading part, leaving the model untouched, yet compilation still takes very long. What could be slowing down model compilation?

@zx214 Hello. From your description, the long delay seems to be between launching the Python script and the start of training computation, which is not necessarily MindSpore's static-graph compilation time.

To isolate the code that is actually taking the time, consider the following:

  1. Strip the newly added summary and data-loading code out of the project and unit-test it in isolation to measure its runtime;
  2. In the project code (tools/main.py), use Python's time module to time each stage, as sketched below.
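A minimal sketch of the timing instrumentation meant in step 2 (the stage names and functions in the comments are hypothetical placeholders for whatever tools/main.py actually calls):

```python
import time

def timed(label, fn, *args, **kwargs):
    # Run fn, print its wall-clock duration, and pass its result through.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"[timing] {label}: {time.perf_counter() - start:.2f} s")
    return result

# Hypothetical usage inside tools/main.py -- wrap each stage:
#   cfg     = timed("parse config", parse_config, args)
#   loader  = timed("build dataloader", build_dataloader, cfg)
#   network = timed("build network", build_network, cfg)
#   timed("train", trainer.run, cfg)
```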

Based on your description, possible causes of the long runtime include:

  1. The summary is writing too much redundant information, which takes a long time;
  2. Data loading is inefficient, e.g. individual samples in the dataset have very high resolution, or disk read/write speed is low;
  3. MindSpore graph mode spends a relatively long time on the first graph compilation; subsequent runs reuse the previously generated compilation cache. If you manually delete the cache before every run, compilation will be slow every time. A sketch of enabling the cache follows this list.
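As a sketch of item 3, assuming the enable_compile_cache / compile_cache_path options of mindspore.set_context (available in recent 2.x releases; please check the docs for your exact version):

```python
import mindspore as ms

# Persist the graph-compilation cache across runs so that later runs
# reuse it instead of recompiling from scratch (assumed set_context options).
ms.set_context(mode=ms.GRAPH_MODE,
               enable_compile_cache=True,
               compile_cache_path="./compile_cache")
```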

OK, I'll track it down. One more question: in 2-card training I get logs like the following. What causes this, and how can it be fixed?
[screenshot of error log attached]

@zx214 Hello. Regarding the error log you posted: since the MindSpore source code does not contain that error message, the problem is probably not introduced by MindSpore. Please check your Ascend environment configuration, including whether the hardware is installed properly and whether the CANN package is installed correctly (for example, whether the te and hccl whl packages were installed after installing CANN). A quick import check is sketched below.
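A quick way to verify those wheels from the training environment, assuming they expose importable modules under the same names (an ImportError points at the missing package):

```python
# Sanity check for the CANN-provided Python packages (assumed module names).
import te    # installed from the CANN te-*.whl
import hccl  # installed from the CANN hccl-*.whl

print("te and hccl imported successfully")
```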