mindspore-lab / mindocr

A toolbox of OCR models and algorithms based on MindSpore

Home Page: https://mindspore-lab.github.io/mindocr/


DBNet trains very slowly on Ascend 910B3

zx214 opened this issue

1. Environment: Ascend 910B3, MindSpore 2.2.0, CANN 7.0.RC1; training set of 1012 images, test set of 253 images, batch size 8.
2. Training log linked below:
dbnet_log.log
3. YAML config file:
system:
  mode: 0 # 0 for graph mode, 1 for pynative mode in MindSpore
  distribute: False
  amp_level: 'O0'
  seed: 42
  log_interval: 100
  val_while_train: True
  drop_overflow_update: False
  ckpt_max_keep: 0

common:
  num_epochs: &num_epochs 1
  batch_size: &batch_size 1
  num_workers: &num_workers 8
  ckpt_save_dir: &ckpt_save_dir './tmp_det'
  ckpt_load_path: &ckpt_load_path ./tmp_det/best.ckpt
  dataset_root: &dataset_root ./dataset_ic15/det
  resume: &resume True
  lr: &lr 0.007
  resume_epochs: 0
  data_shape: (736,1280)

model:
  type: det
  transform: null
  backbone:
    name: det_resnet50
    pretrained: True
  neck:
    name: DBFPN
    out_channels: 256
    bias: False
  head:
    name: DBHead
    k: 50
    bias: False
    adaptive: True
  resume: True
  pretrained: False

postprocess:
  name: DBPostprocess
  box_type: quad # whether to output a polygon or a box
  binary_thresh: 0.3 # binarization threshold
  box_thresh: 0.6 # box score threshold
  max_candidates: 1000
  expand_ratio: 1.5 # coefficient for expanding predictions

metric:
  name: DetMetric
  main_indicator: f-score

loss:
  name: DBLoss
  eps: 1.0e-6
  l1_scale: 10
  bce_scale: 5
  bce_replace: bceloss

scheduler:
  scheduler: polynomial_decay
  lr: *lr
  num_epochs: *num_epochs
  decay_rate: 0.9
  warmup_epochs: 3

optimizer:
  opt: SGD
  filter_bias_and_bn: false
  momentum: 0.9
  weight_decay: 1.0e-4

# only used for mixed precision training
loss_scaler:
  type: dynamic
  loss_scale: 512
  scale_factor: 2
  scale_window: 1000

train:
  ckpt_save_dir: *ckpt_save_dir
  dataset_sink_mode: False
  gradient_accumulation_steps: 1
  dataset:
    type: DetDataset
    dataset_root: *dataset_root
    data_dir: train/images
    label_file: train_gt.txt
    sample_ratio: 1.0
    transform_pipeline:
      - DecodeImage:
          img_mode: RGB
          to_float32: False
      - DetLabelEncode:
      - RandomColorAdjust:
          brightness: 0.1255 # 32.0 / 255
          saturation: 0.5
      - RandomHorizontalFlip:
          p: 0.5
      - RandomRotate:
          degrees: [ -10, 10 ]
          expand_canvas: False
          p: 1.0
      - RandomScale:
          scale_range: [ 0.5, 3.0 ]
          p: 1.0
      - RandomCropWithBBox:
          max_tries: 10
          min_crop_ratio: 0.1
          crop_size: [ 640, 640 ]
          p: 1.0
      - ValidatePolygons:
      - ShrinkBinaryMap:
          min_text_size: 8
          shrink_ratio: 0.4
      - BorderMap:
          shrink_ratio: 0.4
          thresh_min: 0.3
          thresh_max: 0.7
      - NormalizeImage:
          bgr_to_rgb: False
          is_hwc: True
          mean: imagenet
          std: imagenet
      - ToCHWImage:
    # the order of the dataloader list, matching the network input and the input labels for the loss function, and optional data for debug/visualize
    output_columns: [ 'image', 'binary_map', 'mask', 'thresh_map', 'thresh_mask' ] #'img_path']
    #output_columns: ['image'] # for debug op performance
    net_input_column_index: [0] # input indices for network forward func in output_columns
    label_column_index: [1, 2, 3, 4] # input indices marked as label

  loader:
    shuffle: True
    batch_size: *batch_size
    drop_remainder: True
    num_workers: *num_workers

eval:
  ckpt_load_path: tmp_det/best.ckpt
  dataset_sink_mode: False
  dataset:
    type: DetDataset
    dataset_root: *dataset_root
    data_dir: val/images
    label_file: test_gt.txt
    sample_ratio: 1.0
    transform_pipeline:
      - DecodeImage:
          img_mode: RGB
          to_float32: False
      - DetLabelEncode:
      - DetResize: # GridResize 32
          target_size: [ 736, 1280 ]
          keep_ratio: False
          limit_type: none
          divisor: 32
      - NormalizeImage:
          bgr_to_rgb: False
          is_hwc: True
          mean: imagenet
          std: imagenet
      - ToCHWImage:
    # the order of the dataloader list, matching the network input and the labels for evaluation
    output_columns: [ 'image', 'polys', 'ignore_tags', 'shape_list' ]
    net_input_column_index: [0] # input indices for network forward func in output_columns
    label_column_index: [1, 2] # input indices marked as label

  loader:
    shuffle: False
    batch_size: 1 # TODO: due to dynamic shape of polygons (num of boxes varies), BS has to be 1
    drop_remainder: False
    num_workers: 2

4. I have narrowed the slowness down to two parts: (1) compiling the kernel_meta information takes about an hour or more; (2) training itself is slow. How can these be resolved?

@zx214 Hello, and thanks for the feedback.

  1. A network compilation time of over an hour is not expected.
    We recommend upgrading MindSpore to r2.2.12 or r2.2.11 and installing the matching Ascend CANN package. Compared with r2.2.0, r2.2.12 fixes several bugs and includes performance optimizations. On the same hardware, training DBNet ResNet-50 with MindSpore r2.2.11, compilation takes roughly 1-2 minutes.

  2. On slow training:
    First, note that your num_epochs is 1; the first epoch is significantly slower than later ones because of graph compilation and loading.
    Second, set train.dataset_sink_mode to True; data sink mode can significantly improve training throughput.
    Third, if your CPU has weak single-core performance, try setting train.dataset.use_minddata to True, or moderately increase train.loader.num_workers (setting it too high backfires), to speed up data preprocessing.
    Finally, you can try raising system.amp_level to O2 or O3, which moves part of the computation to fp16, a mode with better affinity for Ascend hardware. Some networks then become harder to converge (or fail to converge entirely), so hyperparameters and options such as train.clip_grad and train.clip_norm need to be tuned accordingly. The relevant config keys are sketched below.
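A minimal sketch of the config keys mentioned above, merged into your YAML (values are illustrative starting points, not tuned recommendations; use_minddata is assumed to be supported by your MindOCR version):

```yaml
system:
  amp_level: 'O2'            # or 'O3'; watch convergence, tune clip_grad/clip_norm if needed

train:
  dataset_sink_mode: True    # data sink mode: feed batches to the device without per-step host sync
  dataset:
    use_minddata: True       # assumption: offloads part of the transform pipeline to MindData
  loader:
    num_workers: 8           # increase moderately; too many workers degrades throughput
```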

For reference, here is our log from training DBNet ResNet-50 with MindSpore r2.2.11 on the same hardware:

(MindSpore) [ma-user mindocr-main]$DEVICE_ID=2 python tools/train.py -c configs/det/dbnet/db_r50_icdar15.yaml
[2024-03-21 16:42:06] mindocr.train INFO - Standalone training. Device id: 2, specified by environment variable 'DEVICE_ID'.
[2024-03-21 16:42:11] mindocr.data.builder INFO - Creating dataloader (training=True) for device 0. Number of data samples: 1000 per device (1000 total).
[2024-03-21 16:42:39] mindocr.models.backbones.mindcv_models.utils INFO - Finish loading model checkpoint from: /home/ma-user/.mindspore/models/resnet50-e0733ab8.ckpt
[2024-03-21 16:42:39] mindocr.utils.loss_scaler INFO - gradient_accumulation_steps > 1, scale_factor and scale_window are adjusted accordingly for dynamic loss scaler
[2024-03-21 16:42:39] mindocr.optim.param_grouping INFO - no parameter grouping is applied.
[2024-03-21 16:42:48] mindocr.train INFO -
========================================
Distribute: False
Model: det_resnet50-DBFPN-DBHead
Total number of parameters: 25463362
Total number of trainable parameters: 25409730
Data root: /home/ma-user/work/evaluation-sh/MindOCR/mindocr-main/dataset/
Optimizer: SGD
Weight decay: 0.0001
Batch size: 10
Num devices: 1
Gradient accumulation steps: 2
Global batch size: 10x1x2=20
LR: 0.007
Scheduler: polynomial_decay
Steps per epoch: 100
Num epochs: 1200
Clip gradient: False
EMA: False
AMP level: O0
Loss scaler: {'type': 'dynamic', 'loss_scale': 512, 'scale_factor': 2, 'scale_window': 1000}
Drop overflow update: False
========================================

Start training... (The first epoch takes longer, please wait...)

[2024-03-21 16:44:03] mindocr.utils.callbacks INFO - epoch: [1/1200], loss: 5.056544, epoch time: 74.750 s, per step time: 747.501 ms, fps per card: 13.38 img/s
[2024-03-21 16:44:03] mindocr.utils.callbacks INFO - => Best train_loss: 5.056544303894043, checkpoint saved.
[2024-03-21 16:44:22] mindocr.utils.callbacks INFO - epoch: [2/1200], loss: 4.930840, epoch time: 15.967 s, per step time: 159.667 ms, fps per card: 62.63 img/s
[2024-03-21 16:44:22] mindocr.utils.callbacks INFO - => Best train_loss: 4.930840015411377, checkpoint saved.
[2024-03-21 16:44:41] mindocr.utils.callbacks INFO - epoch: [3/1200], loss: 4.721155, epoch time: 15.963 s, per step time: 159.631 ms, fps per card: 62.64 img/s
[2024-03-21 16:44:42] mindocr.utils.callbacks INFO - => Best train_loss: 4.721155166625977, checkpoint saved.
[2024-03-21 16:45:00] mindocr.utils.callbacks INFO - epoch: [4/1200], loss: 4.198533, epoch time: 15.969 s, per step time: 159.689 ms, fps per card: 62.62 img/s
[2024-03-21 16:45:01] mindocr.utils.callbacks INFO - => Best train_loss: 4.198533058166504, checkpoint saved.
[2024-03-21 16:45:19] mindocr.utils.callbacks INFO - epoch: [5/1200], loss: 4.210309, epoch time: 15.959 s, per step time: 159.593 ms, fps per card: 62.66 img/s
[2024-03-21 16:45:38] mindocr.utils.callbacks INFO - epoch: [6/1200], loss: 3.534603, epoch time: 15.959 s, per step time: 159.588 ms, fps per card: 62.66 img/s
[2024-03-21 16:45:38] mindocr.utils.callbacks INFO - => Best train_loss: 3.5346031188964844, checkpoint saved.
[2024-03-21 16:45:57] mindocr.utils.callbacks INFO - epoch: [7/1200], loss: 3.849508, epoch time: 15.965 s, per step time: 159.652 ms, fps per card: 62.64 img/s
[2024-03-21 16:46:15] mindocr.utils.callbacks INFO - epoch: [8/1200], loss: 3.630200, epoch time: 15.958 s, per step time: 159.576 ms, fps per card: 62.67 img/s
[2024-03-21 16:46:34] mindocr.utils.callbacks INFO - epoch: [9/1200], loss: 3.526414, epoch time: 15.963 s, per step time: 159.625 ms, fps per card: 62.65 img/s
[2024-03-21 16:46:34] mindocr.utils.callbacks INFO - => Best train_loss: 3.526414394378662, checkpoint saved.
[2024-03-21 16:46:53] mindocr.utils.callbacks INFO - epoch: [10/1200], loss: 3.429994, epoch time: 15.957 s, per step time: 159.572 ms, fps per card: 62.67 img/s
[2024-03-21 16:46:53] mindocr.utils.callbacks INFO - => Best train_loss: 3.4299941062927246, checkpoint saved.
[2024-03-21 16:47:12] mindocr.utils.callbacks INFO - epoch: [11/1200], loss: 2.924218, epoch time: 15.966 s, per step time: 159.657 ms, fps per card: 62.63 img/s
[2024-03-21 16:47:12] mindocr.utils.callbacks INFO - => Best train_loss: 2.92421817779541, checkpoint saved.
[2024-03-21 16:47:31] mindocr.utils.callbacks INFO - epoch: [12/1200], loss: 3.281451, epoch time: 15.956 s, per step time: 159.565 ms, fps per card: 62.67 img/s
[2024-03-21 16:47:49] mindocr.utils.callbacks INFO - epoch: [13/1200], loss: 3.589340, epoch time: 15.958 s, per step time: 159.577 ms, fps per card: 62.67 img/s
[2024-03-21 16:48:07] mindocr.utils.callbacks INFO - epoch: [14/1200], loss: 3.343097, epoch time: 15.958 s, per step time: 159.584 ms, fps per card: 62.66 img/s
[2024-03-21 16:48:26] mindocr.utils.callbacks INFO - epoch: [15/1200], loss: 3.465981, epoch time: 15.956 s, per step time: 159.561 ms, fps per card: 62.67 img/s
[2024-03-21 16:48:44] mindocr.utils.callbacks INFO - epoch: [16/1200], loss: 3.380215, epoch time: 15.960 s, per step time: 159.596 ms, fps per card: 62.66 img/s
[2024-03-21 16:49:03] mindocr.utils.callbacks INFO - epoch: [17/1200], loss: 3.600936, epoch time: 15.959 s, per step time: 159.589 ms, fps per card: 62.66 img/s
[2024-03-21 16:49:21] mindocr.utils.callbacks INFO - epoch: [18/1200], loss: 3.199931, epoch time: 15.958 s, per step time: 159.581 ms, fps per card: 62.66 img/s
[2024-03-21 16:49:40] mindocr.utils.callbacks INFO - epoch: [19/1200], loss: 3.428349, epoch time: 15.957 s, per step time: 159.572 ms, fps per card: 62.67 img/s
[2024-03-21 16:49:59] mindocr.utils.callbacks INFO - epoch: [20/1200], loss: 3.225470, epoch time: 15.958 s, per step time: 159.576 ms, fps per card: 62.67 img/s

Thanks, I'll give it a try.

Hello, I installed MindSpore 2.2.12 and CANN 7.0.0.beta1 following the official instructions, and with the unmodified source code the model compilation time is fine. For my task I added summary recording and modified the data-loading part, leaving the model untouched, yet compilation still takes very long. What could be slowing down model compilation?

@zx214 Hello. From your description, the long delay seems to be between launching the Python script and the start of training computation, which is not necessarily MindSpore's static-graph compilation time.

To isolate the code that is actually taking the time, consider the following:

  1. Strip the newly added summary and data-loading code out of the project and unit-test it in isolation to measure its runtime;
  2. In the project code (tools/main.py), use Python's time module to time each stage, as sketched below.
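A minimal sketch of the timing instrumentation meant in step 2 (the stage names and functions in the comments are hypothetical placeholders for whatever tools/main.py actually calls):

```python
import time

def timed(label, fn, *args, **kwargs):
    # Run fn, print its wall-clock duration, and pass its result through.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"[timing] {label}: {time.perf_counter() - start:.2f} s")
    return result

# Hypothetical usage inside tools/main.py -- wrap each stage:
#   cfg     = timed("parse config", parse_config, args)
#   loader  = timed("build dataloader", build_dataloader, cfg)
#   network = timed("build network", build_network, cfg)
#   timed("train", trainer.run, cfg)
```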

Based on your description, possible causes of the long runtime include:

  1. The summary is writing too much redundant information, which takes a long time;
  2. Data loading is inefficient, e.g. individual samples in the dataset have very high resolution, or disk read/write speed is low;
  3. MindSpore graph mode spends a relatively long time on the first graph compilation; subsequent runs reuse the previously generated compilation cache. If you manually delete the cache before every run, compilation will be slow every time. A sketch of enabling the cache follows this list.
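As a sketch of item 3, assuming the enable_compile_cache / compile_cache_path options of mindspore.set_context (available in recent 2.x releases; please check the docs for your exact version):

```python
import mindspore as ms

# Persist the graph-compilation cache across runs so that later runs
# reuse it instead of recompiling from scratch (assumed set_context options).
ms.set_context(mode=ms.GRAPH_MODE,
               enable_compile_cache=True,
               compile_cache_path="./compile_cache")
```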

OK, I'll track it down. One more question: in 2-card training I get logs like the following. What causes this, and how can it be fixed?
[screenshot of error log attached]

@zx214 Hello. Regarding the error log you posted: since the MindSpore source code does not contain that error message, the problem is probably not introduced by MindSpore. Please check your Ascend environment configuration, including whether the hardware is installed properly and whether the CANN package is installed correctly (for example, whether the te and hccl whl packages were installed after installing CANN). A quick import check is sketched below.
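A quick way to verify those wheels from the training environment, assuming they expose importable modules under the same names (an ImportError points at the missing package):

```python
# Sanity check for the CANN-provided Python packages (assumed module names).
import te    # installed from the CANN te-*.whl
import hccl  # installed from the CANN hccl-*.whl

print("te and hccl imported successfully")
```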