mindspore-lab / mindaudio

A toolbox of audio models and algorithms based on MindSpore


[ecapa-tdnn] [Ascend] The distributed training script needs to be modified

787918582 opened this issue · comments

If this is your first time, please read our contributor guidelines:
https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md

Describe the bug (Mandatory)
In the current run_distribute_train_ascend.sh script, the log for device 0 (rank 0) is not saved.

  • Hardware Environment (Ascend/GPU/CPU):

/device ascend

  • Software Environment (Mandatory):
    -- MindSpore version: commit_id = '[sha1]:8a30fd67,[branch]:(HEAD,origin/master,origin/HEAD,master)'
    -- Python version: 3.7.5
    -- OS platform and distribution: Ubuntu
    -- GCC/Compiler version (if compiled from source): 7.3.0

  • Execute Mode (Mandatory) (PyNative/Graph):

/mode pynative
/mode graph

To Reproduce (Mandatory)
Steps to reproduce the behavior:

  1. bash run_distribute_train_ascend.sh /data3/zl/Mindlab_data/dataset/hccl_8p.json

Expected behavior (Mandatory)
The log for device 0 should be saved during distributed training.

Screenshots / Logs (Mandatory)
if [ $# != 1 ]
then
echo "Usage: bash run_distribute_train.sh [RANK_TABLE_FILE]"
exit 1
fi

export RANK_TABLE_FILE=$1
export DEVICE_NUM=8
export RANK_SIZE=8

if [ ! -f $1 ]
then
echo "RANK_TABLE_FILE Does Not Exist!"
exit 1
fi

for((i=1; i<${DEVICE_NUM}; i++))
do
export DEVICE_ID=$i
export RANK_ID=$i
rm -rf ./train_parallel$i
mkdir ./train_parallel$i
cp ./*.py ./train_parallel$i
cp ./*.yaml ./train_parallel$i
cd ./train_parallel$i || exit
echo "start training for rank $RANK_ID, device $DEVICE_ID"
env > env.log
python train_speaker_embeddings.py --need_generate_data=False --run_distribute=1 > train.log 2>&1 &
cd ..
done
export DEVICE_ID=0
export RANK_ID=0
rm -rf ./train_parallel0
mkdir ./train_parallel0
cp ./*.py ./train_parallel0
cp ./*.yaml ./train_parallel0
cd ./train_parallel0 || exit
echo "start training for rank $RANK_ID, device $DEVICE_ID"
env > env.log
python train_speaker_embeddings.py --need_generate_data=False --run_distribute=1 2>&1
cd ..
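The likely cause is visible in the last invocation above: every worker launched in the loop redirects its output with `> train.log 2>&1`, but the rank-0 command ends with only `2>&1`, so stdout and stderr both stay on the console and no train.log is written in train_parallel0. A minimal sketch of the difference, using `echo` as a stand-in for the real `train_speaker_embeddings.py` call (the demo directory name is an assumption, chosen to avoid clobbering the real train_parallel0):

```shell
#!/bin/bash
# Demonstration of the missing redirect; "echo" stands in for the
# training command, which this sketch does not actually run.
mkdir -p ./train_parallel0_demo
cd ./train_parallel0_demo || exit 1

# Buggy form: "2>&1" alone merges stderr into stdout, but with no
# "> train.log" in front, nothing is written to a file.
echo "rank 0 output (goes to the console, not saved)" 2>&1

# Fixed form, matching the per-rank loop: redirect stdout to train.log
# first, then duplicate stderr onto it.
echo "start training for rank 0, device 0" > train.log 2>&1

cd ..
cat ./train_parallel0_demo/train.log  # prints: start training for rank 0, device 0
```

Note that redirection order matters: `2>&1 > train.log` would send stderr to the console and only stdout to the file, so the `> train.log 2>&1` ordering used in the loop is the one to copy.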

Additional context (Optional)

Please check @LiTingyu1997