[ecapa-tdnn] [Ascend] The distributed training script needs to be modified
787918582 opened this issue · comments
If this is your first time, please read our contributor guidelines:
https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md
Describe the bug/ 问题描述 (Mandatory / 必填)
In the current run_distribute_train_ascend.sh, the log of card 0 (device 0) is not saved.
- Hardware Environment (Ascend/GPU/CPU) / 硬件环境:
Please delete the backend not involved / 请删除不涉及的后端:
/device ascend
- Software Environment / 软件环境 (Mandatory / 必填):
-- MindSpore version (e.g., 1.7.0.Bxxx) :commit_id = '[sha1]:8a30fd67,[branch]:(HEAD,origin/master,origin/HEAD,master)'
-- Python version (e.g., Python 3.7.5) :3.7.5
-- OS platform and distribution (e.g., Linux Ubuntu 16.04):Ubuntu
-- GCC/Compiler version (if compiled from source):7.3.0
- Execute Mode / 执行模式 (Mandatory / 必填) (PyNative/Graph):
Please delete the mode not involved / 请删除不涉及的模式:
/mode pynative
/mode graph
To Reproduce / 重现步骤 (Mandatory / 必填)
Steps to reproduce the behavior:
- bash run_distribute_train_ascend.sh /data3/zl/Mindlab_data/dataset/hccl_8p.json
Expected behavior / 预期结果 (Mandatory / 必填)
The log of card 0 should be saved during distributed training.
Screenshots/ 日志 / 截图 (Mandatory / 必填)
if [ $# != 1 ]
then
echo "Usage: bash run_distribute_train.sh [RANK_TABLE_FILE]"
exit 1
fi
export RANK_TABLE_FILE=$1
export DEVICE_NUM=8
export RANK_SIZE=8
if [ ! -f $1 ]
then
echo "RANK_TABLE_FILE Does Not Exist!"
exit 1
fi
for((i=1; i<${DEVICE_NUM}; i++))
do
export DEVICE_ID=$i
export RANK_ID=$i
rm -rf ./train_parallel$i
mkdir ./train_parallel$i
cp ./*.py ./train_parallel$i
cp ./*.yaml ./train_parallel$i
cd ./train_parallel$i || exit
echo "start training for rank $RANK_ID, device $DEVICE_ID"
env > env.log
python train_speaker_embeddings.py --need_generate_data=False --run_distribute=1 > train.log 2>&1 &
cd ..
done
export DEVICE_ID=0
export RANK_ID=0
rm -rf ./train_parallel0
mkdir ./train_parallel0
cp ./*.py ./train_parallel0
cp ./*.yaml ./train_parallel0
cd ./train_parallel0 || exit
echo "start training for rank $RANK_ID, device $DEVICE_ID"
env > env.log
python train_speaker_embeddings.py --need_generate_data=False --run_distribute=1 2>&1
cd ..
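In the script above, ranks 1 to 7 redirect their output with `> train.log 2>&1 &`, while rank 0 is started with only `2>&1`, so its output goes to the terminal and no train.log is written. One possible fix, sketched below, is to launch every rank through the same helper so rank 0 gets the same redirection. The function name `launch_rank` is hypothetical and not part of the repository; this is a suggestion under that assumption, not the maintainers' fix.

```shell
# Hypothetical helper (not in the repository): start one rank in
# ./train_parallel<rank> and save its stdout and stderr to train.log.
# Usage: launch_rank <rank> <command> [args...]
launch_rank() {
    local rank=$1
    shift
    export DEVICE_ID=$rank
    export RANK_ID=$rank
    rm -rf "./train_parallel$rank"
    mkdir "./train_parallel$rank"
    cp ./*.py ./*.yaml "./train_parallel$rank"
    (
        # The subshell keeps the caller's working directory unchanged,
        # so no "cd .." is needed afterwards.
        cd "./train_parallel$rank" || exit 1
        echo "start training for rank $RANK_ID, device $DEVICE_ID"
        env > env.log
        # Same redirection for every rank, including rank 0.
        "$@" > train.log 2>&1
    ) &
}

# Intended use in place of the loop and the separate rank 0 block:
#   for i in $(seq 0 $((DEVICE_NUM - 1))); do
#       launch_rank "$i" python train_speaker_embeddings.py \
#           --need_generate_data=False --run_distribute=1
#   done
#   wait
```

If rank 0 output should still appear on the terminal, replacing the redirection on the original rank 0 line with `2>&1 | tee train.log` would save the log while keeping it visible.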
Additional context / 备注 (Optional / 选填)
Add any other context about the problem here.
please check @LiTingyu1997