Does finetune_cpm_bee.py currently support model-parallel training, and how should the arguments be set?

Question

Does finetune_cpm_bee.py currently support model-parallel training, and how should the arguments be set?

diaojunxian opened this issue a year ago · comments

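The machine I am running on has four 3090 GPUs, but launching delta (incremental) fine-tuning with the following commands fails with an error: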

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32

torchrun --nnodes=1 --nproc_per_node=4 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:12345 finetune_cpm_bee.py --use-delta --model-config /home/CPM-Bee/src/config/cpm-bee-10b.json xxxxx

OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 23.70 GiB total capacity; 22.86 GiB already allocated; 64.44 MiB free; 23.16 GiB reserved in total by PyTorch) If reserved memory is

It looks like the current parallelism is DDP-based data parallelism. Does finetune_cpm_bee.py currently support model-parallel training?

Zhi Zheng · Answer 1 · Fri Jun 16 2023 11:33:37 GMT+0800 (China Standard Time)

The current parallelism is not plain data parallelism; the default is ZeRO-3.
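To illustrate why this distinction matters for memory, here is a rough back-of-the-envelope sketch comparing the weight footprint per GPU when every GPU holds a full copy (as in plain DDP) versus when the weights are sharded across GPUs (as in ZeRO-3). The parameter count (10 billion, guessed from the config name cpm-bee-10b.json) and the 2-byte fp16 storage are assumptions, not values confirmed in this thread, and the helper name is hypothetical. The estimate covers weights only; activations, temporarily gathered parameters, gradients, and optimizer state are not counted, so it does not predict whether a given run fits.

```python
def weight_gib_per_gpu(num_params, bytes_per_param, world_size, sharded):
    """Rough per-GPU memory for model weights only, in GiB.

    sharded=False: every GPU keeps a full replica (plain data parallelism).
    sharded=True:  weights are split evenly across GPUs (ZeRO-3 style).
    """
    total_bytes = num_params * bytes_per_param
    per_gpu_bytes = total_bytes / world_size if sharded else total_bytes
    return per_gpu_bytes / 2**30


NUM_PARAMS = 10e9   # assumption: ~10B parameters, inferred from "cpm-bee-10b"
BYTES_FP16 = 2      # assumption: weights stored in half precision
WORLD_SIZE = 4      # four GPUs, as in the question

replicated = weight_gib_per_gpu(NUM_PARAMS, BYTES_FP16, WORLD_SIZE, sharded=False)
sharded = weight_gib_per_gpu(NUM_PARAMS, BYTES_FP16, WORLD_SIZE, sharded=True)

print(f"full replica per GPU: {replicated:.2f} GiB")
print(f"ZeRO-3 shard per GPU: {sharded:.2f} GiB")
```

Under these assumptions a full replica alone would take most of a 24 GiB card, while a shard takes about a quarter of that, so the remaining memory pressure would come from the parts this sketch leaves out.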