zjunlp / KnowLM

An Open-sourced Knowledgable Large Language Model Framework.

Home Page:http://knowlm.zjukg.cn/


预训练中deepspeed版本问题

twwch opened this issue · comments

[screenshot attached]

Could you tell me which DeepSpeed version you used? This is my training command:

deepspeed train.py \
    --model_name_or_path /public/home/chenhao/models/llama-13b-hf \
    --model_max_length 1024 \
    --data_path ./data/data/data \
    --output_dir ./output \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 1 \
    --learning_rate 1.5e-5 \
    --warmup_steps 300 \
    --logging_steps 1 \
    --report_to "tensorboard" \
    --gradient_checkpointing True \
    --deepspeed configs/config.json \
    --fp16 True \
    --log_on_each_node False \
    --lr_scheduler_type "cosine" \
    --adam_beta1 0.9 \
    --adam_beta2 0.95 \
    --weight_decay 0.1

Hello, the DeepSpeed version we used is 0.8.3. Below is the output of running ds_report in our environment:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/root/anaconda3/envs/llama/lib/python3.10/site-packages/torch']
torch version .................... 1.12.0
deepspeed install path ........... ['/root/anaconda3/envs/llama/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.8.3+4d27225f, 4d27225f, master
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
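To reproduce an environment close to the one reported above, the versions from the ds_report output can be pinned explicitly. This is a minimal sketch, assuming a CUDA 11.3 system; the maintainers' exact DeepSpeed build (0.8.3+4d27225f) is a source checkout, so the nearest PyPI release is used here:

```shell
# Hedged sketch: pin the versions reported by ds_report above.
# torch 1.12.0 built against CUDA 11.3, matching "torch cuda version ... 11.3".
pip install torch==1.12.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

# Nearest released DeepSpeed version (the report shows 0.8.3 from master).
pip install deepspeed==0.8.3

# Re-run the report to confirm the environment matches.
ds_report
```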
commented

This has nothing to do with DeepSpeed... your AMD CPU simply doesn't support it. Switch to an Intel machine, friend.

commented

Hello, please check whether your gcc version is below 10. If it is above 10, please try downgrading gcc to version 7.
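A minimal sketch of checking and downgrading gcc as suggested above, assuming a conda environment is used for compiling DeepSpeed's JIT ops (the `gcc_linux-64`/`gxx_linux-64` package names are conda-forge's compiler toolchain; adjust for your distribution if you manage gcc another way):

```shell
# Check the currently active gcc version.
gcc --version | head -n 1

# If it is above 10, one option (assuming conda) is to install a gcc 7
# toolchain into the environment used to build DeepSpeed's ops.
conda install -c conda-forge gcc_linux-64=7 gxx_linux-64=7
```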