Mistral Micro Eval Crashes With DeepSpeed
J38 opened this issue
Running the mistral-micro.yaml example, eval crashes. Sample output:
/nlp/scr/jebolton/miniconda3/envs/mistral/lib/python3.8/site-packages/transformers/trainer.py:2543: RuntimeWarning: Mean of empty slice.
metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item()
/nlp/scr/jebolton/miniconda3/envs/mistral/lib/python3.8/site-packages/numpy/core/_methods.py:190: RuntimeWarning: invalid value encountered in divide
ret = ret.dtype.type(ret / rcount)
{'eval_loss': nan, 'eval_runtime': 0.5983, 'eval_samples_per_second': 0.0, 'eval_steps_per_second': 0.0, 'epoch': 0.0}
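For context, the `nan` comes from numpy taking the mean of an empty array: the trainer's gathered loss array appears to be empty, so `all_losses.mean()` emits the "Mean of empty slice" warning and returns `nan`. A minimal sketch of that failure mode (the empty `all_losses` array is an assumption about what the trainer sees here, not taken from the trainer itself):

```python
import math
import warnings

import numpy as np

# Assumption: under this DeepSpeed setup the trainer gathers zero
# per-batch eval losses, leaving an empty array.
all_losses = np.array([])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Mean of an empty array: emits "Mean of empty slice." and
    # "invalid value encountered in divide", returns nan -- matching
    # the two RuntimeWarnings and eval_loss=nan in the log above.
    eval_loss = all_losses.mean()

print(eval_loss)            # nan
print(math.isnan(eval_loss))  # True
```

This is why the metrics also show `eval_samples_per_second: 0.0` — no eval batches were actually processed on this rank.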
This is with
deepspeed==0.6.5
torch==1.11.0
transformers==4.18.0
Full command:
deepspeed --num_gpus 8 --num_nodes 1 --master_addr localhost --hostfile hostfile train.py --config conf/mistral-micro.yaml --nnodes 1 --nproc_per_node 8 --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 4 --training_arguments.deepspeed conf/deepspeed/z2-small-conf.json --run_id mistral-micro-deepspeed-8gpu
Fixed by #170.