Mistral Micro Eval Crashes With DeepSpeed
J38 opened this issue
Running the mistral-micro.yaml example, eval crashes. Sample output:
/nlp/scr/jebolton/miniconda3/envs/mistral/lib/python3.8/site-packages/transformers/trainer.py:2543: RuntimeWarning: Mean of empty slice.
metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item()
/nlp/scr/jebolton/miniconda3/envs/mistral/lib/python3.8/site-packages/numpy/core/_methods.py:190: RuntimeWarning: invalid value encountered in divide
ret = ret.dtype.type(ret / rcount)
{'eval_loss': nan, 'eval_runtime': 0.5983, 'eval_samples_per_second': 0.0, 'eval_steps_per_second': 0.0, 'epoch': 0.0}
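For context, the `nan` comes from numpy taking the mean of an empty array: the trainer's gathered loss array appears to be empty, so `all_losses.mean()` emits the "Mean of empty slice" warning and returns `nan`. A minimal sketch of that failure mode (the empty `all_losses` array is an assumption about what the trainer sees here, not taken from the trainer itself):

```python
import math
import warnings

import numpy as np

# Assumption: under this DeepSpeed setup the trainer gathers zero
# per-batch eval losses, leaving an empty array.
all_losses = np.array([])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Mean of an empty array: emits "Mean of empty slice." and
    # "invalid value encountered in divide", returns nan -- matching
    # the two RuntimeWarnings and eval_loss=nan in the log above.
    eval_loss = all_losses.mean()

print(eval_loss)            # nan
print(math.isnan(eval_loss))  # True
```

This is why the metrics also show `eval_samples_per_second: 0.0` — no eval batches were actually processed on this rank.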
This is with
deepspeed==0.6.5
torch==1.11.0
transformers==4.18.0
Full command:
deepspeed --num_gpus 8 --num_nodes 1 --master_addr localhost --hostfile hostfile train.py --config conf/mistral-micro.yaml --nnodes 1 --nproc_per_node 8 --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 4 --training_arguments.deepspeed conf/deepspeed/z2-small-conf.json --run_id mistral-micro-deepspeed-8gpu
Fixed by #170.