Using anything > 2048 for batch_max_length during training results in CUDA index errors
corey-lambda opened this issue
This is on a machine with 8 A100 GPUs, 80 GB each.
The dataset is https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset/resolve/main/sharegpt_clean.json, converted to the conversation format described in the README and then tokenized with:
```bash
python -m ochat.data.generate_dataset --model-type openchat_v3.2 --model-path imone/LLaMA2_7B_with_EOT_token --in-files sharegpt_clean.jsonl --out-prefix .
```
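For reference, here is a minimal sketch of the `.json` → `.jsonl` step I used (the dataset downloads as a single JSON array, while the command above reads `sharegpt_clean.jsonl`). The actual per-record field mapping to the conversation format is whatever the README specifies; records are passed through unchanged in this sketch:

```python
import json

# Hypothetical pre-processing step: sharegpt_clean.json is one JSON array,
# but generate_dataset above expects line-delimited sharegpt_clean.jsonl.
with open("sharegpt_clean.json") as f:
    records = json.load(f)

with open("sharegpt_clean.jsonl", "w") as f:
    for rec in records:
        # Any renaming/filtering into the README's conversation format
        # would go here; this sketch writes each record as-is.
        f.write(json.dumps(rec) + "\n")
```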
Training command used:
```bash
deepspeed --num_gpus=8 --module ochat.training_deepspeed.train \
    --model_path imone/LLaMA2_7B_with_EOT_token \
    --data_prefix ./data/ \
    --save_path ./checkpoints/llama2-7b/ \
    --batch_max_len 4096 \
    --epochs 5 \
    --save_every 1 \
    --deepspeed \
    --deepspeed_config deepspeed_config.json \
    > info.log \
    2> error.log
```
Here are the stdout & stderr log files from running the above:
error.log
info.log
Update: The recommended batch_max_length works if you select `--model_path imone/Mistral_7B_with_EOT_token`.
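In case it helps triage, here is a quick check based purely on an assumption of mine: that the index error comes from sequence positions exceeding the checkpoint's configured context window, which would also explain why the Mistral checkpoint behaves differently. It just compares the two models' configs via `transformers`:

```python
from transformers import AutoConfig

# Assumption: batch_max_len > max_position_embeddings triggers the CUDA
# index error. Print the configured context length for both checkpoints.
for name in ("imone/LLaMA2_7B_with_EOT_token", "imone/Mistral_7B_with_EOT_token"):
    cfg = AutoConfig.from_pretrained(name)
    print(name, "max_position_embeddings =", cfg.max_position_embeddings)
```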