Multi-machine Data Processing Fails
J38 opened this issue
Here's the command I'm running on 4 machines:
deepspeed --num_gpus 8 --num_nodes 4 --master_addr sphinx5 --hostfile hostfile train.py --config conf/mistral-32gpu.yaml --nnodes 4 --nproc_per_node 8 --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 4 --training_arguments.deepspeed conf/deepspeed/z2-small-conf.json --run_id gpt2-32gpu-demo
Here's the crash error (every rank on the node raises the same traceback, interleaved in the log; one copy shown below):

sphinx5: Traceback (most recent call last):
sphinx5:   File "train.py", line 264, in <module>
sphinx5:     train()
sphinx5:   File "train.py", line 122, in train
sphinx5:     custom_eval_datasets, lm_dataset = load_datasets(quinfig, paths, tokenizer, overwatch)
sphinx5:   File "train.py", line 191, in load_datasets
sphinx5:     _preprocess_once_per_machine(quinfig, paths, tokenizer, overwatch)
sphinx5:   File "train.py", line 256, in _preprocess_once_per_machine
sphinx5:     raise RuntimeError(f"Forked process exited with status {status[0]}")
sphinx5: RuntimeError: Forked process exited with status -7
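For context: a negative exit status from a forked child is the number of the signal that killed it, so -7 on Linux decodes to SIGBUS (often a sign that shared memory or mapped files ran out during preprocessing, though that's just my guess here). A quick way to check the decoding:

```python
import signal

# A negative status from os.waitpid()/multiprocessing means the child was
# killed by that signal number (assumes Linux signal numbering).
status = -7
sig = signal.Signals(-status)
print(sig.name)  # SIGBUS on Linux
```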