Multi-machine Data Processing Fails
J38 opened this issue
Here's the command I'm running on 4 machines:
deepspeed --num_gpus 8 --num_nodes 4 --master_addr sphinx5 --hostfile hostfile train.py --config conf/mistral-32gpu.yaml --nnodes 4 --nproc_per_node 8 --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 4 --training_arguments.deepspeed conf/deepspeed/z2-small-conf.json --run_id gpt2-32gpu-demo
Here's the crash error (every rank on the node raises the same traceback, interleaved in the log; one copy shown below):

sphinx5: Traceback (most recent call last):
sphinx5:   File "train.py", line 264, in <module>
sphinx5:     train()
sphinx5:   File "train.py", line 122, in train
sphinx5:     custom_eval_datasets, lm_dataset = load_datasets(quinfig, paths, tokenizer, overwatch)
sphinx5:   File "train.py", line 191, in load_datasets
sphinx5:     _preprocess_once_per_machine(quinfig, paths, tokenizer, overwatch)
sphinx5:   File "train.py", line 256, in _preprocess_once_per_machine
sphinx5:     raise RuntimeError(f"Forked process exited with status {status[0]}")
sphinx5: RuntimeError: Forked process exited with status -7
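For context: a negative exit status from a forked child is the number of the signal that killed it, so -7 on Linux decodes to SIGBUS (often a sign that shared memory or mapped files ran out during preprocessing, though that's just my guess here). A quick way to check the decoding:

```python
import signal

# A negative status from os.waitpid()/multiprocessing means the child was
# killed by that signal number (assumes Linux signal numbering).
status = -7
sig = signal.Signals(-status)
print(sig.name)  # SIGBUS on Linux
```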