FileNotFoundError thrown when training
obicons opened this issue
Describe the bug
I am able to train on a single node using the 19M.yml file, but I am unable to train in a distributed setting using any of the provided configurations as a starting point. I always receive the error below (800M config). In the raw logs, each line is prefixed with the emitting node's IP and the nodes' tracebacks are interleaved; de-interleaved, every node fails with the same traceback:
  File "/home/ubuntu/gpt-neox/megatron/training.py", line 203, in pretrain
    ) = build_train_valid_test_data_iterators(neox_args=neox_args)
  File "/home/ubuntu/gpt-neox/megatron/data/data_utils.py", line 408, in build_train_valid_test_data_iterators
    train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
  File "/home/ubuntu/gpt-neox/megatron/data/data_utils.py", line 145, in build_train_valid_test_datasets
    train_dataset = build_dataset(0, "train")
  File "/home/ubuntu/gpt-neox/megatron/data/data_utils.py", line 133, in build_dataset
    dataset = GPT2Dataset(
  File "/home/ubuntu/gpt-neox/megatron/data/gpt2_dataset.py", line 54, in __init__
    self.doc_idx, self.sample_idx, self.shuffle_idx = _build_index_mappings(
  File "/home/ubuntu/gpt-neox/megatron/data/gpt2_dataset.py", line 231, in _build_index_mappings
    doc_idx = np.load(doc_idx_filename, allow_pickle=True, mmap_mode="r")
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/numpy/lib/npyio.py", line 427, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'data/enwik8/enwik8_text_document_train_indexmap_18304000ns_2048sl_1234s_doc_idx.npy'
The complete output is very long, but there is no error before this point. The configurations are completely unmodified, except that I added a hostfile setting to the configuration.
OS: Ubuntu 20.04
Python: 3.9
All nodes have separate filesystems. Each node can individually train the 19M config. Every node is accessible via SSH.
Expected behavior
Training proceeds without throwing this exception.
Proposed solution
N/A
Environment (please complete the following information):
- GPUs: 8 servers, 1 NVIDIA T4 GPU on each server
- Configs: configs/800M.yml
Do you have something at data/enwik8/enwik8_text_document_train_indexmap_18304000ns_2048sl_1234s_doc_idx.npy for the model to train on?
No. There was a similarly named file, though. I don't know where this filename derives from, since it doesn't appear in any configuration that I edited.
You should have a file called enwik8_text_document. The exact file you're referring to was created by the preprocessing script, and the suffix refers to the configuration of the preprocessing script.
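For context, the suffix encodes the index-mapping parameters rather than anything set by hand. Here's a rough sketch of how _build_index_mappings in megatron/data/gpt2_dataset.py assembles that name (approximate, not verbatim; the values are read off the error above):

    # Approximate reconstruction of the index-map filename logic.
    data_prefix = "data/enwik8/enwik8_text_document"  # from the "data-path" config
    name = "train"           # dataset split
    num_samples = 18304000   # derived from train iterations and batch size -> "18304000ns"
    seq_length = 2048        # -> "2048sl"
    seed = 1234              # -> "1234s"

    stem = f"{data_prefix}_{name}_indexmap_{num_samples}ns_{seq_length}sl_{seed}s"
    doc_idx_filename = stem + "_doc_idx.npy"
    print(doc_idx_filename)
    # data/enwik8/enwik8_text_document_train_indexmap_18304000ns_2048sl_1234s_doc_idx.npy

So if any of those parameters differ between runs (or between nodes), the expected filename changes, which is why a "similarly named file" can exist without matching.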
I would try removing all preprocessed files and running the preprocessing again.
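Something like the following would clear the stale index maps before re-running preprocessing (the glob pattern is an assumption based on the path in the error above; adjust it to your data directory):

    # Hypothetical cleanup of stale index-map files; pattern inferred from the error.
    import glob
    import os

    for path in glob.glob("data/enwik8/enwik8_text_document_*_indexmap_*"):
        os.remove(path)
        print("removed", path)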
"All nodes have separate filesystems. Each node can individually train the 19M config. Every node is accessible via SSH."
I think this is the issue: most clusters we run GPT-NeoX on have a shared filesystem across nodes.
I believe adding "use_shared_fs": False should suffice to fix your issue! It causes local rank 0 on every node to build these intermediate .npy files, rather than a single rank building them once for everyone.
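In the JSON-style .yml format the repo's configs use, that would look something like this (snippet only; keep the rest of your 800M.yml as-is):

    {
      # ... your existing 800M.yml settings ...
      "use_shared_fs": false,
    }

With a shared filesystem, one rank builds the .npy index maps and the other ranks simply load them once they appear; with separate filesystems, those files never show up on the other nodes, which would explain the FileNotFoundError you're seeing.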
@obicons, please reopen if using "use_shared_fs": False doesn't resolve this for you.