EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries

Home Page: https://www.eleuther.ai/

FileNotFoundError thrown when training

obicons opened this issue

Describe the bug
I am able to train on a single node using the 19M.yml file, but I am unable to train in a distributed setting using any of the provided configurations as a starting point. I always receive this error (with the 800M config):

172.31.21.158:   File "/home/ubuntu/gpt-neox/megatron/training.py", line 203, in pretrain
172.31.21.158:     ) = build_train_valid_test_data_iterators(neox_args=neox_args)
172.31.21.158:   File "/home/ubuntu/gpt-neox/megatron/data/data_utils.py", line 408, in build_train_valid_test_data_iterators
172.31.21.158:     train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
172.31.21.158:   File "/home/ubuntu/gpt-neox/megatron/data/data_utils.py", line 145, in build_train_valid_test_datasets
172.31.16.64:   File "/home/ubuntu/gpt-neox/megatron/data/data_utils.py", line 408, in build_train_valid_test_data_iterators
172.31.16.64:     train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
172.31.16.64:   File "/home/ubuntu/gpt-neox/megatron/data/data_utils.py", line 145, in build_train_valid_test_datasets
172.31.16.64:     train_dataset = build_dataset(0, "train")
172.31.21.63:   File "/home/ubuntu/gpt-neox/megatron/data/data_utils.py", line 145, in build_train_valid_test_datasets
172.31.21.63:     train_dataset = build_dataset(0, "train")
172.31.21.63:   File "/home/ubuntu/gpt-neox/megatron/data/data_utils.py", line 133, in build_dataset
172.31.21.63:     dataset = GPT2Dataset(
172.31.21.63:   File "/home/ubuntu/gpt-neox/megatron/data/gpt2_dataset.py", line 54, in __init__
172.31.21.63:     self.doc_idx, self.sample_idx, self.shuffle_idx = _build_index_mappings(
172.31.21.63:   File "/home/ubuntu/gpt-neox/megatron/data/gpt2_dataset.py", line 231, in _build_index_mappings
172.31.21.63:     doc_idx = np.load(doc_idx_filename, allow_pickle=True, mmap_mode="r")
172.31.21.63:   File "/home/ubuntu/.venv/lib/python3.9/site-packages/numpy/lib/npyio.py", line 427, in load
172.31.18.171:   File "/home/ubuntu/gpt-neox/megatron/data/gpt2_dataset.py", line 231, in _build_index_mappings
172.31.18.171:     doc_idx = np.load(doc_idx_filename, allow_pickle=True, mmap_mode="r")
172.31.18.171:   File "/home/ubuntu/.venv/lib/python3.9/site-packages/numpy/lib/npyio.py", line 427, in load
172.31.18.171:     fid = stack.enter_context(open(os_fspath(file), "rb"))
172.31.20.74:   File "/home/ubuntu/gpt-neox/megatron/data/data_utils.py", line 133, in build_dataset
172.31.20.74:     dataset = GPT2Dataset(
172.31.20.74:   File "/home/ubuntu/gpt-neox/megatron/data/gpt2_dataset.py", line 54, in __init__
172.31.20.74:     self.doc_idx, self.sample_idx, self.shuffle_idx = _build_index_mappings(
172.31.20.74:   File "/home/ubuntu/gpt-neox/megatron/data/gpt2_dataset.py", line 231, in _build_index_mappings
172.31.20.74:     doc_idx = np.load(doc_idx_filename, allow_pickle=True, mmap_mode="r")
172.31.20.74:   File "/home/ubuntu/.venv/lib/python3.9/site-packages/numpy/lib/npyio.py", line 427, in load
172.31.20.74:     fid = stack.enter_context(open(os_fspath(file), "rb"))
172.31.20.74: FileNotFoundError: [Errno 2] No such file or directory: 'data/enwik8/enwik8_text_document_train_indexmap_18304000ns_2048sl_1234s_doc_idx.npy'

The complete output is really long, but there is no error until this point.

The configurations are completely unmodified, except that I added a hostfile setting to the configuration.
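
For reference, the only change is a "hostfile" path in the config plus a standard DeepSpeed-style hostfile, roughly like this (the path is illustrative; one line per node, slots=1 since each server has a single T4):

    # added to configs/800M.yml (path is illustrative)
    "hostfile": "/home/ubuntu/gpt-neox/hostfile",

    # /home/ubuntu/gpt-neox/hostfile (DeepSpeed format, one line per node)
    172.31.20.74 slots=1
    172.31.21.158 slots=1
    ...  # and so on for the remaining nodes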

OS: Ubuntu 20.04
Python: 3.9

All nodes have separate filesystems. Each node can individually train the 19M config. Every node is accessible via ssh.

Expected behavior
Training proceeds without throwing this exception.

Proposed solution
N/A

Environment (please complete the following information):

  • GPUs: 8 servers, 1 NVIDIA T4 GPU on each server
  • Configs: configs/800M.yml

Do you have something at data/enwik8/enwik8_text_document_train_indexmap_18304000ns_2048sl_1234s_doc_idx.npy for the model to train on?

No. There was a similarly named file, though. I don't know where this filename derives from, since it doesn't appear in any configuration that I edited.

You should have a file called enwik8_text_document. The exact file you're referring to was created by the preprocessing script, and its suffix encodes the configuration that script was run with.
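
Concretely, the suffix encodes the sampling parameters (number of samples, sequence length, and random seed). Here is a rough sketch of how those cached index-map filenames get assembled, based on the megatron/data/gpt2_dataset.py code in your traceback (the helper name is made up for illustration):

    # Sketch: how _build_index_mappings derives its cache filenames
    # (illustrative; the real code in gpt2_dataset.py may differ slightly).
    def index_map_filename(data_prefix, name, num_samples, seq_length, seed):
        base = f"{data_prefix}_{name}_indexmap_{num_samples}ns_{seq_length}sl_{seed}s"
        return base + "_doc_idx.npy"  # _sample_idx.npy and _shuffle_idx.npy share this stem

    # index_map_filename("data/enwik8/enwik8_text_document", "train", 18304000, 2048, 1234)
    # -> 'data/enwik8/enwik8_text_document_train_indexmap_18304000ns_2048sl_1234s_doc_idx.npy'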

I would try removing all of the preprocessed files and running the preprocessing again.
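
Something along these lines on each node (the paths come from your traceback; the second command assumes your checkout has the bundled prepare_data.py helper and that enwik8 is the dataset you want):

    # remove the stale cached index maps (and optionally the preprocessed dataset itself)
    rm -f data/enwik8/enwik8_text_document_*_indexmap_*.npy
    # then regenerate the data, e.g. with the bundled helper
    python prepare_data.py -d ./data enwik8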

All nodes have separate filesystems. Each node can individually train the 19M config. Every node is accessible via ssh.

I think this is the issue: most clusters we run GPT-NeoX on have a shared filesystem across nodes.

I believe adding "use_shared_fs": False to your config should suffice to fix your issue! It causes local rank 0 on every node to build these intermediate .npy files.
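
In the yml config that would look something like this (placement within the file shouldn't matter):

    # build the index-map .npy files on every node instead of
    # assuming a filesystem that is shared across nodes
    "use_shared_fs": false,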

@obicons, please reopen if using "use_shared_fs": False doesn't resolve this for you.