stanford-crfm / mistral

Mistral: A strong, northwesterly wind: Framework for transparent and accessible large-scale language model training, built with Hugging Face 🤗 Transformers.

Indexed Dataset caches contain absolute path references

jthickstun opened this issue

Describe the bug

Cached dataset artifacts appear to reference absolute paths, which makes it difficult to transfer caches across machines. If we create caches on one machine and later copy them to a new machine (with a different path), the DataLoaders crash at the beginning of training, raising FileNotFoundError with references to absolute paths to parquet files on the original machine.

To Reproduce

Start a training run on machine A, specifying the following cache directory in the json config:

artifacts: cache_dir: /machineA/scr0/username/artifacts

Kill the run after preprocessing is complete. Copy the cache directory to machine B:

cp -r /machineA/scr0/username/artifacts /machineB/scr0/username/artifacts

Start a training run on machine B using the new cache directory:

artifacts: cache_dir: /machineB/scr0/username/artifacts

The copy of the cache is found and preprocessing is skipped, but training subsequently crashes when the DataLoaders throw FileNotFoundError looking for files on machine A:

  File "[...]/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "[...]/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
  File "[...]/lib/python3.8/site-packages/torch/utils/data/_typing.py", line 366, in wrap_generator
    response = gen.send(None)
  File "[...]/mistral/src/corpora/tokenization_utils.py", line 114, in __iter__
    for x in self.datapipe:
  File "[...]/lib/python3.8/site-packages/torch/utils/data/_typing.py", line 366, in wrap_generator
    response = gen.send(None)
  File "[...]/mistral/src/corpora/indexer.py", line 61, in __iter__
    for entry in read_cache_file(file_name, flatten=True):
  File "[...]/mistral/src/corpora/indexer.py", line 153, in read_cache_file
    for b in pq.read_table(file).to_batches():
  File "[...]/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 2737, in read_table
    dataset = _ParquetDatasetV2(
  File "[...]/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 2351, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
  File "[...]/lib/python3.8/site-packages/pyarrow/dataset.py", line 694, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "[...]/lib/python3.8/site-packages/pyarrow/dataset.py", line 439, in _filesystem_dataset
    fs, paths_or_selector = _ensure_single_source(source, filesystem)
  File "[...]/lib/python3.8/site-packages/pyarrow/dataset.py", line 415, in _ensure_single_source
    raise FileNotFoundError(path)
FileNotFoundError: /machineA/scr0/username/artifacts/gpt2-micro-processed/gpt2-src/corpora/gpt2.py/preprocessing/tokenization/train-tokenized/docs-0.parquet
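
For reference, the failure can be reproduced in isolation with pyarrow alone. This is only an illustration of the root cause (the stale absolute path below is the one recorded in the copied cache), not Mistral's own code:

    import pyarrow.parquet as pq

    # Illustration only: the copied cache's index still points at machine A's path,
    # so reading it on machine B raises FileNotFoundError even though the parquet
    # file was copied to the corresponding /machineB/... location.
    stale_path = (
        "/machineA/scr0/username/artifacts/gpt2-micro-processed/"
        "gpt2-src/corpora/gpt2.py/preprocessing/tokenization/train-tokenized/docs-0.parquet"
    )
    for batch in pq.read_table(stale_path).to_batches():  # FileNotFoundError on machine B
        pass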

Expected behavior

Training should not crash.

Additional context

I've always specified artifacts: cache_dir: as an absolute path that includes the machine name. A possible workaround is to use a relative path, or an absolute path that is consistent across machines: e.g., on the internal cluster a path prefixed with /scr/biggest might work (though I'm not sure whether that mount point exists on all machines).
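
For example, pointing both configs at artifacts: cache_dir: /scr/biggest/username/artifacts would keep the absolute paths identical on A and B (assuming that mount point exists everywhere, which is the open question above).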

OK, I'll change it to be relative to the ledger file, and I think that should fix everything.
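
For concreteness, a minimal sketch of that approach, assuming the ledger stores per-file names; the helper name and signature below are hypothetical, not the actual indexer API:

    import os

    def resolve_cache_file(ledger_path: str, file_name: str) -> str:
        # Hypothetical helper: if ledger entries are stored relative to the ledger's
        # own directory, they keep resolving correctly after the cache directory is
        # copied to a different absolute location (e.g. machine A -> machine B).
        if os.path.isabs(file_name):
            # Legacy ledgers with absolute entries still work on the original machine.
            return file_name
        return os.path.join(os.path.dirname(ledger_path), file_name)

Writing entries with os.path.relpath(file_path, os.path.dirname(ledger_path)) at cache-creation time would be the other half of that change.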

Confirmed! Yes, that ought to fix it. I just tried a manual find/replace on paths in the ledger file and kicked off a new run: everything is running as expected.
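
For anyone who hits this before the fix lands, the stop-gap above amounts to rewriting the stale prefix in the copied ledger. A rough sketch, assuming the ledger is a text/JSON file (its exact filename is a placeholder below):

    import pathlib

    # Hypothetical stop-gap: replace the machine-A prefix recorded in the copied
    # ledger with the machine-B prefix. Back up the ledger before editing it.
    ledger = pathlib.Path("/machineB/scr0/username/artifacts/<ledger file>")
    text = ledger.read_text()
    ledger.write_text(
        text.replace("/machineA/scr0/username/artifacts",
                     "/machineB/scr0/username/artifacts")
    )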