pyarrow dependency should be >= 7.0.0

yifanmai opened this issue 2 years ago

Describe the bug

Mistral uses pyarrow.parquet.ParquetWriter.write_batch, which was added in pyarrow 7.0. See docs for 6.0 and docs for 7.0.
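
As a stopgap until the requirement is bumped, the missing method can be detected at runtime. This is only a sketch of a possible workaround, not Mistral's code: `write_batch_compat` is a hypothetical helper name, and the fallback assumes that wrapping the batch in a one-batch table and calling `write_table` is acceptable for the caller.

```python
def write_batch_compat(writer, batch):
    """Write one RecordBatch, tolerating pyarrow releases without write_batch.

    Hypothetical helper for illustration; `writer` is expected to be a
    pyarrow.parquet.ParquetWriter and `batch` a pyarrow.RecordBatch.
    """
    if hasattr(writer, "write_batch"):
        # pyarrow >= 7.0: the method exists, use it directly.
        writer.write_batch(batch)
    else:
        # Older pyarrow: wrap the batch in a single-batch Table instead.
        import pyarrow as pa  # imported lazily; only needed on this path

        writer.write_table(pa.Table.from_batches([batch]))
```

Pinning the dependency is still the cleaner fix, since it keeps a single code path.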

To Reproduce

When following the instructions to train GPT-2 Micro, I get the following:

Traceback (most recent call last):
  File "train.py", line 265, in <module>
    train()
  File "train.py", line 123, in train
    custom_eval_datasets, lm_dataset = load_datasets(quinfig, paths, tokenizer, overwatch)
  File "train.py", line 195, in load_datasets
    lm_dataset = build_indexed_dataset(
  File "/home/yifanmai/oss/mistral/src/corpora/auto.py", line 84, in build_indexed_dataset
    out_datasets[k] = IndexedDataset.build_or_load(token_iter, post_tokenization_cache_files[k], seq_len, stride)  # type: ignore
  File "/home/yifanmai/oss/mistral/src/corpora/indexer.py", line 125, in build_or_load
    current_writer.write_batch(batch)
AttributeError: 'ParquetWriter' object has no attribute 'write_batch'

The installed version of pyarrow was 5.0.0.
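
A startup check along these lines would turn the AttributeError into a clearer message. It is a rough sketch under stated assumptions: the function names and `MIN_PYARROW` are made up for illustration, and the parser only looks at the first three numeric components of the version string.

```python
from importlib import metadata

# Assumed minimum, based on write_batch being added in pyarrow 7.0.
MIN_PYARROW = (7, 0, 0)


def parse_version(version):
    """Return the first three numeric components of a version string."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as "7" or "7.0" with zeros.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def pyarrow_is_new_enough(version):
    """True if `version` is at least MIN_PYARROW."""
    return parse_version(version) >= MIN_PYARROW


def check_installed_pyarrow():
    """Raise a descriptive error if the installed pyarrow is missing or too old."""
    try:
        installed = metadata.version("pyarrow")
    except metadata.PackageNotFoundError:
        raise RuntimeError("pyarrow is not installed")
    if not pyarrow_is_new_enough(installed):
        raise RuntimeError(
            f"pyarrow {installed} is too old; ParquetWriter.write_batch "
            "needs pyarrow >= 7.0.0"
        )
```

With the version reported here, `pyarrow_is_new_enough("5.0.0")` is false, so `check_installed_pyarrow()` would fail before any tokenization work starts.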

Expected behavior

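Training should run without this error. The pyarrow requirement should be >= 7.0.0 so that ParquetWriter.write_batch is available.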