pyarrow dependency should be >= 7.0.0
yifanmai opened this issue
Describe the bug
Mistral uses pyarrow.parquet.ParquetWriter.write_batch, which was added in pyarrow 7.0. See docs for 6.0 and docs for 7.0.
To Reproduce
When following the instructions to train GPT-2 Micro, I get the following:
Traceback (most recent call last):
  File "train.py", line 265, in <module>
    train()
  File "train.py", line 123, in train
    custom_eval_datasets, lm_dataset = load_datasets(quinfig, paths, tokenizer, overwatch)
  File "train.py", line 195, in load_datasets
    lm_dataset = build_indexed_dataset(
  File "/home/yifanmai/oss/mistral/src/corpora/auto.py", line 84, in build_indexed_dataset
    out_datasets[k] = IndexedDataset.build_or_load(token_iter, post_tokenization_cache_files[k], seq_len, stride)  # type: ignore
  File "/home/yifanmai/oss/mistral/src/corpora/indexer.py", line 125, in build_or_load
    current_writer.write_batch(batch)
AttributeError: 'ParquetWriter' object has no attribute 'write_batch'
The installed version of pyarrow was 5.0.0.
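For anyone who wants to confirm whether their environment is affected before starting a run, a quick check along these lines might help. This is only a sketch, not part of Mistral: the helper names (`release_tuple`, `meets_minimum`, `check_pyarrow`) are made up for illustration, the 7.0.0 minimum is taken from this report, and the version parsing is deliberately simple (it looks only at the leading numeric release segment).

```python
import re
from importlib import metadata

# Minimum version taken from this issue report (write_batch was added in 7.0).
MIN_PYARROW = "7.0.0"


def release_tuple(version: str) -> tuple:
    """Parse the leading numeric release segment, e.g. '7.0.0.dev1' -> (7, 0, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return (0, 0, 0)
    parts = [int(p) for p in match.group(0).split(".")][:3]
    # Pad short versions such as '7.0' so tuple comparison behaves as expected.
    return tuple(parts + [0] * (3 - len(parts)))


def meets_minimum(installed: str, minimum: str = MIN_PYARROW) -> bool:
    """Return True if the installed version is at least the minimum."""
    return release_tuple(installed) >= release_tuple(minimum)


def check_pyarrow() -> str:
    """Describe whether the installed pyarrow satisfies the assumed minimum."""
    try:
        installed = metadata.version("pyarrow")
    except metadata.PackageNotFoundError:
        return "pyarrow is not installed"
    if meets_minimum(installed):
        return f"pyarrow {installed} satisfies >= {MIN_PYARROW}"
    return f"pyarrow {installed} is older than {MIN_PYARROW}; write_batch may be missing"


if __name__ == "__main__":
    print(check_pyarrow())
```

With the 5.0.0 install reported above, `meets_minimum("5.0.0")` evaluates to False, which matches the AttributeError in the traceback.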
Expected behavior
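Training should get past dataset indexing without raising an AttributeError. Requiring pyarrow >= 7.0.0 in the project's dependencies would prevent an incompatible version from being installed.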