KeyError: 'seq_len' when using data_collator with finetune data
michael0905 opened this issue · comments
After generating the training data with tokenize_dataset_rows, loading the dataset directly shows both input_ids and seq_len:
dataset = datasets.load_from_disk(args.dataset_path)
print(f"\n{len(dataset)=}\n")
for key in dataset[0]:
print(key)
But when the data is read through data_collator, it raises an error:
File "/checkpoint/binary/train_package/finetune.py", line 125, in <module>
main()
File "/checkpoint/binary/train_package/finetune.py", line 118, in main
trainer.train()
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1872, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
return self.collate_fn(data)
File "/checkpoint/binary/train_package/finetune.py", line 28, in data_collator
seq_len = feature["seq_len"]
KeyError: 'seq_len'
Is this a package version mismatch, or is something else wrong?
Re-run the tokenizer instead of reusing the existing data.
#66 (comment)
Running python finetune.py --output_dir output --remove_unused_columns False fixes it.
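The flag matters because, by default, the Hugging Face Trainer (remove_unused_columns=True) drops every dataset column whose name is not an argument of the model's forward method, before the batch ever reaches the collator. Since seq_len is not a forward argument, it gets stripped, and the custom data_collator then fails with KeyError. The sketch below illustrates the pruning behavior with plain Python; prune_columns and forward are hypothetical names for illustration, not the actual Trainer internals:

```python
import inspect

def prune_columns(features, forward_fn):
    # Mimics what Trainer does with remove_unused_columns=True: drop every
    # feature key that is not a parameter of the model's forward signature.
    allowed = set(inspect.signature(forward_fn).parameters)
    return [{k: v for k, v in f.items() if k in allowed} for f in features]

def forward(input_ids, labels=None):  # hypothetical forward signature
    pass

features = [{"input_ids": [1, 2, 3], "seq_len": 3}]
pruned = prune_columns(features, forward)
# "seq_len" is pruned away, which is why the collator later raises KeyError
```

Passing --remove_unused_columns False (or remove_unused_columns=False in TrainingArguments) disables this pruning so the collator can still see seq_len.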