huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.

Unable to set concatenate_audio parameter to False in run_pseudo_labelling.py

lq0104 opened this issue

There appears to be a bug when the concatenate_audio parameter is set to False in run_pseudo_labelling.py: the script fails with the following error.

06/03/2024 04:52:57 - INFO - __main__ -
Traceback (most recent call last):
  File "/home/code/distil-whisper/training/run_pseudo_labelling.py", line 1040, in <module>
    main()
  File "/home/code/distil-whisper/training/run_pseudo_labelling.py", line 1023, in main
    eval_step_with_save(split=split)
  File "/home/code/distil-whisper/training/run_pseudo_labelling.py", line 1006, in eval_step_with_save
    raw_datasets[split] = raw_datasets[split].map(
  File "/opt/conda/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3066, in map
    raise ValueError(
ValueError: Input column ['condition_on_prev'] not in the dataset. Current columns in the dataset: ['id', 'path', 'audio', 'transcription', 'duration', 'language', 'original_speaker_id', 'session_id', 'topic', 'whisper_transcript', 'eval_preds']

Is there something I missed?
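
For anyone trying to narrow this down: the ValueError says that the map call was asked to read a column, condition_on_prev, that the dataset does not contain. The sketch below is plain Python, not code from run_pseudo_labelling.py. It shows one way to check which requested columns are actually present before passing them on. The helper name filter_input_columns and the requested_columns list are hypothetical. The assumption that condition_on_prev only exists when audio concatenation is enabled is a guess based on the traceback, not something confirmed against the script.

```python
def filter_input_columns(requested, available):
    """Split requested column names into those present and those missing.

    Hypothetical helper for illustration; not part of distil-whisper or datasets.
    """
    present = [name for name in requested if name in available]
    missing = [name for name in requested if name not in available]
    return present, missing


# Column names copied from the error message in this issue.
dataset_columns = [
    "id", "path", "audio", "transcription", "duration", "language",
    "original_speaker_id", "session_id", "topic", "whisper_transcript",
    "eval_preds",
]

# Assumed for illustration: the columns the failing map call asks for.
requested_columns = ["eval_preds", "condition_on_prev"]

present, missing = filter_input_columns(requested_columns, dataset_columns)
print("present:", present)
print("missing:", missing)
```

If missing is non-empty, the map call would fail as shown in the traceback, so a guard like this could be used either to drop the absent column or to raise a clearer error earlier.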