Srijith-rkr / Whispering-LLaMA

EMNLP 23 - Integrating Whisper Encoder to LLaMA Decoder for Generative ASR Error Correction


Dataset Issue

heyLQ opened this issue

Could you please provide "/ibex/user/radhaks/LLMs/LLaMA_7B/LLAMA_EMNLP_DeepSpeed/dataset/inferences/gigaspeech_TRAIN.json", referenced in the second line of "To generate audio features"? Thanks!

Hey there,

Sorry for the delay in replying. I have uploaded the hypotheses dataset we used for the paper to Hugging Face here: https://huggingface.co/datasets/PeacefulData/HyPoradise-v1-GigaSpeech. The "To generate audio features" notebook simply adds the audio features (from the Whisper encoder) to this JSON/CSV file and saves the result as a .pt checkpoint. You will need the paths to the audio files to generate the audio features; you can map them using the ID tags in the Hugging Face dataset. A rough sketch of the idea is below.
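For illustration only, here is a minimal sketch of that step: load the hypotheses file, run each utterance's audio through the Whisper encoder, attach the features to the entry, and save everything as a .pt checkpoint. The model size, the field names ("audio_path", "audio_features"), and the file names are assumptions; adapt them to the actual keys in the HyPoradise JSON/CSV and to the notebook in the repo.

```python
import json
import torch
import whisper  # openai-whisper

# Whisper model whose encoder produces the audio features
# (the model size here is an assumption).
model = whisper.load_model("tiny.en")

# Hypotheses file downloaded from Hugging Face; field names below
# are placeholders, not the actual dataset schema.
with open("gigaspeech_TRAIN.json") as f:
    entries = json.load(f)

dataset = []
for entry in entries:
    # Map the dataset's ID tag to the audio file path on your machine.
    audio_path = entry["audio_path"]

    # Standard Whisper preprocessing: load, pad/trim to 30 s, log-Mel spectrogram.
    audio = whisper.load_audio(audio_path)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # Run only the encoder to get the audio feature sequence.
    with torch.no_grad():
        features = model.encoder(mel.unsqueeze(0)).squeeze(0).cpu()

    entry["audio_features"] = features
    dataset.append(entry)

# Save the enriched dataset as a single .pt checkpoint for training.
torch.save(dataset, "gigaspeech_TRAIN_with_features.pt")
```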

You can also generate your own JSON file for a custom dataset by following the https://github.com/Srijith-rkr/Whispering-LLaMA/blob/main/data_preparation/To%20generate%20n-best%20hypothesis.ipynb notebook.
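One simple way to produce multiple hypotheses per utterance, if you want a quick starting point before following the notebook, is to decode the same audio several times with temperature sampling so the Whisper decoder returns different candidates on each pass. This is just an assumption-laden sketch (model size, number of hypotheses, and temperature are all illustrative), not the exact procedure used in the notebook:

```python
import whisper

model = whisper.load_model("tiny.en")  # model size/variant is an assumption

audio = whisper.load_audio("example.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Sample several candidate transcripts at a non-zero temperature so the
# decoder produces varied hypotheses across passes.
hypotheses = []
for _ in range(5):
    options = whisper.DecodingOptions(
        temperature=0.7,
        without_timestamps=True,
        fp16=False,  # set True when running on GPU
    )
    result = whisper.decode(model, mel, options)
    hypotheses.append(result.text)

print(hypotheses)
```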