Srijith-rkr / Whispering-LLaMA

EMNLP 23 - Integrating Whisper Encoder to LLaMA Decoder for Generative ASR Error Correction


Dataset Issue

heyLQ opened this issue

Could you please provide "/ibex/user/radhaks/LLMs/LLaMA_7B/LLAMA_EMNLP_DeepSpeed/dataset/inferences/gigaspeech_TRAIN.json", referenced in the second line of "To generate audio features"? Thanks!

Hey there,

Sorry for the delay in replying. I have uploaded the hypotheses dataset we used for the paper to Hugging Face here: https://huggingface.co/datasets/PeacefulData/HyPoradise-v1-GigaSpeech. The "To generate audio features" notebook simply adds the audio features (from the Whisper encoder) to this JSON/CSV file and saves the result as a .pt checkpoint. You will need the paths to the audio files to generate the audio features; you can map them using the ID tags in the Hugging Face dataset. A rough sketch of the idea is below.
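For illustration only, here is a minimal sketch of that step: load the hypotheses file, run each utterance's audio through the Whisper encoder, attach the features to the entry, and save everything as a .pt checkpoint. The model size, the field names ("audio_path", "audio_features"), and the file names are assumptions; adapt them to the actual keys in the HyPoradise JSON/CSV and to the notebook in the repo.

```python
import json
import torch
import whisper  # openai-whisper

# Whisper model whose encoder produces the audio features
# (the model size here is an assumption).
model = whisper.load_model("tiny.en")

# Hypotheses file downloaded from Hugging Face; field names below
# are placeholders, not the actual dataset schema.
with open("gigaspeech_TRAIN.json") as f:
    entries = json.load(f)

dataset = []
for entry in entries:
    # Map the dataset's ID tag to the audio file path on your machine.
    audio_path = entry["audio_path"]

    # Standard Whisper preprocessing: load, pad/trim to 30 s, log-Mel spectrogram.
    audio = whisper.load_audio(audio_path)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # Run only the encoder to get the audio feature sequence.
    with torch.no_grad():
        features = model.encoder(mel.unsqueeze(0)).squeeze(0).cpu()

    entry["audio_features"] = features
    dataset.append(entry)

# Save the enriched dataset as a single .pt checkpoint for training.
torch.save(dataset, "gigaspeech_TRAIN_with_features.pt")
```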

You can also generate your own JSON file for a custom dataset by following the https://github.com/Srijith-rkr/Whispering-LLaMA/blob/main/data_preparation/To%20generate%20n-best%20hypothesis.ipynb notebook.
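One simple way to produce multiple hypotheses per utterance, if you want a quick starting point before following the notebook, is to decode the same audio several times with temperature sampling so the Whisper decoder returns different candidates on each pass. This is just an assumption-laden sketch (model size, number of hypotheses, and temperature are all illustrative), not the exact procedure used in the notebook:

```python
import whisper

model = whisper.load_model("tiny.en")  # model size/variant is an assumption

audio = whisper.load_audio("example.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Sample several candidate transcripts at a non-zero temperature so the
# decoder produces varied hypotheses across passes.
hypotheses = []
for _ in range(5):
    options = whisper.DecodingOptions(
        temperature=0.7,
        without_timestamps=True,
        fp16=False,  # set True when running on GPU
    )
    result = whisper.decode(model, mel, options)
    hypotheses.append(result.text)

print(hypotheses)
```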