Srijith-rkr / Whispering-LLaMA

EMNLP 23 - Integrating Whisper Encoder to LLaMA Decoder for Generative ASR Error Correction


How do we get tokenizer_model

yangyyt opened this issue · comments

How do we get this tokenizer_model and prepare data?

I uploaded the tokenizer_model here: https://huggingface.co/Srijith-rkr/Whispering-LLaMA/tree/main

I have also added the Alpaca model weights in the repo. Once you download them, you can merge them together into the final LLM checkpoint.

Something like:
a = torch.load("alpaca_a.pth")
b = torch.load("alpaca_b.pth")
c = torch.load("alpaca_c.pth")
lit_llama = a | b | c  # merge the shard dicts into the final checkpoint
torch.save(lit_llama, "[Mention path to Dir]")
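For reference, the `|` operator on dicts (Python 3.9+) returns a new dict containing the keys of both operands, with the right-hand side winning on any duplicate key. A minimal sketch of the merge step with placeholder state dicts (the layer names and values below are illustrative, not the actual Alpaca checkpoint keys):

```python
# Placeholder shards standing in for torch.load("alpaca_a.pth") etc.
# Real shards map parameter names to tensors; lists are used here
# only to keep the sketch dependency-free.
shard_a = {"transformer.h.0.attn.weight": [0.1, 0.2]}
shard_b = {"transformer.h.1.attn.weight": [0.3, 0.4]}
shard_c = {"lm_head.weight": [0.5, 0.6]}

# Dict union keeps all keys; on a collision, the rightmost dict wins.
merged = shard_a | shard_b | shard_c
print(sorted(merged))  # all three parameter names in one state dict
```

If the shards partition the model's parameters (no overlapping keys), the union is simply their concatenation, which is what the checkpoint merge above relies on.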

You can also check out the notebooks at https://github.com/Srijith-rkr/Whispering-LLaMA/tree/main/data_preparation to figure out how to prepare your custom dataset.