BriansIDP / WhisperBiasing

what is f15?

c-arvind opened this issue · comments

While going through the train.sh file I noticed that a lot of the referenced files had _f15 in their names, like train_clean_100_f15.json or dev_f15.json, and I didn't understand what this meant.

P.S. Also, can you explain why train_clean_100_error.json in data/LibriSpeech has 'error' in the filename when there is no code that appends this keyword? Is it usable or not?

Hi. Thanks for the question.

f15 means the biasing list contains words that appear fewer than 15 times in the training set. _error means the biasing list is obtained by finding words with frequency less than 15 that also have a high WER (calculated by decoding the training set). This is referred to as the error-based biasing list.
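
To make the f15 criterion concrete, here is a minimal sketch of how such a rare-word list could be built from training transcripts. The function name and the plain-text transcript format are assumptions for illustration; the repo's actual data-prep scripts may work differently (e.g. on JSON manifests).

```python
from collections import Counter

def build_rare_word_list(transcripts, max_count=15):
    """Collect words that appear fewer than `max_count` times
    across all transcripts (the "f15" idea when max_count=15)."""
    counts = Counter(w for line in transcripts for w in line.split())
    return sorted(w for w, c in counts.items() if c < max_count)

if __name__ == "__main__":
    # Toy transcripts standing in for the LibriSpeech training text.
    transcripts = ["the cat sat", "the dog ran", "xylophone solo"]
    print(build_rare_word_list(transcripts, max_count=15))
```

The error-based (_error) list would additionally filter this output by per-word WER measured on a decode of the training set.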

Please let me know if that answers the question.

So basically you aim to improve the model's accuracy by re-introducing words that occur with low frequency (fewer than 15 times) in your training data, right? Please correct me if I'm wrong.

This brings me to my next question: is it possible to introduce a list of words that are not even in the training data? For example, let's say my training dataset has words like ['India', 'Canada', 'France'] which are uttered in the audio clips, but I also want the model to recognize ['Russia', 'Germany'] as they are also countries, even though they were never used in the training data. Since the weights of Whisper are frozen, I can only hope to introduce these words to TCPGen, in the hope that it corrects Whisper's output by learning from the external list I pass.

Yes, that is one feature of TCPGen. The f15 list is used for training, and during inference you can add any other words you want to the biasing list, including unseen words. In fact, during testing I always include words that are unseen in training but exist in the test set. The paper includes an analysis of the results on unseen words as well.
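
As a sketch of what "adding unseen words at inference" could look like in practice: assuming the biasing list is stored as a flat JSON array of words (an assumption; the repo's actual file format may differ), merging in new entries is just a set union.

```python
import json

def extend_biasing_list(path_in, path_out, extra_words):
    """Merge extra (possibly unseen) words into an existing
    biasing list stored as a JSON array of strings."""
    with open(path_in) as f:
        words = json.load(f)
    merged = sorted(set(words) | set(extra_words))
    with open(path_out, "w") as f:
        json.dump(merged, f, indent=2)
    return merged
```

The extended list is then passed to decoding in place of the original one; no retraining is needed since TCPGen conditions on the list at inference time.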

No, I think you misunderstood my question. I'm asking if it's possible to add words that have no acoustic events, or in other words, that occur in neither the train set nor the test set. For example, I have used LibriSpeech clean train and test, but let's say I want the model to now interpret a word like 'sh*thousery', which exists in neither dataset; can I simply add this string to the biasing list and expect TCPGen to use it?

So do you mean that even if that word is not in the utterance you want to transcribe, by adding that specific word to the biasing list, Whisper could transcribe the word as "sh*thousery" rather than the correct transcription?

Pretty much. Whisper has a chance of getting the word wrong, or transcribing something like 'sh*t house', so I would ideally want my added lexicon to be given preference over Whisper's inference (or rather, by the TCPGen network).
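
For context on "given preference": TCPGen-style biasing interpolates the frozen model's output distribution with a pointer distribution over the biasing list, weighted by a learned gate. The sketch below shows only that final mixing step in simplified form; in the real model the gate is computed from the decoder state and the pointer distribution comes from a prefix tree over the biasing words, so the variable names here are illustrative.

```python
def interpolate(p_model, p_pointer, gate):
    """Mix the frozen model's distribution with the biasing
    (pointer) distribution; gate=0 keeps the model's output,
    gate=1 fully trusts the biasing list."""
    return [(1.0 - gate) * m + gate * p
            for m, p in zip(p_model, p_pointer)]
```

So how strongly the biasing list is preferred is not a hard override: the gate is learned, which is why adding a word with no acoustic support may still not force it into the transcription.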