castorini / howl

Wake word detection modeling toolkit for Firefox Voice, supporting open datasets like Speech Commands and Common Voice.

Howl for different languages

codeghees opened this issue

I am currently building a pipeline for a research project that requires keyword spotting (KWS), and I am unsure which toolkit would be the better fit.

In our use case, we want to identify keywords in streams of audio data rather than in a wake-word setting. Can I use Howl for that purpose?
The model will be served via an API, and since it is supervised learning, we want to be able to readily add new words over time as well.

I think Howl should be sufficient for that.

Honk mainly targets keyword classification, while Howl supports keyword spotting over streams of audio with an extra inference (filtering) mechanism.
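To make the difference concrete, here is a rough sketch of what a filtering step over a stream could look like. This is not Howl's actual inference code; the function name `spot_keywords`, the `window` and `min_votes` parameters, and the majority-vote rule are all assumptions made up for illustration. It takes per-frame predicted labels (vocabulary indices, or `None` for "no keyword") and only reports a detection once each word in the target sequence has been seen often enough in a short window.

```python
from collections import deque


def spot_keywords(frame_labels, sequence, window=5, min_votes=3):
    """Return the frame indices at which the whole `sequence` was detected.

    Hypothetical illustration only, not Howl's implementation.
    frame_labels: per-frame predictions (vocab index or None).
    sequence:     vocab indices that must be heard in this order.
    window:       number of recent frames considered for smoothing.
    min_votes:    frames in the window that must agree on a word.
    """
    recent = deque(maxlen=window)  # sliding window of recent predictions
    position = 0                   # next word of the sequence we wait for
    detections = []
    for i, label in enumerate(frame_labels):
        recent.append(label)
        target = sequence[position]
        votes = sum(1 for seen in recent if seen == target)
        if votes >= min_votes:
            # The current word is confirmed; move on to the next one.
            position += 1
            recent.clear()
            if position == len(sequence):
                detections.append(i)
                position = 0
    return detections


if __name__ == "__main__":
    stream = [None, 0, 0, 0, None, 1, 1, 1, None]
    print(spot_keywords(stream, sequence=[0, 1]))
```

Isolated single-frame predictions are ignored because they never reach `min_votes`, which is the kind of smoothing a plain per-clip classifier does not give you on a continuous stream.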

Thank you. What steps would I need to change for Urdu keywords (our local language)?

hrm that's an interesting direction.

Adding a new dataset to the system can be done with a change similar to the one in https://github.com/castorini/howl/pull/31/files

However, I don't think a different language is something Howl supports.
The main limitation comes from the missing frame-level transcription.

@daemon do you know how one can support other languages?

@ljj7975 we can generate our own pronunciation dictionary using a method we developed in our lab. Would that help?

I am not that familiar with how the MFA (Montreal Forced Aligner) aligner actually works in such cases.
This would be something that you will need to dig into.

As long as you can generate data in the right format with the corresponding frame-level transcription, I don't see why not.

Hi, I think I was able to figure out MFA for Urdu. How do I go about supporting it?
@ljj7975
Any help is appreciated.

https://github.com/castorini/howl#preparing-a-dataset

It supports only one word - how do I support multiple?

As instructed in the README, you will first need to preprocess your raw datasets using create_raw_dataset.
You should generate one dataset for positive audio samples and one for negative audio samples.
Depending on how your raw dataset is structured, you might need to modify some files (just like the change in #31).

Then, using MFA with the Urdu dictionary, you can align the dataset to get datasets in the right format for Howl.

The instructions show just one keyword, but it works for many keywords. Just specify VOCAB='["fire"]' and INFERENCE_SEQUENCE=[0] accordingly.
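For several keywords, a small helper like the one below can build the two values. This is only a sketch: `howl_vocab_env` is a hypothetical helper, not part of Howl, and it assumes that VOCAB is a JSON list of words and that INFERENCE_SEQUENCE holds indices into that list, which is what the single-keyword example (`["fire"]` with `[0]`) suggests. Please check the README for the exact semantics before relying on it.

```python
import json


def howl_vocab_env(keywords):
    """Build VOCAB / INFERENCE_SEQUENCE strings for a list of keywords.

    Hypothetical helper. Assumes INFERENCE_SEQUENCE lists positions
    in VOCAB, in the same order as the keywords are given.
    """
    # ensure_ascii=False keeps non-Latin keywords (e.g. Urdu) readable.
    vocab = json.dumps(keywords, ensure_ascii=False, separators=(",", ":"))
    sequence = json.dumps(list(range(len(keywords))), separators=(",", ":"))
    return {"VOCAB": vocab, "INFERENCE_SEQUENCE": sequence}


if __name__ == "__main__":
    env = howl_vocab_env(["hey", "fire", "fox"])
    # Print in the KEY='value' form used on the command line.
    print(" ".join(f"{key}='{value}'" for key, value in env.items()))
```

The printed line can be placed in front of the training command in the same way as the single-keyword values.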