vineeths96 / Spoken-Keyword-Spotting

In this repository, we explore a hybrid system consisting of a Convolutional Neural Network and a Support Vector Machine for the keyword spotting task.

keyword spotting in call recording

dimanshu opened this issue · comments

Can we spot a keyword in a call recording? How much data do we need?

Can we spot a keyword in a call recording?

If you set up the proper data pipelines, yes. This is a real-time KWS system, and your call recording is no different from any other continuous speech signal. So if you feed your call data to the trained model, it can spot the keyword.
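As a rough illustration, spotting a keyword in a recording amounts to sliding a short analysis window over the signal and scoring each window with the trained model. The sketch below is not the repo's actual pipeline: `score_window` is a dummy stand-in for the CNN/SVM model, and the window and hop sizes are assumptions.

```python
# Sliding-window keyword spotting sketch (hypothetical names throughout).
import numpy as np

SAMPLE_RATE = 16000          # Speech Commands audio is 16 kHz
WINDOW = SAMPLE_RATE         # 1-second analysis window
HOP = SAMPLE_RATE // 4       # slide by 250 ms

def score_window(window: np.ndarray) -> float:
    """Placeholder for the trained model; returns a keyword score."""
    return float(np.mean(np.abs(window)) > 0.5)  # dummy energy check

def spot_keyword(signal: np.ndarray, threshold: float = 0.5):
    """Return start times (seconds) of windows where the keyword fires."""
    hits = []
    for start in range(0, max(1, len(signal) - WINDOW + 1), HOP):
        if score_window(signal[start:start + WINDOW]) >= threshold:
            hits.append(start / SAMPLE_RATE)
    return hits

# A 3-second dummy "call": silence, a loud 1-second burst, silence.
call = np.zeros(3 * SAMPLE_RATE)
call[SAMPLE_RATE:2 * SAMPLE_RATE] = 1.0
print(spot_keyword(call))
```

The hop size trades detection latency against compute: a smaller hop catches keywords sooner but scores more windows per second of audio.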

How much data do we need?

That depends on a variety of factors, such as the keywords themselves and how they are pronounced. Right now, this model is trained on the Google Speech Commands dataset for the keyword 'Marvin', using 1700+ positive samples. If you have enough recordings for your keyword, you will be good to go. If you train your model on a larger dataset such as Mozilla Common Voice, you can learn acoustic embeddings, which would allow you to use any arbitrary word as your keyword (even words not in the dataset).

I have data from real calls. Right now I have 30 positive words.
Can I make data by changing the pitch and speed of the 30 positive words?
Or can I create data through the Google TTS API plus the 30 real samples?
What would be the best way?
And yes, for negative keywords I have 20-30k recordings, so I can create enough data.

By 30 positive words, if you mean 30 different keywords, I am not sure you will get the expected performance. Normally the KWS task is limited to a few keywords (say one or two: "OK Google", "Hey Siri", etc.). If you mean 30 positive samples of the same keyword, you can try, but I would suggest getting more data.

Can I make data by changing the pitch and speed of the 30 positive words?

Yes, you possibly can make such a dataset, but the question is what you would gain from it. You can augment the dataset with minor variations in pitch/speed, but it is very unlikely that a person would speak at, say, a pitch of 500 Hz. You should also consider the features you are planning to use for your model. If you use log-mel filterbank energies or MFCCs, this augmentation will not help.
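For illustration, here is a minimal speed-change augmentation done by index resampling. This is a numpy-only stand-in for the usual tools (e.g. librosa's `time_stretch`/`pitch_shift`); note that naive resampling changes speed and pitch together, like playing a tape faster, so only small rates produce plausible speech.

```python
# Naive speed/pitch augmentation sketch (toy stand-in, not the repo's code).
import numpy as np

def change_speed(signal: np.ndarray, rate: float) -> np.ndarray:
    """Resample by linear interpolation; rate > 1 shortens the clip
    (faster, higher-pitched), rate < 1 lengthens it (slower, lower)."""
    n_out = int(len(signal) / rate)
    idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(idx, np.arange(len(signal)), signal)

clip = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
augmented = [change_speed(clip, r) for r in (0.9, 1.0, 1.1)]
print([len(a) for a in augmented])  # faster clips come out shorter
```

To stretch time without shifting pitch (or vice versa), you would need a phase-vocoder-style method such as librosa's, rather than plain resampling.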

Can I create data through the Google TTS API plus the 30 real samples?

I am not sure of this, but I am certain the variability in the data would be quite low.

What would be the best way?

The best way would ideally be to create a quality dataset for your keywords by recording/scraping. You can also experiment with tools like Gentle, which annotates audio for you.

Okay, thanks. But can I use your model to find 50 different keywords in a call recording, like feedback, rating, or foul language in a call?
I will collect all the data for these 50 keywords. Will that work?

I think it should work.

The way I would try would be to build a generic feature extractor (similar to the CNN model in this repo), trained on positive as well as negative samples - the more the better. On top of that, I would create an ensemble model, with each unit trying to detect the presence of a single keyword. By comparing the outputs of the models, we could detect which particular keyword, if any, is present.

I cannot guarantee that this would work, but this would be a good approach.
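A minimal sketch of that ensemble idea, with everything hypothetical: a hand-crafted embedding stands in for the shared CNN extractor, and made-up linear detectors stand in for per-keyword classifiers (e.g. SVMs, as in this repo's hybrid design).

```python
# One shared feature extractor, one binary detector per keyword (toy sketch).
import numpy as np

def extract_features(window: np.ndarray) -> np.ndarray:
    """Placeholder for the shared CNN embedding: mean, std, zero-crossings."""
    zcr = np.mean(np.abs(np.diff(np.sign(window)))) / 2
    return np.array([np.mean(window), np.std(window), zcr])

# (weight vector, threshold) per keyword; real values would come from training.
DETECTORS = {
    "feedback": (np.array([1.0, 0.0, 0.0]), 0.5),
    "rating":   (np.array([0.0, 1.0, 0.0]), 0.5),
    "foul":     (np.array([0.0, 0.0, 1.0]), 0.5),
}

def spot(window: np.ndarray):
    """Return the keyword whose detector scores highest above its
    threshold, or None if no detector fires."""
    feats = extract_features(window)
    best, best_margin = None, 0.0
    for keyword, (weights, threshold) in DETECTORS.items():
        margin = float(weights @ feats) - threshold
        if margin > best_margin:
            best, best_margin = keyword, margin
    return best

window = np.ones(100)  # constant signal: mean 1, std 0, zero zero-crossings
print(spot(window))    # only the "feedback" toy detector fires here
```

Extending to 50 keywords just means 50 detectors over the same shared embedding, so the expensive extractor runs once per window.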

Closing issue due to inactivity. Please reopen the issue if necessary.

@vineeths96 How can I check an audio file through this model? I don't want to stream.

What should be the length of the audio file and the sample rate?