Probability of predicting the Hot word

Question

thatsri9ht opened this issue 5 months ago

Activating a wake word ultimately comes down to a probability between zero and one, and we typically obtain this probability, or score, from a model. For wake words like "Alexa," as far as I understand, the system becomes active once it detects a certain number of phonemes, regardless of whether the word "Alexa" is fully articulated.

For example, "Hey Alexa" activates the system, regardless of whether the word "Alexa" is fully pronounced.

Is there a solution so that the system becomes active only after the complete word is spoken? For example, "Hey-Mycroft" activates the system, while "Hey-Mycr" does not.
Is the average probability of several consecutive frames calculated?

dscripka · Answer 1 · Sat May 25 2024 09:36:51 GMT+0800 (China Standard Time)

The models predict on an entire sequence of frames, whose length depends on the input size used when the model was trained. Typically this sequence represents between 1 and 1.5 seconds, and a single score is returned for the entire sequence, so individual phonemes are not captured or predicted. This entire window then advances by a fixed amount (typically 80 ms) and the score for the sequence is predicted again.
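To make the windowing concrete, here is a minimal sketch of that loop. It is not the library's actual code: the 16 kHz sample rate, the 1.5 s window, and the `placeholder_score` function are assumptions chosen for illustration, and a real model would replace the placeholder.

```python
from typing import Callable, Iterator, List, Sequence

SAMPLE_RATE = 16_000   # assumed sample rate in Hz
WINDOW_MS = 1500       # assumed window length (upper end of the 1-1.5 s range)
HOP_MS = 80            # window advance described above


def sliding_windows(audio: Sequence[float], window: int, hop: int) -> Iterator[Sequence[float]]:
    """Yield every full-length window, advancing by `hop` samples each time."""
    for start in range(0, len(audio) - window + 1, hop):
        yield audio[start:start + window]


def placeholder_score(window: Sequence[float]) -> float:
    """Hypothetical stand-in for a trained model: mean absolute amplitude, capped at 1."""
    return min(1.0, sum(abs(x) for x in window) / len(window))


def score_stream(audio: Sequence[float],
                 score_fn: Callable[[Sequence[float]], float],
                 window: int, hop: int) -> List[float]:
    """Return one score per window position; each score covers the whole window."""
    return [score_fn(w) for w in sliding_windows(audio, window, hop)]


window = SAMPLE_RATE * WINDOW_MS // 1000   # 24000 samples
hop = SAMPLE_RATE * HOP_MS // 1000         # 1280 samples
audio = [0.0] * (3 * SAMPLE_RATE)          # 3 s of silence as dummy input
scores = score_stream(audio, placeholder_score, window, hop)
```

Each entry in `scores` describes a whole 1.5 s window, so nothing in the output says which part of the phrase contributed to it. That is why a truncated word can still score highly.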

This does mean it can be difficult to prevent the model from activating on only a partial word. However, you can add examples of these partial words as adversarial negative data (with a custom training config), which can help reduce this behavior.
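As a rough illustration of how such partial phrases might be enumerated, the sketch below truncates the last word of a wake phrase. The helper `partial_phrases` and its `min_chars` parameter are hypothetical, text prefixes only approximate where speech would actually be cut off, and how the resulting phrases are turned into audio and passed to training depends on the training config, which is not shown here.

```python
from typing import List


def partial_phrases(phrase: str, min_chars: int = 3) -> List[str]:
    """Return the phrase with its last word cut to each proper prefix of at least `min_chars` letters."""
    words = phrase.split()
    if not words:
        return []
    head, last = " ".join(words[:-1]), words[-1]
    partials = []
    for n in range(min_chars, len(last)):  # stops before the full word
        partials.append(f"{head} {last[:n]}".strip())
    return partials


# Candidate negative phrases for the wake phrase from the question.
negatives = partial_phrases("hey mycroft")
```

The complete phrase is deliberately excluded, so it can never be added as a negative example by mistake.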

thatsri9ht · Answer 2 · Sat May 25 2024 20:15:43 GMT+0800 (China Standard Time)

Thank you:X