Use CTC beam search decoder with subword encoding.

Question

DomainFlag opened this issue 2 years ago

I'm using the provided scorer generator, generate_scorer_package. I'm also using a subword tokenizer (e.g., SentencePiece) to build a unigram model, so the decoder's output size matches the size of that model's vocabulary. How can I adapt the scorer so that it supports subword units? Will the scorer work if I fill the alphabet file with the subword units? Or should I rely on a trick, such as encoding the unigram vocabulary with an ASCII table, re-encoding the corpus, and using an alphabet based on that mapping? Thank you.
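To make the re-encoding trick concrete, here is a minimal sketch of the mapping step. It assumes the corpus is already split into subword pieces (for example by a SentencePiece model), and it swaps the ASCII table for Unicode Private Use Area code points, since ASCII has too few characters for a typical subword vocabulary. The helper names are hypothetical, and whether generate_scorer_package accepts an alphabet built this way is not confirmed here.

```python
# Sketch only: helper names are hypothetical, not part of any library.
PUA_START = 0xE000  # first code point of the Unicode Private Use Area
PUA_END = 0xF8FF    # last code point of the basic Private Use Area


def build_mapping(pieces):
    """Assign each distinct subword piece one private-use character."""
    unique = list(dict.fromkeys(pieces))  # de-duplicate, keep first-seen order
    if len(unique) > PUA_END - PUA_START + 1:
        raise ValueError("vocabulary too large for the basic Private Use Area")
    return {piece: chr(PUA_START + i) for i, piece in enumerate(unique)}


def encode_pieces(pieces, piece_to_char):
    """Turn a sequence of subword pieces into a string, one character per piece."""
    return "".join(piece_to_char[p] for p in pieces)


def decode_chars(text, piece_to_char):
    """Invert encode_pieces: map each character back to its subword piece."""
    char_to_piece = {c: p for p, c in piece_to_char.items()}
    return [char_to_piece[c] for c in text]


# Assumed example vocabulary; a real one would come from the subword model.
vocab = ["▁he", "llo", "▁wor", "ld"]
mapping = build_mapping(vocab)

# Re-encode one tokenized sentence; the same would be done for every corpus line.
encoded = encode_pieces(["▁he", "llo", "▁wor", "ld"], mapping)

# One character per line, in label order, as an alphabet file would list them.
alphabet_lines = list(mapping.values())
```

The order of alphabet_lines has to match the label indices the acoustic model was trained with, and decode_chars is what turns decoder output back into pieces before the subword model joins them into text.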

Marius Hucker · Answer 1 · Wed Jul 10 2024 17:29:03 GMT+0800 (China Standard Time)

Have you ever solved this? Is there a way to use subword encoding?