snakers4 / russian_stt_text_normalization

Russian text normalization pipeline for speech-to-text and other applications based on tagging s2s networks

What are the best practices for using this lib for STT?

Alex-Kopylov opened this issue

I have zero experience building STT models, so please advise me.

I'm using your open_stt (thanks!) with SeanNaren/deepspeech.pytorch to build an STT model. As you know, I must provide labels for training.

What is the intuition behind using string.punctuation together with both uppercase and lowercase letters? Should I provide this (below) as labels, or leave only the space and letters (e.g. lowercase)?

# punctuation + space + rus
self.tgt_vocab = {token: i+5 for i, token in enumerate(punctuation + rus_letters + ' ' + '«»—')}
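
To make the two options concrete, here is a rough standalone sketch of what that line builds, next to the reduced "lowercase letters plus space" label set I'm asking about. Assumptions on my side: `rus_letters` is not shown here, so I take it to be the 33 lowercase Russian letters followed by their uppercase forms, and I guess that the `+5` offset leaves the low indices free for special tokens. `rus_lower`, `stt_labels` and `to_stt_label` are hypothetical names of mine, not part of the library.

```python
from string import punctuation

# Assumption: the real rus_letters is not shown in this issue; here it is
# the 33 lowercase Russian letters followed by their uppercase forms.
rus_lower = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя'
rus_letters = rus_lower + rus_lower.upper()

# Same expression as in the snippet above, written outside the class.
# Every character gets an integer id starting at 5 (ids 0-4 are left unused
# here; presumably they belong to special tokens, but that is a guess).
tgt_vocab = {token: i + 5
             for i, token in enumerate(punctuation + rus_letters + ' ' + '«»—')}

# Option 2 from the question: only lowercase letters and the space.
stt_labels = list(rus_lower + ' ')


def to_stt_label(text):
    """Hypothetical helper: reduce a transcript to the stt_labels alphabet."""
    allowed = set(stt_labels)
    lowered = text.lower()
    # Replace every character outside the alphabet with a space,
    # then collapse runs of whitespace into a single space.
    kept = ''.join(ch if ch in allowed else ' ' for ch in lowered)
    return ' '.join(kept.split())


print(len(tgt_vocab), len(stt_labels))
print(to_stt_label('Привет, Мир!'))
```

With option 1 the acoustic model would have to predict case and punctuation directly; with option 2 it only predicts letters and spaces, and case and punctuation would have to be restored afterwards.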