wenet-e2e / WeTextProcessing.deprecated

[RoadMap] development plan for Chinese inverse text normalization

xingchensong opened this issue

Project Explanations:

  • Following NeMo's two-stage method of (1) classification and (2) verbalization, we plan to adapt jiayu's ITN grammar to this two-stage pipeline (for more details, please see this paper).

  • We separate Chinese ITN into two stages (each stage has its own WFST) rather than transducing the input text with a single WFST, for two reasons:

    1. WFSTs can only process input linearly, but the word order can change from spoken to written form (e.g. 三分之一 -> 1/3)
    2. The English ITN grammars, which have been carefully designed in NeMo, can be seamlessly integrated into this project
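To make the reordering argument concrete, here is a minimal sketch of the two stages in plain Python. It is only an illustration: a regular expression stands in for the classification WFST, and the token layout and the names `classify`, `verbalize`, `numerator` and `denominator` are assumptions for this example, not the project's actual grammar or API.

```python
import re

# Toy digit map; a real grammar would also cover 十, 百, 千 and so on.
DIGITS = {"零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
          "五": "5", "六": "6", "七": "7", "八": "8", "九": "9"}
DIGIT_CLASS = "[" + "".join(DIGITS) + "]+"


def to_digits(text):
    """Map each Chinese digit character to its Arabic digit."""
    return "".join(DIGITS[ch] for ch in text)


def classify(text):
    """Stage 1: tag the input, reading fields in spoken order."""
    match = re.fullmatch(f"({DIGIT_CLASS})分之({DIGIT_CLASS})", text)
    if match is None:
        return {"class": "plain", "value": text}
    # In spoken Chinese the denominator comes first.
    return {"class": "fraction",
            "denominator": to_digits(match.group(1)),
            "numerator": to_digits(match.group(2))}


def verbalize(token):
    """Stage 2: emit the written form; fields can be reordered here."""
    if token["class"] == "fraction":
        return f'{token["numerator"]}/{token["denominator"]}'
    return token["value"]


token = classify("三分之一")
print(token)
print(verbalize(token))
```

Because the first stage only tags fields and the second stage decides their output order, neither stage needs to reorder symbols while reading its input linearly.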

Road Map:

  • Design semiotic classes for Chinese
  • Update the Chinese ITN grammars from single-stage to two-stage
  • Simplify the ITN-related code of Sparrowhawk (C++) and migrate it to the WeNet runtime

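The tower stands a hundred feet high; from it one could pluck the stars. I dare not raise my voice, for fear of disturbing those who dwell in heaven.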
Seems great. I will learn the basic ideas first.

semiotic classes:

| category   | sub-category | example                                         |
|------------|--------------|-------------------------------------------------|
| number     | int          | 三十一 ==> 31                                   |
|            | float        | 三十一点五七一 ==> 31.571                       |
|            | serial       | 一一一二二二三三三 ==> 111222333                |
|            | telephone    | 加八六一八五四四一三九一二一 ==> +86-18544139121 |
| electronic | IP           | 二幺九点二二三点幺八四点二五二 ==> 219.223.184.252 |
|            | email        | xyz艾特gmail点com ==> xyz@gmail.com             |
|            | url          | xyz点com ==> xyz.com                            |
| fraction   | fraction     | 三分之一点二 ==> 1.2/3                          |
| percent    | percent      | 百分之二点五 ==> 2.5%                           |
| measure    | measure      | 五点五美元 ==> 5.5$                             |
| date       | date         | 二零二一年三月四日 ==> 2021年3月4日             |
| time       | time         | 下午三点十五分 ==> 3:15 pm                      |
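
As a rough illustration of what a few of these classes have to do, the sketch below converts the int, float, serial, IP and percent examples with plain Python. It is not the planned WFST grammar: the function names are hypothetical, and `to_int` only handles numerals below ten thousand (no 万 or 亿).

```python
# Digit and unit tables; 幺 is the spoken variant of 一 used when reading digits.
DIGITS = {"零": 0, "一": 1, "幺": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}


def to_serial(text):
    """serial: read the characters digit by digit."""
    return "".join(str(DIGITS[ch]) for ch in text)


def to_int(text):
    """int: positional numerals below 10000, e.g. 三十一."""
    total, current = 0, 0
    for ch in text:
        if ch in DIGITS:
            current = DIGITS[ch]
        else:
            # A bare unit such as 十 counts as one times that unit.
            total += (current or 1) * UNITS[ch]
            current = 0
    return total + current


def to_float(text):
    """float: integer part, then 点, then digits read one by one."""
    whole, _, frac = text.partition("点")
    return f"{to_int(whole)}.{to_serial(frac)}" if frac else str(to_int(whole))


def to_ip(text):
    """IP: four digit groups separated by 点."""
    return ".".join(to_serial(part) for part in text.split("点"))


def to_percent(text):
    """percent: the prefix 百分之 followed by a number."""
    prefix = "百分之"
    if not text.startswith(prefix):
        raise ValueError("not a percent expression")
    return to_float(text[len(prefix):]) + "%"


print(to_int("三十一"))
print(to_float("三十一点五七一"))
print(to_serial("一一一二二二三三三"))
print(to_ip("二幺九点二二三点幺八四点二五二"))
print(to_percent("百分之二点五"))
```

Note that 点 means a decimal point in the float class and a separator in the IP class, which is one reason the classification stage has to decide the semiotic class before any conversion happens.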