[RoadMap] development plan for Chinese inverse text normalization
xingchensong opened this issue · comments
Xingchen Song(宋星辰) commented
Project Explanations:
- Following NeMo's two-stage method of (1) classification + (2) verbalization, we plan to adapt jiayu's ITN grammar to this two-stage pipeline (for more details, please see this paper).
- Why we separate Chinese ITN into two stages (each stage has its own WFST) rather than transducing the input text with a single WFST:
  - A WFST can only process its input linearly, but the word order can change between the spoken and written forms (e.g. 三分之一 -> 1/3)
  - The English ITN grammars, which have been carefully designed in NeMo, can be seamlessly integrated into this project
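To make the reordering argument concrete, here is a minimal sketch of the two-stage idea in plain Python. It is not a WFST and not NeMo's or jiayu's grammar: the regular expression, the token dictionaries, and the function names `classify`, `verbalize`, and `inverse_normalize` are all assumptions made for illustration, and only single-digit fractions are handled.

```python
import re

# Hypothetical single-digit table; a real grammar would cover full numbers.
DIGITS = {"零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
          "五": "5", "六": "6", "七": "7", "八": "8", "九": "9"}

# Spoken order is "<denominator>分之<numerator>".
FRACTION = re.compile(r"([零一二三四五六七八九])分之([零一二三四五六七八九])")


def classify(text):
    """Stage 1: tag spans in input order; nothing is reordered here."""
    tokens, pos = [], 0
    for m in FRACTION.finditer(text):
        if m.start() > pos:
            tokens.append({"class": "plain", "value": text[pos:m.start()]})
        tokens.append({
            "class": "fraction",
            "denominator": DIGITS[m.group(1)],  # spoken first
            "numerator": DIGITS[m.group(2)],    # spoken second
        })
        pos = m.end()
    if pos < len(text):
        tokens.append({"class": "plain", "value": text[pos:]})
    return tokens


def verbalize(tokens):
    """Stage 2: render each token; fields can be emitted in a new order."""
    out = []
    for tok in tokens:
        if tok["class"] == "fraction":
            out.append(f'{tok["numerator"]}/{tok["denominator"]}')
        else:
            out.append(tok["value"])
    return "".join(out)


def inverse_normalize(text):
    """Run both stages: classification, then verbalization."""
    return verbalize(classify(text))


print(inverse_normalize("三分之一"))
```

The first stage only reads left to right and labels fields; the swap of numerator and denominator happens in the second stage, which works on the labeled fields rather than on the raw character sequence.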
Xingchen Song(宋星辰) commented
Road Map:
- Design semiotic classes for Chinese
- Update Chinese ITN grammars from single-stage to two-stage
- Simplify the ITN-related code of Sparrowhawk (C++) and migrate it to the WeNet runtime
Binbin Zhang commented
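"The tower rises a hundred feet; from it one could pluck the stars (星辰). I dare not raise my voice, for fear of startling those who dwell in heaven."
Looks great. I will learn the basic ideas first.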
Xingchen Song(宋星辰) commented
semiotic classes:
| category | sub-category | example |
|---|---|---|
| number | int | 三十一 ==> 31 |
| | float | 三十一点五七一 ==> 31.571 |
| | serial | 一一一二二二三三三 ==> 111222333 |
| | telephone | 加八六一八五四四一三九一二一 ==> +86-18544139121 |
| electronic | IP | 二幺九点二二三点幺八四点二五二 ==> 219.223.184.252 |
| | email | xyz艾特gmail点com ==> xyz@gmail.com |
| | url | xyz点com ==> xyz.com |
| fraction | fraction | 三分之一点二 ==> 1.2/3 |
| percent | percent | 百分之二点五 ==> 2.5% |
| measure | measure | 五点五美元 ==> 5.5$ |
| date | date | 二零二一年三月四日 ==> 2021年3月4日 |
| time | time | 下午三点十五分 ==> 3:15 pm |
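As a rough illustration of how a few of these sub-categories differ, the sketch below converts the int, serial, float, percent, and IP examples in plain Python. It is only a stand-in for the planned WFST grammars: the helper names (`cn_int`, `cn_serial`, `cn_float`, `cn_percent`, `cn_ip`) are hypothetical, `cn_int` assumes values below 10000, and inputs are assumed to be well formed.

```python
# Digit readings; 幺 is the spoken variant of 1 used in serial numbers.
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9, "幺": 1}
UNITS = {"十": 10, "百": 100, "千": 1000}


def cn_int(text):
    """Positional reading, e.g. 三十一 -> 31 (values below 10000 only)."""
    total, digit = 0, 0
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in UNITS:
            # A bare unit, such as the leading 十 in 十五, counts as 1 * unit.
            total += (digit or 1) * UNITS[ch]
            digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + digit


def cn_serial(text):
    """Digit-by-digit reading, e.g. 一一一二二二三三三 -> 111222333."""
    return "".join(str(DIGITS[ch]) for ch in text)


def cn_float(text):
    """Integer part is positional; the part after 点 is digit-by-digit."""
    integer, _, fraction = text.partition("点")
    return f"{cn_int(integer)}.{cn_serial(fraction)}"


def cn_percent(text):
    """Strip the 百分之 prefix, e.g. 百分之二点五 -> 2.5%."""
    prefix = "百分之"
    if not text.startswith(prefix):
        raise ValueError("expected a 百分之 prefix")
    body = text[len(prefix):]
    value = cn_float(body) if "点" in body else str(cn_int(body))
    return value + "%"


def cn_ip(text):
    """Each 点-separated group is read digit-by-digit."""
    return ".".join(cn_serial(part) for part in text.split("点"))


print(cn_int("三十一"), cn_float("三十一点五七一"), cn_percent("百分之二点五"))
print(cn_ip("二幺九点二二三点幺八四点二五二"))
```

The same character 点 acts as a decimal point in the float and percent classes and as a separator in the IP class, which is one reason the classification stage has to decide the semiotic class before anything is rendered.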