Code and corpora for the paper "Dual Long Short-Term Memory Networks for Sub-Character Representation Learning" (accepted at ITNG 2018).
We propose learning character-level and sub-character-level representations jointly to capture deeper semantic meaning. Applied to Chinese Word Segmentation as a case study, our solution achieved state-of-the-art results on both Simplified and Traditional Chinese, without any extra Traditional-to-Simplified Chinese conversion.
- Python >= 3
- DyNet==2.0.1
Simply run one command:
./script/run.sh pku 1
It does everything on the fly, including data preparation, training, and testing.
- You can replace `pku` with `msr`, `cityu`, or `as`.
- The second parameter selects the model option, from `1` to `6`; details are listed in the next section.
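To run several corpus and model combinations in sequence, the command above can be wrapped in a small driver. The following is a minimal sketch and not part of this repository: the helper names `build_command` and `run_all` are hypothetical, and it assumes the script is launched from the repository root so that `./script/run.sh` resolves.

```python
import subprocess

# Corpus names and model options as listed in this README.
CORPORA = ("pku", "msr", "cityu", "as")
MODEL_OPTIONS = tuple(range(1, 7))  # 1 to 6


def build_command(corpus, model):
    """Return the argument list for one run (hypothetical helper)."""
    if corpus not in CORPORA:
        raise ValueError(f"unknown corpus: {corpus!r}")
    if model not in MODEL_OPTIONS:
        raise ValueError(f"model option must be 1 to 6, got {model!r}")
    return ["./script/run.sh", corpus, str(model)]


def run_all(corpora=CORPORA, models=MODEL_OPTIONS, dry_run=True):
    """Build every corpus/model command; execute them unless dry_run is set."""
    commands = [build_command(c, m) for c in corpora for m in models]
    for cmd in commands:
        if dry_run:
            # Only show what would be executed.
            print(" ".join(cmd))
        else:
            # check=True stops at the first failing run.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands for model 1 on every corpus.
    run_all(models=(1,))
```

With `dry_run=True` (the default) nothing is executed, so the command list can be inspected first; pass `dry_run=False` to launch the runs one after another.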
We present `6` models in our paper. Their configurations are shown in the following table:
#. model | char | subchar | radical | tie | bigram |
---|---|---|---|---|---|
1. baseline |
YES | ||||
2. +subchar |
YES | ||||
3. +radical |
YES | YES | |||
4. +radical -char |
YES | ||||
5. +radical +tie |
YES | YES | YES | ||
6. +radical +tie +bigram |
YES | YES | YES | YES |
- Thanks to the friends who helped us with the experiments.
- The corpora are from SIGHAN05 and should be used for research purposes only.
- The model implementation is modified from a DyNet 1.x version by rguthrie3.