junailin / OpenIME

Open Vocabulary Learning for Neural Chinese Pinyin IME

Home Page: https://arxiv.org/pdf/1811.04352.pdf

Dataset and code accompanying the paper "Open Vocabulary Learning for Neural Chinese Pinyin IME".

Dataset

We provide two processed corpora for IME evaluation: the People’s Daily corpus (PD) and the TouchPal corpus (TP).

Corpus  Statistic             Chinese  Pinyin
PD      MIUs                  5.04M
        Word                  24.7M    24.7M
        Vocab                 54.3K    41.1K
        Target Vocab (train)  2309     -
        Target Vocab (dec)    2168     -
TP      MIUs                  689.6K
        Word                  4.1M     4.1M
        Vocab                 27.2K    20.2K
        Target Vocab (train)  2020     -
        Target Vocab (dec)    2009     -
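
One way to read the Word and Vocab rows is as the total number of tokens and the number of distinct tokens on each side of a corpus. The sketch below computes both for a handful of made-up lines, assuming one sentence per line with whitespace-separated tokens. The function name `corpus_stats` and the sample data are illustrative only and are not part of the released code; the exact counting procedure used for the table is not described here.

```python
from collections import Counter


def corpus_stats(lines):
    """Return (total tokens, distinct tokens) for an iterable of text lines.

    Assumes whitespace-separated tokens; this is an illustrative assumption,
    not necessarily how the figures in the table were produced.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return sum(counts.values()), len(counts)


# Made-up pinyin sample, not taken from the PD or TP corpora.
sample = ["ni hao shi jie", "ni hao"]
words, vocab = corpus_stats(sample)
print(words, vocab)
```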

File suffixes:

.ali      target
.py       source
.adddict  training set
.test2k   test set
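
As a minimal sketch of how the source and target files could be read together, the snippet below pairs the two sides line by line. It assumes the files are line-aligned (line i of the source corresponds to line i of the target) and whitespace-tokenized, and that the source side holds pinyin and the target side holds Chinese text. The helper `read_parallel`, the file names `demo.py` and `demo.ali`, and the sample sentences are hypothetical; check the released files for their actual layout.

```python
import os
import tempfile


def read_parallel(src_path, tgt_path):
    """Read two line-aligned files; return a list of (source_tokens, target_tokens)."""
    with open(src_path, encoding="utf-8") as f:
        src_lines = f.read().splitlines()
    with open(tgt_path, encoding="utf-8") as f:
        tgt_lines = f.read().splitlines()
    # Fail loudly rather than silently dropping unpaired lines.
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files have different line counts")
    return [(s.split(), t.split()) for s, t in zip(src_lines, tgt_lines)]


# Tiny self-contained demo with made-up data; the file names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "demo.py")   # source side (assumed to be pinyin)
    tgt_path = os.path.join(tmp, "demo.ali")  # target side (assumed to be Chinese text)
    with open(src_path, "w", encoding="utf-8") as f:
        f.write("ni hao\nshi jie\n")
    with open(tgt_path, "w", encoding="utf-8") as f:
        f.write("你 好\n世 界\n")
    pairs = read_parallel(src_path, tgt_path)

print(len(pairs))
```

Raising on a line-count mismatch, rather than truncating to the shorter file, makes a misaligned download visible immediately.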

The full corpus and pre-trained vectors can be downloaded from https://drive.google.com/drive/folders/1v6QW7ULu-iYxU5uruiuSgYGmoXOcHAeX?usp=sharing.

Source Code

We also release our source code to help others reproduce our results. It is modified from OpenNMT and is used in a similar way.

Reference

If you use this repo, please cite our paper:

@inproceedings{zhang2019acl-ime,
	title = "{Open Vocabulary Learning for Neural Chinese Pinyin IME}",
	author = "Zhang, Zhuosheng and Huang, Yafang and Zhao, Hai",
	booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL)",
	year = "2019",
}

Languages

Lua 97.6%, Python 1.5%, Perl 0.6%, Shell 0.3%