magic282/Automatic-Corpus-Generation

A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check (EMNLP 2018)

This repository contains scripts for automatically generating sentences with spelling errors, in which the error locations and their corrections are marked without any human intervention. We also provide a generated Dataset of 271,329 sentences (min_length=4, max_length=140, average_length=42.5, total_error=381,962, average_error=1.4) and a Confusionset for future research on Chinese Spelling Checking (CSC).
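As an illustration of how error locations and corrections can be marked automatically, here is a minimal sketch. It assumes that a generated sentence differs from its correct source only by character substitutions, so both have the same length. The function name `mark_errors` and the example sentences are hypothetical and are not taken from the repository's scripts.

```python
from typing import List, Tuple


def mark_errors(correct: str, generated: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, wrong character, correct character) per mismatch.

    Assumption: errors are substitutions only, so both sentences have
    the same number of characters.
    """
    if len(correct) != len(generated):
        raise ValueError("sentences must have the same length")
    return [
        (position, wrong, right)
        for position, (right, wrong) in enumerate(zip(correct, generated), start=1)
        if right != wrong
    ]


if __name__ == "__main__":
    # Hypothetical example: the last character has been replaced
    # by a similar-sounding one.
    print(mark_errors("我今天很高兴", "我今天很高心"))  # [(6, '心', '兴')]
```

Because the comparison is purely positional, it needs no human annotation, but it does not handle insertions or deletions.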

Note: The Dataset and Confusionset will be continuously updated.

Main Libraries

pytesseract
OpenCV
Kaldi
Python 3.5
PyTorch 0.4
numpy
BeautifulSoup

OCR-based Method

ASR-based Method

Basic Model

After generating the dataset with our proposed method, you can try any model you like on CSC. Here, we implement a PyTorch-based BiLSTM model, many details of which can be further optimized.

To train the model, run python main_train.py. Training details are printed to the screen.
To test the model, run python main_test.py.

Note: You can tune the hyper-parameters or add more generated data to improve the model's performance.

Confusionset

For a given character, a confusionset is the set of characters that are visually or phonologically similar to it. For example, 哨:宵诮梢捎俏咪尚悄少销消硝赵逍屑吵噹躺稍峭鞘肖. Confusion sets are widely used in the task of CSC. As a "byproduct" of our proposed method, we construct a confusion set for every correct character involved by collecting all of its incorrect variants. We also release this confusionset for future research on CSC.
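The collection step described above can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be an iterable of (correct character, incorrect character) pairs harvested from the generated data, and the names `build_confusion_set` and `format_entry` are hypothetical rather than part of this repository. The output line mimics the `character:variants` layout of the example above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_confusion_set(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group every incorrect variant under the correct character it replaced."""
    confusion: Dict[str, Set[str]] = defaultdict(set)
    for correct, wrong in pairs:
        if correct != wrong:  # skip pairs that are not actually errors
            confusion[correct].add(wrong)
    return dict(confusion)


def format_entry(char: str, variants: Set[str]) -> str:
    """Render one entry as 'character:variants'; variants are sorted by code point."""
    return f"{char}:{''.join(sorted(variants))}"


if __name__ == "__main__":
    # Hypothetical pairs; the variants of 哨 are taken from the example above.
    pairs = [("哨", "宵"), ("哨", "诮"), ("哨", "梢"), ("哨", "宵"), ("兴", "心")]
    for char, variants in build_confusion_set(pairs).items():
        print(format_entry(char, variants))
```

Using a set per character removes duplicate variants that occur many times in the generated corpus.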

Testing Datasets

SIGHAN Bake-off 2013: http://ir.itc.ntnu.edu.tw/lre/sighan7csc.html

SIGHAN Bake-off 2014: http://ir.itc.ntnu.edu.tw/lre/clp14csc.html

SIGHAN Bake-off 2015: http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html

Note: All datasets above were originally written in Traditional Chinese. Because our generated datasets are in Simplified Chinese, we have converted the original datasets to Simplified Chinese; the converted versions can be found in the Data folder. The tool we used to convert Traditional Chinese to Simplified Chinese is OpenCC.

Citation

If you find the implementation useful, please cite the following paper: A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check

@InProceedings{Reimers:2018:EMNLP,
  author    = {Dingmin Wang and Yan Song and Jing Li and Jialong Han and Haisong Zhang},
  title     = {{A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check}},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  month     = {11},
  year      = {2018},
  address   = {Brussels, Belgium},
}

Contact

Drop me (Dingmin Wang) an email at wangdimmy (AT) gmail.com if you have any questions.
