This repository contains
- Experimental scripts for investigating an effect of normalizing Japanese orthographical variants
- KFTT dataset for investigating an effect of normalizing Japanese orthographical variants
- Preprocessing and Post-processsing scripts for cleaning KFTT and NTCIR-7 corpora.
Preprocessing scripts for SMT research. These scripts is apllied for a research(Japanese Orthographical Normalization Do Not Work for Statistical Machine Translation).
These scripts make a parallel corpus clear for Statistical Machine Translation research. Target corpora are KFTT(The Kyoto Free Translation Task) and NTCIR-7(Pattent corpus).
git clone https://github.com/kanjirz50/mt_ialp2016.git
cd mt_ialp2016
./make_corpus.sh
# A lot of files will be downloaded.
Moses is required to run a training scripts. Before runnnig script, please install Moses and check the path.
# after running "make_corpus.sh"
cd ex_script
bash ex_baseline_E2J.sh /tools/src/mosesdecoder 4 8
# arg 1: Moses directory
# arg 2: Theread number
# arg 3: Memory(GB)
# Training and translation will start
Then, work
directory will be created.
A translation system and a translated text will be created under the directory.
script/ja_prepro.pl
deletes Japanese name era and yomigana.
(Yomigana is Japanese writing system for showing how to pronounce a word.)
script/en_prepro/pl
lowercases all alphabetic characters.
script/ja_postpro.pl
is a post processing-script for chunking numeric representations.
Because Japanese morphological analyzer, MeCab-UniDic segments numeric representations.
In English side, there is no boundary within a numeric representation.
ex. 私は25歳です。 (I am 25 years old.) 私 は 2 5 歳 です。
cat word_segmented_japanese_text.ja | perl script/ja_postpro.pl
- Graham Neubig, "The Kyoto Free Translation Task," http://www.phontron.com/kftt, 2011.
- Kazuhide Yamamoto and Kanji Takahashi. Japanese Orthographical Normalization Do Not Work for Statistical Machine Translation. Proceedings of the International Conference on Asian Language Processing (IALP 2016), Nov, 2016
All scripts: MIT KFTT corpus: Creative Commons Attribution-Share-Alike License 3.0