This is a shared task at WMT 2022. This year, the shared task involves two language pairs, German-English (De-En) and Chinese-English (Zh-En), each in both translation directions.
- Download the datasets for De-En and Zh-En (see the details in the next section).

  ❗ATTENTION❗ Participants must use only the data provided here.

- Download the scripts in the directory scripts/ to preprocess the data.
- Run the scripts to obtain the simulated training data for the WLAC task from the bilingual data.
For De-En, the bilingual data is from WMT 14, as preprocessed by the Stanford NLP Group: train.de and train.en.
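For convenience, a minimal download sketch follows. It assumes the preprocessed files are still hosted on the Stanford NMT project page; both the URL and the destination folder are assumptions to verify, not part of the official task instructions.

```python
import os
import urllib.request

# Assumed hosting location (Stanford NMT project page); verify before use.
BASE = "https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/"
DEST = "data/de-en"  # hypothetical destination folder; adjust as needed

os.makedirs(DEST, exist_ok=True)
for name in ("train.de", "train.en"):
    urllib.request.urlretrieve(BASE + name, os.path.join(DEST, name))
    print("downloaded", name)
```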
For Zh-En, the bilingual data is the "UN Parallel Corpus V1.0" from WMT 17. To obtain the data, follow three steps:
- Download the two files UNv1.0.en-zh.tar.gz.00 and UNv1.0.en-zh.tar.gz.01. You may also find both files on the corpus webpage.
- Run the following command to combine the two files and decompress them:
  cat UNv1.0.en-zh.tar.gz.* | tar -xzf -
- en-zh/UNv1.0.en-zh.en and en-zh/UNv1.0.en-zh.zh are the source and target files. Note that both files should be preprocessed (word segmentation for zh and tokenization for en) by scripts/preprocess.py.
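Before preprocessing, it is worth confirming that the two sides of the corpus are actually parallel. A quick sketch, using the paths produced by the extraction step above:

```python
# A parallel corpus must have the same number of lines on both sides;
# mismatched counts usually indicate a corrupted download or extraction.
for path in ("en-zh/UNv1.0.en-zh.en", "en-zh/UNv1.0.en-zh.zh"):
    with open(path, encoding="utf-8") as f:
        print(path, sum(1 for _ in f), "lines")
```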
Bilingual data cannot be used to train WLAC models directly. Instead, one can obtain WLAC training data from bilingual data via simulation, following reference [1] (see Section 3.2 in this paper). A sketch of the simulation idea is given below; the rest of this section then walks through the actual scripts.
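To make the simulation concrete, here is a minimal, self-contained sketch of the idea from [1]: pick a word in the target sentence, keep a typed prefix of it, and expose one of the context settings around it. All names here are illustrative, not the repository's API; the official implementation is data/generate_samples.py.

```python
import random

def simulate_example(tgt_tokens, rng=random):
    # Pick the target word the model must complete.
    i = rng.randrange(len(tgt_tokens))
    word = tgt_tokens[i]
    # Simulate the human-typed input as a non-empty prefix of the word.
    # (For Chinese target words, typed characters come from pinyin instead,
    # which is why generate_samples.py depends on pypinyin.)
    typed = word[: rng.randint(1, len(word))]
    # Expose one of the four context settings described in [1]:
    # zero-context, prefix (left words), suffix (right words), bi-context.
    left, right = tgt_tokens[:i], tgt_tokens[i + 1:]
    ctx = rng.choice([([], []), (left, []), ([], right), (left, right)])
    return {"typed": typed, "left_context": ctx[0],
            "right_context": ctx[1], "target": word}

# One simulated training example from a tokenized English target sentence.
print(simulate_example("the cat sat on the mat".split(), random.Random(0)))
```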
Assume we want to tokenize data/zh-en. At first, the folder should contain the following files:
zh-en
├── train.en
└── train.zh
0 directories, 2 files
To tokenize, perform the following operations:
cd WLAC
- Install jieba for Chinese tokenization: pip install jieba
- Download mosesdecoder for English/German tokenization: git clone https://github.com/moses-smt/mosesdecoder.git
- Run: python data/tokenize.py --source-lang zh --target-lang en --file-prefix data/zh-en/train

After the above operations, we get two more tokenized files:
zh-en
├── train.en
├── train.en.tok
├── train.zh
└── train.zh.tok
0 directories, 4 files
For more details about tokenization, you can check data/tokenize.py and data/run_mosesdecoder.sh.
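As a rough illustration of what this step does (Chinese word segmentation plus Moses-style English tokenization), here is a small sketch. It substitutes the sacremoses Python package for the mosesdecoder perl scripts that the repository actually invokes, so its output may differ slightly from the official pipeline.

```python
import jieba                            # pip install jieba
from sacremoses import MosesTokenizer   # pip install sacremoses

# sacremoses is a Python port of the Moses tokenizer, used here only to
# keep the example self-contained; data/tokenize.py uses mosesdecoder.
moses = MosesTokenizer(lang="en")

def tokenize_pair(zh_line, en_line):
    zh_tok = " ".join(jieba.cut(zh_line.strip()))              # zh segmentation
    en_tok = moses.tokenize(en_line.strip(), return_str=True)  # en tokenization
    return zh_tok, en_tok

print(tokenize_pair("我喜欢机器翻译。", "I like machine translation."))
```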
After tokenization, to generate samples for data/zh-en, just run the following commands:
- Install pypinyin: pip install pypinyin
- Run generate_samples.py: python data/generate_samples.py --source-lang zh --target-lang en --file-prefix data/zh-en/train

After the above command, we get one more file in the folder zh-en:
zh-en
├── train.en
├── train.en.tok
├── train.samples
├── train.zh
└── train.zh.tok
0 directories, 5 files
For more details, you can check data/generate_samples.py.
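pypinyin is needed because, for Chinese target words, the simulated human-typed input is a prefix of the word's pinyin rather than of its characters. A small example of the kind of conversion involved; the exact usage inside data/generate_samples.py may differ:

```python
from pypinyin import lazy_pinyin  # pip install pypinyin

# lazy_pinyin converts Chinese characters to plain pinyin syllables,
# e.g. "翻译" -> ['fan', 'yi'].
word = "翻译"
typed_full = "".join(lazy_pinyin(word))  # "fanyi"
# A simulated partially-typed input is any prefix of that sequence.
print(typed_full[:3])                    # "fan"
```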
To generate samples from the preprocessed UN Parallel Corpus instead, run the same command with the corresponding file prefix:
python data/generate_samples.py --source-lang zh --target-lang en --file-prefix data/zh-en/UNv1.0.en-zh.tok
This adds the corresponding samples file alongside the tokenized corpus in the folder zh-en.