MulDA

This repository contains the source code and data used in our paper "MulDA: A Multilingual Data Augmentation Framework for Low-Resource Cross-Lingual NER", accepted at ACL-IJCNLP 2021.

Data

The data generated using our labeled sequence translation method can be found in the "data" directory.

Labeled Sequence Translation

cd code/translate; python translate.py

lstm-lm: multilingual LSTM language model

train lstm-lm on linearized sequences

cd code/lstm-lm;

python train.py \
  --train_file PATH/TO/train.linearized.txt \
  --valid_file PATH/TO/dev.linearized.txt \
  --model_file PATH/TO/model.pt \
  --emb_dim 300 \
  --rnn_size 512 \
  --gpuid 0

generate linearized sequences

cd code/lstm-lm;

python generate.py \
  --model_file PATH/TO/model.pt \
  --out_file PATH/TO/out.txt \
  --num_sentences 10000 \
  --temperature 1.0 \
  --seed 3435 \
  --max_sent_length 32 \
  --gpuid 0
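
For readers unfamiliar with the --temperature and --seed options, the sketch below illustrates the common technique of temperature-scaled softmax sampling. It is an illustration only and is not taken from generate.py: the function name sample_with_temperature and its arguments are hypothetical, and it assumes --temperature has its usual meaning (values below 1.0 sharpen the distribution, values above 1.0 flatten it).

```python
import math
import random
from typing import List, Optional


def sample_with_temperature(
    logits: List[float],
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample one index from softmax(logits / temperature).

    Hypothetical helper for illustration; generate.py may work differently.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if rng is None:
        rng = random.Random()
    # Divide the scores by the temperature before the softmax.
    scaled = [x / temperature for x in logits]
    # Subtract the maximum so that exp() cannot overflow.
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(3435)  # a fixed seed makes the draws repeatable
    scores = [1.0, 2.0, 0.5]  # made-up scores for a three-word vocabulary
    print([sample_with_temperature(scores, 1.0, rng) for _ in range(5)])
```

With a fixed seed the same sequence of indices is drawn on every run, which is the usual reason a generation script exposes a seed option.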

tools: tools for data processing

tools/preprocess.py: sequence linearization
tools/line2cols.py: convert linearized sequence back to two-column format
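
The two conversions above can be sketched as follows. This is a minimal illustration, not the code in tools/preprocess.py or tools/line2cols.py: it assumes a linearized format in which each non-O tag is placed directly before its word and O tags are dropped, and that tags start with "B-" or "I-". The function names linearize and delinearize are hypothetical, and the scripts in the repository may use a different format.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, tag)


def linearize(sentence: List[Token]) -> str:
    """Put each non-O tag directly before its word and drop O tags."""
    pieces = []
    for word, tag in sentence:
        if tag != "O":
            pieces.append(tag)
        pieces.append(word)
    return " ".join(pieces)


def delinearize(line: str, tag_prefixes=("B-", "I-")) -> List[Token]:
    """Invert linearize: a tag token labels the word that follows it.

    Limitation of this sketch: a word that itself starts with a tag
    prefix would be mistaken for a tag.
    """
    tokens = []
    pending = None  # tag waiting for its word
    for piece in line.split():
        if pending is None and piece.startswith(tag_prefixes):
            pending = piece
        else:
            tokens.append((piece, pending or "O"))
            pending = None
    return tokens


if __name__ == "__main__":
    sentence = [("Jose", "B-PER"), ("Valentin", "I-PER"), ("lives", "O"),
                ("in", "O"), ("London", "B-LOC")]
    line = linearize(sentence)
    print(line)
    # Write the sentence back out in two-column form, one token per line.
    for word, tag in delinearize(line):
        print(f"{word}\t{tag}")
```

Under these assumptions the round trip is lossless, so sentences sampled from the language model can be converted back to two-column NER data.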

Requirements

The dependencies are listed in code/lstm-lm/requirements.txt.

Citation

Please cite our paper if you find the resources in this repository useful.

@inproceedings{liu-etal-2021-mulda,
    title = "MulDA: A Multilingual Data Augmentation Framework for Low-Resource Cross-Lingual NER",
    author = "Liu, Linlin  and
      Ding, Bosheng  and
      Bing, Lidong  and
      Joty, Shafiq  and
      Si, Luo  and
      Miao, Chunyan",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL'21)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
}
