chiennv2000 / DHGNet

Code for paper "Cross-lingual Transfer for Text Classification with Dictionary-based Heterogeneous Graph", EMNLP 2021 - findings.

DHGNet

Code repository for the Findings of EMNLP 2021 paper "Cross-lingual Transfer for Text Classification with Dictionary-based Heterogeneous Graph" [ACL] [arXiv].

Requirements

We tested the code on:

Other requirements:

  • numpy
  • pandas
  • scikit-learn
  • gensim
  • tqdm
  • nltk
  • pythainlp 2.3.1 (for Thai language tokenizer)
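
One way to confirm that the listed packages are present before running anything is to query the installed distribution metadata. The following is a minimal sketch, not part of the repository: the helper name `check_requirements` is hypothetical, and the only version pin it knows about is the pythainlp 2.3.1 entry from the list above.

```python
from importlib import metadata

# Packages from the list above; None means the README pins no version.
REQUIRED = {
    "numpy": None,
    "pandas": None,
    "scikit-learn": None,
    "gensim": None,
    "tqdm": None,
    "nltk": None,
    "pythainlp": "2.3.1",
}


def check_requirements(required=REQUIRED, get_version=metadata.version):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for name, wanted in required.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        # Only compare versions when the README states one.
        if wanted is not None and found != wanted:
            problems.append(f"{name}: found {found}, README lists {wanted}")
    return problems


if __name__ == "__main__":
    for line in check_requirements() or ["all listed packages found"]:
        print(line)
```

A different pythainlp version is reported rather than rejected, since the README only states which version was used for the Thai tokenizer.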

Usage

  1. Extract data/text_cls.zip to obtain the datasets.

  2. From the src folder, run the following command to train and evaluate DHGNet_en.

    • For the Bosnian setting:
      python main.py bosnian --rnn_layers 1 --directed 0 --add_from_dict 30000 --name [output_model_name]

    • For the other settings (bengali, malayalam, tamil, thai_t, thai_w):
      python main.py [setting_name] --name [output_model_name]

    • For DHGNet_multi, add the command option --langs ar,en,es,fa,fr,zh
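
The steps above can also be scripted. The sketch below is an illustration, not part of the repository: `extract_datasets` and `build_command` are hypothetical helper names, and unpacking the archive into `data/` is an assumption, since the README does not say where the files should land. The setting names and flags are the ones listed above.

```python
import zipfile

SETTINGS = ("bosnian", "bengali", "malayalam", "tamil", "thai_t", "thai_w")
# Extra flags the README gives for the Bosnian setting only.
BOSNIAN_FLAGS = ["--rnn_layers", "1", "--directed", "0", "--add_from_dict", "30000"]
# Source languages the README lists for the multi-source variant.
MULTI_LANGS = "ar,en,es,fa,fr,zh"


def extract_datasets(archive="data/text_cls.zip", dest="data"):
    """Step 1: unpack the dataset archive (the destination is an assumption)."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)


def build_command(setting, name, multi=False):
    """Step 2: assemble the argument list for main.py, to be run from src/."""
    if setting not in SETTINGS:
        raise ValueError(f"unknown setting: {setting!r}")
    cmd = ["python", "main.py", setting]
    if setting == "bosnian":
        cmd += BOSNIAN_FLAGS
    cmd += ["--name", name]
    if multi:
        # Multi-source variant: add the --langs option from the README.
        cmd += ["--langs", MULTI_LANGS]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, cwd="src", check=True)`, which runs the command inside the src folder and raises an error if training exits with a non-zero status.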

Note that the code automatically downloads the source word embeddings (fastText by default), which may take considerable time and disk space.
Alternatively, you can download dump files containing all source word embeddings needed for the settings above from https://1drv.ms/u/s!AkynV6rCKmmXkNBYwRchAWfurRkBrQ?e=NnOszA and place them in the folder data/word_emb/fasttext_wiki.
Then add the command option --use_temp_only 1 when running the code.
Note: to run with English as the only source language, downloading en.db_temp.pkl alone is sufficient.
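
Whether to pass `--use_temp_only 1` can be decided by checking that the dump files are actually in place. The sketch below is a hedged illustration, not repository code: `temp_only_flags` is a hypothetical helper, and it assumes that every source language has a dump named like `en.db_temp.pkl` (that is, `<lang>.db_temp.pkl`), which the README confirms only for English.

```python
from pathlib import Path


def temp_only_flags(langs=("en",), emb_dir="data/word_emb/fasttext_wiki"):
    """Return ["--use_temp_only", "1"] if every dump file exists, else []."""
    # Assumption: dumps follow the "<lang>.db_temp.pkl" pattern of en.db_temp.pkl.
    needed = [Path(emb_dir) / f"{lang}.db_temp.pkl" for lang in langs]
    if all(path.is_file() for path in needed):
        return ["--use_temp_only", "1"]
    # At least one dump is missing: return no flag so the code downloads embeddings.
    return []
```

The returned list can be appended to the arguments of the `python main.py` command. When a dump is missing, the option is left out and the automatic download takes over.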

Reference

If you find the code helpful, please cite our work:

@inproceedings{chairatanakul-etal-2021-cross-lingual,
    title = "Cross-lingual Transfer for Text Classification with Dictionary-based Heterogeneous Graph",
    author = "Chairatanakul, Nuttapong  and
      Sriwatanasakdi, Noppayut  and
      Charoenphakdee, Nontawat  and
      Liu, Xin  and
      Murata, Tsuyoshi",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.130",
    pages = "1504--1517",
}

License: MIT License

