nlp-waseda / Kanbun-LM

Code for the paper "Kanbun-LM: Reading and Translating Classical Chinese in Japanese Methods by Language Models"

Kanbun-LM

This is the repository for our paper "Kanbun-LM: Reading and Translating Classical Chinese in Japanese Methods by Language Models". The paper was accepted to Findings of ACL 2023. See you in Toronto!

[ACL] [arXiv] [GitHub] [demo]

Dataset

  • We introduce this dataset mainly in Section 3 "Our Dataset and Tasks".

  • There are three files: train, validation, and test. We split the dataset with a group shuffle split so that all sentences from one poem end up in the same file.

  • Each file contains 4 columns:

    • poetry_id: The ID of the poem. Each poem has multiple sentences.
    • hakubun: The original Classical Chinese sentences.
    • kakikudashi: The translated Kanbun sentences.
    • reading_order_ja: The Japanese reading orders of the original sentences (the numbers represent their index in the original text).
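
The group-aware split described above can be sketched as follows. This is a minimal standard-library illustration, not the code used to build the released files: the function name `group_shuffle_split`, the 80/10/10 ratios, the seed, and the toy rows are all assumptions made for the example. The point it shows is that poems (groups) are shuffled and assigned to splits as whole units, so sentences sharing a `poetry_id` never land in different files.

```python
import random
from collections import defaultdict


def group_shuffle_split(rows, group_key="poetry_id", ratios=(0.8, 0.1, 0.1), seed=0):
    """Split rows into train/validation/test so that no group spans two splits.

    The ratios apply to the number of groups (poems), not the number of rows.
    The default ratios are an assumption for this sketch, not the paper's values.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    group_ids = sorted(groups)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(group_ids)

    n_train = int(len(group_ids) * ratios[0])
    n_valid = int(len(group_ids) * ratios[1])
    parts = {
        "train": group_ids[:n_train],
        "validation": group_ids[n_train:n_train + n_valid],
        "test": group_ids[n_train + n_valid:],
    }
    # Expand each list of group ids back into the rows that belong to it.
    return {name: [row for gid in ids for row in groups[gid]] for name, ids in parts.items()}


# Toy data: 10 hypothetical poems with 4 sentences each (placeholder text, not dataset rows).
rows = [
    {"poetry_id": pid, "hakubun": f"sentence-{pid}-{i}"}
    for pid in range(10)
    for i in range(4)
]
splits = group_shuffle_split(rows)

# Every poem ends up in exactly one split.
ids = {name: {row["poetry_id"] for row in part} for name, part in splits.items()}
assert not (ids["train"] & ids["validation"])
assert not (ids["train"] & ids["test"])
assert not (ids["validation"] & ids["test"])
```

Splitting by poem rather than by sentence keeps sentences of a test poem out of the training data, which would otherwise leak context between splits.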

Code

  • We introduce our implementation mainly in Section 4.1 "Implementation for Tasks".

  • There are three folders.

    • baseline is the implementation of the UD-Kundoku baseline. See the original repository for more details: https://github.com/KoichiYasuoka/UD-Kundoku
    • sort is the implementation for the character reordering task.
      • Grid search details can be found in sort/run.sh.
    • generation is the implementation for the machine translation task.
      • T5 and GPT do not share code; see generation/t5 and generation/gpt separately for details.
      • Grid search details can be found in generation/t5/run.sh and generation/gpt/run.sh.
      • The pipeline is enabled with the --sort_hakubun option. Use --sort_with label to pre-reorder the input with gold labels, or --sort_with prediction to pre-reorder it with the prediction results.
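
To make the pre-reorder step concrete, here is a small sketch of what reordering the input before translation might look like. It is not the repository's implementation: `reorder` and `build_model_input` are hypothetical helpers whose parameters only mirror the `--sort_hakubun` and `--sort_with` options, and it assumes that a reading order is a list of 0-based character indices given in the order they are read (pass `base=1` if the data turns out to be 1-based). The example sentence and the "predicted" order are illustrative and not taken from the dataset files.

```python
def reorder(hakubun, reading_order, base=0):
    """Return the characters of `hakubun` in Japanese reading order.

    `reading_order` lists original-text indices in the order they are read.
    `base` is the index of the first character (assumed 0 in this sketch).
    """
    positions = [i - base for i in reading_order]
    if sorted(positions) != list(range(len(hakubun))):
        raise ValueError("reading_order must be a permutation of the character indices")
    return "".join(hakubun[p] for p in positions)


def build_model_input(hakubun, gold_order, predicted_order,
                      sort_hakubun=False, sort_with="label"):
    """Hypothetical stand-in for the pre-reorder step before translation."""
    if not sort_hakubun:
        return hakubun  # no pipeline: translate the original character order
    if sort_with == "label":
        return reorder(hakubun, gold_order)  # reorder with gold reading order
    if sort_with == "prediction":
        return reorder(hakubun, predicted_order)  # reorder with model output
    raise ValueError(f"unknown sort_with value: {sort_with!r}")


hakubun = "春眠不覚暁"
gold = [0, 1, 4, 3, 2]       # 春 眠 暁 覚 不, following the reading 春眠暁を覚えず
predicted = [0, 1, 3, 4, 2]  # a made-up, imperfect prediction for illustration

print(build_model_input(hakubun, gold, predicted))                      # 春眠不覚暁
print(build_model_input(hakubun, gold, predicted, True, "label"))       # 春眠暁覚不
print(build_model_input(hakubun, gold, predicted, True, "prediction"))  # 春眠覚暁不
```

Comparing the `label` and `prediction` settings separates two sources of error: translation quality given a perfect reordering, and the extra loss introduced when the reordering itself comes from a model.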

Citation

@inproceedings{wang-etal-2023-kanbun,
    title = "Kanbun-{LM}: Reading and Translating Classical {C}hinese in {J}apanese Methods by Language Models",
    author = "Wang, Hao  and
      Shimizu, Hirofumi  and
      Kawahara, Daisuke",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.545",
    pages = "8589--8601",
}
