ZurichNLP / swiss-german-text-encoders

Code for the paper "Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect"

Blog post

List of models released for this paper:

Installation

  • Requirements: Python >= 3.8, PyTorch
  • pip install -r requirements.txt
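Before installing, it can help to confirm that the interpreter meets the stated requirements. The following is a minimal sketch, not part of the repository: the helper name `check_environment` is hypothetical, and it only checks the two requirements listed above (Python version and whether PyTorch is importable).

```python
import importlib.util
import sys

# Minimum Python version stated in the requirements above.
MIN_PYTHON = (3, 8)


def check_environment(min_python=MIN_PYTHON):
    """Return a list of problems found; an empty list means the basics look fine.

    Hypothetical helper for illustration only; it is not shipped with the repository.
    """
    problems = []
    if sys.version_info < min_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python >= {wanted} is required, found {found}")
    # find_spec looks the package up without importing it.
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch (package 'torch') is not installed")
    return problems


for problem in check_environment():
    print(problem)
```

If the script prints nothing, both requirements are met and `pip install -r requirements.txt` can proceed.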

Continued Pre-training

Data

  • Not all the data we used are publicly available. See data/README.md for details.
  • Preprocessing: python -m scripts.preprocess_continued_pretraining_data

Training

  • Subword level: python -m scripts.continued_pretraining_subword <model_name_or_path>
    • Tested with xlm-roberta-base, facebook/xmod-base, ZurichNLP/swissbert
  • Character level: python -m scripts.continued_pretraining_char <model_name_or_path>
    • Tested with google/canine-s, facebook/xmod-base, ZurichNLP/swissbert (the latter two correspond to the GLOBI approach described in Section 4.3 of the paper)
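To run continued pre-training for every tested model in one go, the commands above can be driven from a small script. This is a sketch under assumptions: the helper names `build_command` and `run_all` are hypothetical, the script is assumed to be started from the repository root (so that `python -m scripts.…` resolves), and no command-line options beyond `<model_name_or_path>` are assumed.

```python
import subprocess
import sys

# Model names listed as tested in this README.
SUBWORD_MODELS = ["xlm-roberta-base", "facebook/xmod-base", "ZurichNLP/swissbert"]
CHAR_MODELS = ["google/canine-s", "facebook/xmod-base", "ZurichNLP/swissbert"]

# Module names taken from the commands in this README.
MODULES = {
    "subword": "scripts.continued_pretraining_subword",
    "char": "scripts.continued_pretraining_char",
}


def build_command(level, model_name_or_path):
    """Build the argument list for one continued pre-training run (hypothetical helper)."""
    return [sys.executable, "-m", MODULES[level], model_name_or_path]


def run_all(dry_run=True):
    """Print (dry run) or execute every subword-level and character-level command."""
    commands = [build_command("subword", m) for m in SUBWORD_MODELS]
    commands += [build_command("char", m) for m in CHAR_MODELS]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first failing run.
            subprocess.run(command, check=True)
    return commands


run_all(dry_run=True)
```

With `dry_run=True` the script only prints the six commands; switch to `run_all(dry_run=False)` to launch the runs one after another.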

Evaluation

Data

  • See data/README.md for instructions on how to download the data.

Fine-tuning and testing

  • Part-of-speech tagging: python -m scripts.evaluate_pos <model_name_or_path>
  • German dialect identification: python -m scripts.evaluate_gdi <model_name_or_path>
  • Retrieval (no fine-tuning): python -m scripts.evaluate_retrieval <model_name_or_path>
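Retrieval needs no fine-tuning because it can be scored directly on sentence embeddings. The sketch below illustrates one common way to do this: rank candidates by cosine similarity and count how often the top-ranked candidate is the correct one. It is an assumption that `scripts.evaluate_retrieval` follows the same recipe; the function names and the toy vectors are hypothetical and stand in for real encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vecs, candidate_vecs):
    """For each query, return the index of the most similar candidate."""
    return [
        max(range(len(candidate_vecs)), key=lambda j: cosine(q, candidate_vecs[j]))
        for q in query_vecs
    ]


def top1_accuracy(query_vecs, candidate_vecs):
    """Fraction of queries whose best match has the same index (parallel data assumed)."""
    predictions = retrieve(query_vecs, candidate_vecs)
    hits = sum(1 for i, j in enumerate(predictions) if i == j)
    return hits / len(predictions)


# Toy embeddings: query i is meant to match candidate i.
queries = [[1.0, 0.1], [0.0, 1.0], [0.7, 0.7]]
candidates = [[0.9, 0.0], [0.1, 0.8], [0.5, 0.6]]

print(retrieve(queries, candidates))
print(top1_accuracy(queries, candidates))
```

In a real setup the toy vectors would be replaced by embeddings computed with the encoder under evaluation, for example for Swiss German sentences and their Standard German counterparts.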

License

  • This code repository: MIT license
  • Model weights: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

Citation

@inproceedings{vamvas-etal-2024-modular,
    title = "Modular Adaptation of Multilingual Encoders to Written {S}wiss {G}erman Dialect",
    author = {Vamvas, Jannis  and
      Aepli, No{\"e}mi  and
      Sennrich, Rico},
    editor = {V{\'a}zquez, Ra{\'u}l  and
      Mickus, Timothee  and
      Tiedemann, J{\"o}rg  and
      Vuli{\'c}, Ivan  and
      {\"U}st{\"u}n, Ahmet},
    booktitle = "Proceedings of the 1st Workshop on Modular and Open Multilingual NLP (MOOMIN 2024)",
    month = mar,
    year = "2024",
    address = "St Julians, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.moomin-1.3",
    pages = "16--23"
}
