McSinyx / viwikipi

Vietnamese Wikipedia Paraphase Identity experiments

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

viwikipi

Vietnamese Wikipedia Paraphase Identity experiments on Transformers

Prerequisites

Git LFS is required to clone this repository. After installation (the plugin is available on most GNU/Linux distributions), set it up using

git lfs install

Next, clone this repo and install the dependencies:

git clone https://github.com/McSinyx/viwikipi.git
cd viwikipi/
pip3 install -r requirements.txt

Usage

At the moment, GLUE for Bert can be fine-tuned using

tools/glue --model-type=bert --model=bert-base-multilingual-cased \
           --output-dir=bert --log-file=bert/$(date -Is).log

Filtering for highly lossed example may assist training, which can be done by

tools/filter -t bert -m bert mrpc/train.tsv train.tsv

Paraphase identity labeling can be achieved via

tools/label -t bert -m bert tests/test.json tests/submission.csv

XLM models can be used similarly by replacing bert* with xlm*.

Copying

Training data (mrpc, tests) are provided by Zalo AI Challenge 2019. These are derivatives of texts from Wikipedia, which are licensed under CC BY-SA 3.0.

Vietnamese WordNet (wn) is taken from zeloru/vietnamese-wordnet, which is a derivative of Wiktionary, which is also licensed under CC BY-SA 3.0.

For consistency, the two resources above and their modifications are released under CC BY-SA 4.0.

tools/glue is basically ripped-off from transformers with saner defaults and GNU-style long arguments. The original version is licenced under Apache-2.0.

The entire codebase, including this script, is then released under GNU AGPLv3.

About

Vietnamese Wikipedia Paraphase Identity experiments

License:GNU Affero General Public License v3.0


Languages

Language:Python 94.9%Language:Jupyter Notebook 5.1%