ju-resplande / PLUE

Portuguese translation of the GLUE benchmark and Scitail dataset


PLUE: Portuguese Language Understanding Evaluation

https://fairytail.fandom.com/wiki/Plue

Portuguese translation of the GLUE benchmark, SNLI, and SciTail,
produced with the OPUS-MT model and Google Cloud Translation.

Getting Started

Datasets | Translation Tool
CoLA, MRPC, RTE, SST-2, STS-B, and WNLI | Google Cloud Translation
SNLI, MNLI, QNLI, QQP, and SciTail | OPUS-MT

Usage

Datasets 🤗

from datasets import load_dataset

data = load_dataset("dlb/plue", "cola")
# Available configurations:
# ['cola', 'sst2', 'mrpc', 'qqp_v2', 'stsb', 'snli', 'mnli', 'mnli_mismatched', 'mnli_matched', 'qnli', 'qnli_v2', 'rte', 'wnli', 'scitail']
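
As a rough sketch of how the configuration names listed above might be used, the snippet below wraps `load_dataset` in a small helper that checks the name before any download starts. The helper `load_plue_task` and the constant `PLUE_CONFIGS` are hypothetical names introduced here for illustration and are not part of PLUE; the list itself is copied from the comment above, and the actual call needs the `datasets` package and network access.

```python
# Configuration names copied from the usage comment above.
PLUE_CONFIGS = [
    "cola", "sst2", "mrpc", "qqp_v2", "stsb", "snli", "mnli",
    "mnli_mismatched", "mnli_matched", "qnli", "qnli_v2", "rte",
    "wnli", "scitail",
]


def load_plue_task(name):
    """Load one PLUE configuration after validating its name (hypothetical helper)."""
    if name not in PLUE_CONFIGS:
        raise ValueError(f"unknown PLUE configuration: {name!r}")
    # Imported lazily so the name check works even without `datasets` installed.
    from datasets import load_dataset
    return load_dataset("dlb/plue", name)
```

Calling `load_plue_task("cola")` is then equivalent to the `load_dataset("dlb/plue", "cola")` call shown above, while a misspelled name fails immediately with a `ValueError`.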

Manual download (for large files)

Larger files are not hosted in the GitHub repository.

Structure

├── code ____________ # translation code and dependency parsing
├── datasets
│   ├── CoLA
│   ├── MNLI
│   ├── MRPC
│   ├── QNLI
│   ├── QNLI_v2
│   ├── QQP_v2
│   ├── RTE
│   ├── SciTail
│   │   └── tsv_format
│   ├── SNLI
│   ├── SST-2
│   ├── STS-B
│   └── WNLI
└── pairs ____________ # translation pairs as JSON dictionary

Observations

  • GLUE provides two versions: first and second. We noticed that the versions differ only in the QNLI and QQP datasets, so we made QNLI available in both versions and QQP in the newest version only.
  • The LX parser, Binarizer code, and the NLTK word tokenizer were used to create dependency parses for the SNLI and MNLI datasets.
  • The SNLI train split is a ragged matrix (its lines do not all have the same number of fields), so we made two versions of the data available: train_raw.tsv contains the irregular lines and train.tsv excludes them.
  • Manual translations were made for 12 sentences due to translation errors.
  • Our translation code is outdated. We recommend using translation code from others.
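
The SNLI observation above describes separating irregular lines from a ragged TSV file. The snippet below is a minimal sketch of one way to do that, and is not the code used to build PLUE: it assumes a tab-separated file whose header row defines the expected number of fields, and the function name `split_regular_rows` and the sample rows are made up for illustration.

```python
import csv
import io


def split_regular_rows(lines, delimiter="\t"):
    """Return (regular, irregular) rows, using the header width as the reference."""
    # QUOTE_NONE treats quote characters as ordinary text, as in plain TSV dumps.
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader)
    regular, irregular = [header], []
    for row in reader:
        # A row is irregular when its field count differs from the header's.
        (regular if len(row) == len(header) else irregular).append(row)
    return regular, irregular


# Hypothetical sample: the second data line is missing one field.
sample = io.StringIO(
    "premise\thypothesis\tlabel\n"
    "Um homem corre.\tUma pessoa se move.\tentailment\n"
    "Linha quebrada\tneutral\n"
)
regular, irregular = split_regular_rows(sample)
print(len(regular) - 1, "regular rows;", len(irregular), "irregular rows")
```

Passing an open file handle for train_raw.tsv instead of the in-memory sample would apply the same check to the real data, assuming the header row has the expected width.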

Citing

@misc{Gomes2020,
  author = {GOMES, J. R. S.},
  title = {PLUE: Portuguese Language Understanding Evaluation},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ju-resplande/PLUE}},
  commit = {e7d01cb17173fe54deddd421dd735920964eb26f}
}

Acknowledgments

  • Deep Learning Brasil/CEIA
  • Cyberlabs

License: GNU General Public License v3.0

