mt-upc / OccGen_dataset

OCCGEN dataset and its metadata

OccGen_dataset

This repository contains the OccGen dataset and its metadata. It covers two use cases for our toolkit: translations_HR_data.csv holds the parallel data for EN, ES, AR, and RU, and translations_LR_data.csv holds the parallel data for EN and SW.

All of the data above is annotated and ready to be used for evaluation.
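As a quick way to inspect the two evaluation files before using them, the following is a minimal sketch that reads a CSV and reports its column names and row count. It assumes the files are comma-separated, UTF-8 encoded, and start with a header row; none of that is stated in this README, and the helper name `load_csv` is hypothetical. The column names are read from the file itself rather than guessed.

```python
import csv
from pathlib import Path


def load_csv(path):
    """Return (column_names, rows) for a CSV file.

    Assumes a comma-separated, UTF-8 file with a header row (an assumption,
    not something this README documents). Each row is a dict keyed by column.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        columns = list(reader.fieldnames or [])
    return columns, rows


if __name__ == "__main__":
    # File names as given in this README; they are expected in the current directory.
    for name in ("translations_HR_data.csv", "translations_LR_data.csv"):
        path = Path(name)
        if not path.exists():
            print(f"{name}: not found in the current directory")
            continue
        columns, rows = load_csv(path)
        print(f"{name}: {len(rows)} rows, columns = {columns}")
```

Run the script from the directory that holds the two CSV files. If the files use a different delimiter or encoding, adjust the `csv.DictReader` and `open` arguments accordingly.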

The OccGen toolkit is available on GitHub: https://github.com/mt-upc/OccGen_toolkit/

The training data is available at the following links:

En-Ar 122k: https://drive.google.com/file/d/18SP18I-9wb0H_ibrdP9eBc3hVfgk4q9X/view?usp=sharing

En-Es 1M: https://drive.google.com/file/d/1wahfqTiEgD89wHqnZuq2X4syDk15EgTz/view?usp=sharing

En-Ru 433k: https://drive.google.com/file/d/16KnpUWtbCkoXOEIwUJKFuYqK8woQPZDg/view?usp=sharing

The gender distribution in our training data is shown in the heatmap below:

[Figure: heatmap_genders]

Gender definitions are extracted from Wikipedia.
