ChristineBasta / Gebiotoolkit

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

GebioToolkit

We present the GebioToolkit for extracting multilingual parallel corpora at sentence level, with document and gender information from Wikipedia biographies. Despite the gender inequalities present in Wikipedia, the toolkit has been designed to extract corpus balanced in gender. While our toolkit is able to customize the languages for which we are extracting the multilingual corpus, in this work, we extracted a corpus of 2000 sentences in English, Spanish and Catalan, which has been post-edited by native speakers to be valid as test dataset for machine translation.

Dependencies

Usage

Corpus extractor

Change domain

If we want to change the domain of gebiotoolkit, we need to generate new files and delete the ones from the folder dictionary_of_files. To generate new ones, we need to get a list of wikipedia entries, the easiest way to achieve this is by using the petscan tool (https://petscan.wmflabs.org/) in which allows you to perform it easily. Next we execute the following code:

python3 wp_api_language_search.py -csv new_list.csv 

References

For more information, please check https://arxiv.org/pdf/1912.04778.pdf

About

License:MIT License


Languages

Language:JavaScript 72.9%Language:Python 27.1%Language:Shell 0.0%