Extraction of Cross-Lingual Models from Monolingual Corpora

Requirements:

Perl and Bash interpreters

Description

This system takes as input two plain text files in two languages (non-parallel corpora) and a small bilingual dictionary with 5k entries, and builds a cross-lingual model based on syntactic dependencies. The dependency-based model is transparent, is stored in the folder freq, and is evaluated using a test dictionary. The model can be used to induce new bilingual pairs.

Syntactic parsing is carried out with Linguakit. Note that the parser can take more than 24 hours on large documents (1 GB or more). A simple version of Linguakit, including only PoS tagging and parsing for English, Spanish, Portuguese and Galician, is also included in this repository. If you wish to install the full tool, go to its own GitHub repository.

How to use

Build a new model

You can build (and evaluate) an English-Spanish model from raw texts using the script:

sh Build_model.sh

You just need two gzip-compressed files with raw text in the corpus folder, named for example corpus-en.txt.gz and corpus-es.txt.gz.
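As a minimal sketch of how the two corpus files could be prepared, the commands below compress two plain text files into the corpus folder under the file names mentioned above. The input names sample-en.txt and sample-es.txt and their contents are hypothetical placeholders, not files shipped with this repository.

```shell
# Hypothetical raw-text inputs; replace them with your own corpora.
printf 'This is a small English sample.\n' > sample-en.txt
printf 'Esta es una muestra breve.\n' > sample-es.txt

# Compress each file into the corpus folder under the expected names.
mkdir -p corpus
gzip -c sample-en.txt > corpus/corpus-en.txt.gz
gzip -c sample-es.txt > corpus/corpus-es.txt.gz

# Check the integrity of both archives before launching the build.
gzip -t corpus/corpus-en.txt.gz corpus/corpus-es.txt.gz && echo "corpus files OK"
```

Using gzip -c keeps the original text files untouched, which is convenient if the same corpora are reused elsewhere.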

To use other training dictionaries or other languages, copy the new dictionary into the dico folder in the appropriate format and uncomment the line sh run_seedTemplates.sh to create new seed bilingual templates.

Download and evaluate a pre-trained model

To evaluate an existing model pre-trained on the English and Spanish Wikipedias, you can use the following script:

sh Eval.sh

This script downloads a large pre-trained model and uses a test dictionary to evaluate it. The evaluation script only considers the words included in the test dictionary, so it is not a fully realistic evaluation.
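To illustrate what an evaluation restricted to the test dictionary means, here is a small sketch that is not the code of Eval.sh. It assumes two hypothetical tab-separated files, test_dict.tsv and predicted.tsv, each holding one "source<TAB>target" pair per line, and counts how many predicted translations match the test dictionary. Source words that are absent from the test dictionary are simply ignored.

```shell
# Hypothetical toy data; the real dictionary format may differ.
printf 'house\tcasa\ndog\tperro\ncat\tgato\n' > test_dict.tsv
printf 'house\tcasa\ndog\tgato\ncat\tgato\nbook\tlibro\n' > predicted.tsv

# First file fills the gold table; second file is scored against it.
# Simplification: one gold translation is kept per source word.
result=$(awk -F'\t' '
  NR == FNR    { gold[$1] = $2; next }
  ($1 in gold) { total++; if (gold[$1] == $2) hit++ }
  END          { printf "precision@1 = %d/%d", hit, total }
' test_dict.tsv predicted.tsv)

echo "$result"   # prints: precision@1 = 2/3
```

In this toy run the pair for "book" never enters the score, which is why a result obtained this way says nothing about words outside the test dictionary.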

How to cite

This system participated in the Cross-Lingual task of SemEval 2017, achieving the best results among the systems that used only Wikipedia corpora as training resources:

Gamallo, Pablo (2017). Citius at SemEval-2017 Task 2: Cross-Lingual Similarity from Comparable Corpora and Dependency-Based Contexts, Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), at ACL 2017, pp. 226-229, Vancouver, Canada. ISBN 978-1-945626-00-5.

Download the paper

About

Method to build transparent cross-lingual models from monolingual corpora

GNU General Public License v3.0
