BotCenter / spanishWordEmbeddings

Spanish Word Embeddings computed from large corpora and different sizes using fastText.


Spanish Word Embeddings


Spanish word embeddings computed with fastText on the Spanish Unannotated Corpora.

Pre-Processing

The data was already preprocessed in the Spanish Unannotated Corpora repository: lowercased, with multiple spaces collapsed, URLs removed, among other steps. We also used the punctuation-splitting script included in that repository.

With that tokenization, the 2.6-billion-word corpus yielded 3.4 billion tokens.

For the new L model we used the updated version of the Spanish Unannotated Corpora, which contains 3 billion words, and applied the same preprocessing as for the other models.
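The preprocessing steps described above (lowercasing, URL removal, punctuation splitting, whitespace collapsing) can be sketched roughly as follows. This is a hypothetical re-implementation for illustration, not the actual script from the Spanish Unannotated Corpora repository:

```python
import re

def preprocess(line: str) -> str:
    """Approximate the corpus preprocessing described above (illustrative only)."""
    line = line.lower()                                  # lowercase
    line = re.sub(r"https?://\S+", " ", line)            # remove URLs
    line = re.sub(r"([.,;:!?¡¿()\"'])", r" \1 ", line)   # split on punctuation
    line = re.sub(r"\s+", " ", line).strip()             # collapse multiple spaces
    return line

print(preprocess("Visita https://example.com ¡Hola, mundo!"))
# → visita ¡ hola , mundo !
```

Splitting punctuation into separate tokens is what makes the token count (3.4B) higher than the raw word count (2.6B).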

fastText Parameters

We used fastText's default parameters for the skipgram task, except for the number of epochs, which we set to 20 instead of the default 5.
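With the fastText Python bindings (`pip install fasttext`), a training run matching this setup would look roughly like the sketch below. The corpus filename and output path are placeholders; all other hyperparameters fall back to fastText defaults, as described above:

```python
import os

# Only the epoch count deviates from fastText's defaults (5 -> 20).
params = dict(model="skipgram", epoch=20)

try:
    import fasttext  # requires `pip install fasttext`
except ImportError:
    fasttext = None

# "corpus.txt" is a placeholder for the preprocessed corpus file.
if fasttext is not None and os.path.exists("corpus.txt"):
    model = fasttext.train_unsupervised("corpus.txt", **params)
    model.save_model("embeddings.bin")
```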

Evaluation

We evaluated our word embeddings on SemEval-2017 Task 2 (Subtask 1) using the evaluation script provided by the MUSE library, obtaining these results:

|       | XS      | S       | M       | L       | new L   |
|-------|---------|---------|---------|---------|---------|
| Score | 0.59150 | 0.67589 | 0.72345 | 0.74676 | 0.72940 |

To the best of our knowledge, the L model was the best-performing Spanish word embedding model at the date of publication.
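The evaluation above follows the usual word-similarity protocol: for each word pair in the benchmark, compute the cosine similarity of the two embeddings and then the rank (Spearman) correlation between those similarities and the human-annotated gold scores. A minimal self-contained sketch with toy vectors (all words and scores below are made up for illustration, not SemEval data):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(xs, ys):
    """Spearman correlation, assuming no tied values (Pearson on ranks)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

# Toy embeddings and gold similarity judgments (hypothetical).
emb = {"gato": [1.0, 0.2], "perro": [0.9, 0.3], "coche": [0.1, 1.0]}
gold = [("gato", "perro", 0.9), ("gato", "coche", 0.2), ("perro", "coche", 0.3)]

pred = [cosine(emb[a], emb[b]) for a, b, _ in gold]
score = spearman(pred, [s for _, _, s in gold])
print(score)  # → 1.0 (toy predictions rank the pairs exactly like the gold scores)
```

The real benchmark uses hundreds of pairs, so a perfect correlation like this toy 1.0 would not occur in practice; the table scores above are in this same [-1, 1] correlation scale.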

Download

Reference


[1] P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, "Enriching Word Vectors with Subword Information", Transactions of the Association for Computational Linguistics, 2017.

About


License: MIT