subs2vec

Van Paridon & Thompson (2019) introduce pretrained embeddings and precomputed word/bigram/trigram frequencies in 55 languages. The files can be downloaded from the links in the table under Downloading datasets. Word vectors trained on subtitles are available, as well as vectors trained on Wikipedia and vectors trained on a combination of subtitles and Wikipedia (for best predictive performance).

This repository contains the subs2vec module, a number of Python 3.7 scripts and command line tools to evaluate a set of word vectors on semantic similarity, semantic and syntactic analogy, and lexical norm prediction tasks. In addition, the subs2vec.py script will take an OpenSubtitles archive or Wikipedia and go through all the steps to train a fastText model and produce word vectors as used in the paper associated with this repository.

Psycholinguists may be especially interested in the norms script, which evaluates the lexical norm prediction performance of a set of word vectors, but can also be used to predict lexical norms for un-normed words. For a more detailed explanation, see the Extending lexical norms section under How to use.

The scripts in this repository require Python 3.7 and some additional libraries that are easily installed through pip. (If you want to use the subs2vec.py script to train your own word embeddings, you will also need compiled fastText and word2vec binaries.)

If you use any of the subs2vec code and/or pretrained models, please cite the preprint (Van Paridon & Thompson, 2019).

How to use

subs2vec is available through pip; installing it is as easy as running:
python3 -m pip install subs2vec
Any missing dependencies should be installed automatically.

Each submodule of subs2vec can then be run as a command line tool using the -m flag:
python3 -m subs2vec.submodule_name

Evaluating word embeddings

To evaluate word embeddings on analogies, semantic similarity, or lexical norm prediction as in Van Paridon & Thompson (2019), use:
python3 -m subs2vec.analogies fr french_word_vectors.vec
python3 -m subs2vec.similarities fr french_word_vectors.vec
python3 -m subs2vec.norms fr french_word_vectors.vec
subs2vec uses the two-letter ISO language codes, so French in the example is fr, English would be en, German would be de, etc.
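Semantic similarity evaluations of this kind are commonly based on the cosine similarity between two word vectors. The following is a minimal sketch of that computation, not the code used by subs2vec.similarities. It assumes the .vec file follows the fastText text format (a header line with word count and dimensionality, then one word and its values per line); the function names and the toy vectors are made up for illustration.

```python
import io
import math


def load_vectors(handle, max_words=None):
    """Read fastText-style text vectors into a dict of word -> list of floats."""
    next(handle)  # skip the "n_words dim" header line
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if max_words is not None and len(vectors) >= max_words:
            break
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Tiny in-memory stand-in for a real .vec file (hypothetical numbers).
toy_vec_file = io.StringIO("3 2\nchat 1.0 0.0\nchien 0.8 0.6\nvoiture 0.0 1.0\n")
vectors = load_vectors(toy_vec_file)
sim_chat_chien = cosine(vectors['chat'], vectors['chien'])
sim_chat_voiture = cosine(vectors['chat'], vectors['voiture'])
```

For a real file, pass an open file handle instead of the StringIO object, for example open('french_word_vectors.vec', encoding='utf-8'), and consider setting max_words to keep memory use manageable.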

All datasets used for evaluation, including the lexical norms, are stored in subs2vec/evaluation/datasets/.
Results from Van Paridon & Thompson (2019) are in subs2vec/evaluation/article_results/.

Extending lexical norms

To extend lexical norms (either norms you have collected yourself, or norms provided in this repository) use:
python3 -m subs2vec.norms fr french_word_vectors.vec --extend_norms=french_norms_file.txt

The norms file should be a tab-separated text file. The first line should contain the column names, and the column containing the words should be called word. Unobserved cells should be left empty. If you are unsure how to generate this file, you can create your list in Excel and then use Save as... tab-delimited text.
For an overview of norms that come included in the repo (and their authors), see this list. For the norms datasets themselves, look inside this directory.
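If you prefer to generate the norms file programmatically rather than through Excel, the sketch below writes a tab-separated file with a word column and leaves unobserved cells empty. The norm names (valence, arousal) and the ratings are hypothetical and only illustrate the layout described above.

```python
import csv
import io

# Hypothetical ratings; None marks a word that was not rated on that norm.
rows = [
    {'word': 'chat', 'valence': 6.1, 'arousal': 3.2},
    {'word': 'chien', 'valence': 6.5, 'arousal': None},
]


def write_norms(rows, handle, columns):
    """Write rows as tab-separated text, with column names on the first line."""
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    writer.writerow(columns)
    for row in rows:
        # Unobserved cells are written as empty strings.
        writer.writerow(['' if row.get(col) is None else row[col] for col in columns])


buffer = io.StringIO()
write_norms(rows, buffer, ['word', 'valence', 'arousal'])
norms_text = buffer.getvalue()
```

To write to disk instead, replace the StringIO buffer with open('french_norms_file.txt', 'w', newline='', encoding='utf-8').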

Extracting word frequencies

The subtitle corpus used to train subs2vec was also used to compile the word frequencies in SUBTLEX. That same corpus can of course be used to compile bigram and trigram frequencies as well.
To extract word, bigram, or trigram frequencies from a text file yourself, fr.txt for instance, use:
python3 -m subs2vec.frequencies fr.txt
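To make concrete what word, bigram, and trigram frequencies are, here is a minimal counting sketch. It is not the implementation of subs2vec.frequencies: it assumes whitespace tokenization and n-grams that do not cross line boundaries, and the function name count_ngrams is made up for illustration.

```python
from collections import Counter


def count_ngrams(lines, n):
    """Count n-grams of length n within each line, splitting on whitespace."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - n + 1):
            counts[' '.join(tokens[i:i + n])] += 1
    return counts


# Toy corpus standing in for the lines of fr.txt.
lines = ['le chat dort', 'le chat mange']
unigrams = count_ngrams(lines, 1)
bigrams = count_ngrams(lines, 2)
trigrams = count_ngrams(lines, 3)
```

Because the function iterates over lines lazily, an open file handle can be passed in place of the list, although the counters themselves can still grow large for a full corpus.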

In general, however, we recommend downloading the precompiled frequencies files from [language archive] and looking frequencies up in those.
When looking up frequencies for specific words, bigrams, or trigrams, you may find that you cannot open the frequencies files (they can be very large). To retrieve items of interest use:
python3 -m subs2vec.lookup frequencies_file.tsv list_of_items.txt
Your list of items should be a simple text file, with each item you want to look up on its own line.
This lookup script works for looking up frequencies, but because it finds lines in any plain text file, it works for looking up word vectors in .vec files as well.
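The general idea behind this kind of lookup is to stream through the large file one line at a time and keep only the lines of interest, so the whole file never has to fit in memory. The sketch below illustrates that idea and is not the code of subs2vec.lookup. It assumes the item to match is the first field on each line and that fields are tab-separated in frequencies files (use a space for .vec files); the function name and toy data are hypothetical.

```python
import io


def lookup(data_handle, items, delimiter='\t'):
    """Return the lines whose first field is one of the requested items."""
    wanted = set(items)
    found = []
    for line in data_handle:
        key = line.rstrip('\n').split(delimiter, 1)[0]
        if key in wanted:
            found.append(line.rstrip('\n'))
    return found


# Toy stand-ins for a frequencies file and a list of items (made-up counts).
toy_frequencies = io.StringIO("le\t2\nle chat\t2\nchat\t2\ndort\t1\n")
hits = lookup(toy_frequencies, ['le chat', 'dort'])
```

Splitting on the tab rather than on any whitespace lets multi-word items such as bigrams be matched as a single key.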

Removing duplicate lines

subs2vec comes with a module that removes duplicate lines from text files. We used it to remove duplicate lines from training corpora, but it works for any text file.
To remove duplicates from fr.txt for example, use:
python3 -m subs2vec.deduplicate fr.txt
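For illustration, one simple way to remove duplicate lines is to remember every line already seen and skip repeats. This is only a sketch under the assumption that the first occurrence of each line is kept and the original order is preserved; subs2vec.deduplicate may work differently (for example, to limit memory use on very large corpora).

```python
def deduplicate(lines):
    """Yield each distinct line once, in order of first appearance."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


# Toy corpus standing in for the lines of fr.txt.
corpus = ['bonjour\n', 'merci\n', 'bonjour\n', 'au revoir\n', 'merci\n']
unique_lines = list(deduplicate(corpus))
```

The set of seen lines grows with the number of distinct lines, so this approach trades memory for simplicity.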

Training models

If you want to reproduce models as used in Van Paridon & Thompson (2019), you can use the train_model module.
For instance, the steps to create a subtitle corpus and train a model on it are:

Download a corpus:
python3 -m subs2vec.download fr subs
Clean the corpus:
python3 -m subs2vec.clean_subs fr --strip --join
Deduplicate the lines in the corpus:
python3 -m subs2vec.deduplicate fr.txt
Train a fastText model on the subtitle corpus:
python3 -m subs2vec.train_model fr subs dedup.fr.txt
This last step requires the binaries for fastText and word2phrase (part of word2vec) to be downloaded, built, and discoverable on your system (i.e., on your PATH).
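The four steps above can also be collected in a small Python helper that first checks whether the required binaries are on your PATH. This is a sketch under stated assumptions: the executable names fasttext and word2phrase are assumed, and the file names for languages other than fr are assumed to follow the fr.txt and dedup.fr.txt pattern shown above. The helper only builds the commands; nothing is executed.

```python
import shutil


def subtitle_pipeline(lang):
    """Build the download, clean, deduplicate, and train commands for one language."""
    return [
        ['python3', '-m', 'subs2vec.download', lang, 'subs'],
        ['python3', '-m', 'subs2vec.clean_subs', lang, '--strip', '--join'],
        ['python3', '-m', 'subs2vec.deduplicate', f'{lang}.txt'],
        ['python3', '-m', 'subs2vec.train_model', lang, 'subs', f'dedup.{lang}.txt'],
    ]


def missing_binaries(names=('fasttext', 'word2phrase')):
    """Return the assumed binary names that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


commands = subtitle_pipeline('fr')
missing = missing_binaries()
```

To run the pipeline, call subprocess.run(command, check=True) on each entry of commands once missing is empty.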

For more detailed training options:
python3 -m subs2vec.train_model --help

API

For more detailed documentation of the package modules and API, see subs2vec.readthedocs.io

Downloading datasets

This table contains links to the top 1 million word vectors in each language, as well as all vectors, model binaries, and the word, bigram, and trigram frequencies in the subtitle and Wikipedia corpora. If you use these pretrained vectors/models, please cite the preprint (Van Paridon & Thompson, 2019).

language	lang	corpus	vectors	corpus word count	ngram counts
Afrikaans	af	OpenSubtitles	top 1M vectors all vectors model binary	324K	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	17M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	17M
Arabic	ar	OpenSubtitles	top 1M vectors all vectors model binary	188M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	120M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	308M
Bulgarian	bg	OpenSubtitles	top 1M vectors all vectors model binary	247M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	53M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	300M
Bengali	bn	OpenSubtitles	top 1M vectors all vectors model binary	2M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	19M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	21M
Breton	br	OpenSubtitles	top 1M vectors all vectors model binary	111K	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	8M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	8M
Bosnian	bs	OpenSubtitles	top 1M vectors all vectors model binary	92M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	13M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	105M
Catalan	ca	OpenSubtitles	top 1M vectors all vectors model binary	3M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	176M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	179M
Czech	cs	OpenSubtitles	top 1M vectors all vectors model binary	249M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	100M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	349M
Danish	da	OpenSubtitles	top 1M vectors all vectors model binary	87M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	56M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	143M
German	de	OpenSubtitles	top 1M vectors all vectors model binary	139M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	976M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	1B
Greek	el	OpenSubtitles	top 1M vectors all vectors model binary	271M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	58M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	329M
English	en	OpenSubtitles	top 1M vectors all vectors model binary	751M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	2B	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	3B
Esperanto	eo	OpenSubtitles	top 1M vectors all vectors model binary	382K	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	38M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	38M
Spanish	es	OpenSubtitles	top 1M vectors all vectors model binary	514M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	586M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	1B
Estonian	et	OpenSubtitles	top 1M vectors all vectors model binary	60M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	29M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	90M
Basque	eu	OpenSubtitles	top 1M vectors all vectors model binary	3M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	20M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	24M
Farsi	fa	OpenSubtitles	top 1M vectors all vectors model binary	45M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	87M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	132M
Finnish	fi	OpenSubtitles	top 1M vectors all vectors model binary	117M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	74M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	191M
French	fr	OpenSubtitles	top 1M vectors all vectors model binary	336M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	724M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	1B
Galician	gl	OpenSubtitles	top 1M vectors all vectors model binary	2M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	40M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	42M
Hebrew	he	OpenSubtitles	top 1M vectors all vectors model binary	170M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	133M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	303M
Hindi	hi	OpenSubtitles	top 1M vectors all vectors model binary	660K	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	31M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	32M
Croatian	hr	OpenSubtitles	top 1M vectors all vectors model binary	242M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	43M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	285M
Hungarian	hu	OpenSubtitles	top 1M vectors all vectors model binary	228M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	121M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	349M
Armenian	hy	OpenSubtitles	top 1M vectors all vectors model binary	24K	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	38M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	39M
Indonesian	id	OpenSubtitles	top 1M vectors all vectors model binary	65M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	69M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	134M
Icelandic	is	OpenSubtitles	top 1M vectors all vectors model binary	7M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	7M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	15M
Italian	it	OpenSubtitles	top 1M vectors all vectors model binary	278M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	476M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	754M
Georgian	ka	OpenSubtitles	top 1M vectors all vectors model binary	1M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	15M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	16M
Kazakh	kk	OpenSubtitles	top 1M vectors all vectors model binary	13K	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	18M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	18M
Korean	ko	OpenSubtitles	top 1M vectors all vectors model binary	7M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	63M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	70M
Lithuanian	lt	OpenSubtitles	top 1M vectors all vectors model binary	6M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	23M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	29M
Latvian	lv	OpenSubtitles	top 1M vectors all vectors model binary	2M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	14M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	16M
Macedonian	mk	OpenSubtitles	top 1M vectors all vectors model binary	20M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	27M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	47M
Malayalam	ml	OpenSubtitles	top 1M vectors all vectors model binary	2M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	10M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	12M
Malay	ms	OpenSubtitles	top 1M vectors all vectors model binary	12M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	29M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	41M
Dutch	nl	OpenSubtitles	top 1M vectors all vectors model binary	265M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	249M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	514M
Norwegian	no	OpenSubtitles	top 1M vectors all vectors model binary	46M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	91M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	136M
Polish	pl	OpenSubtitles	top 1M vectors all vectors model binary	250M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	232M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	483M
Portuguese	pt	OpenSubtitles	top 1M vectors all vectors model binary	258M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	238M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	496M
Romanian	ro	OpenSubtitles	top 1M vectors all vectors model binary	435M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	65M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	500M
Russian	ru	OpenSubtitles	top 1M vectors all vectors model binary	152M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	391M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	543M
Sinhala	si	OpenSubtitles	top 1M vectors all vectors model binary	3M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	6M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	9M
Slovak	sk	OpenSubtitles	top 1M vectors all vectors model binary	47M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	29M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	76M
Slovenian	sl	OpenSubtitles	top 1M vectors all vectors model binary	107M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	32M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	138M
Albanian	sq	OpenSubtitles	top 1M vectors all vectors model binary	12M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	18M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	30M
Serbian	sr	OpenSubtitles	top 1M vectors all vectors model binary	344M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	70M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	413M
Swedish	sv	OpenSubtitles	top 1M vectors all vectors model binary	101M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	143M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	245M
Tamil	ta	OpenSubtitles	top 1M vectors all vectors model binary	123K	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	17M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	17M
Telugu	te	OpenSubtitles	top 1M vectors all vectors model binary	103K	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	15M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	15M
Tagalog	tl	OpenSubtitles	top 1M vectors all vectors model binary	88K	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	7M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	7M
Turkish	tr	OpenSubtitles	top 1M vectors all vectors model binary	240M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	55M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	295M
Ukrainian	uk	OpenSubtitles	top 1M vectors all vectors model binary	5M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	163M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	168M
Urdu	ur	OpenSubtitles	top 1M vectors all vectors model binary	196K	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	16M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	16M
Vietnamese	vi	OpenSubtitles	top 1M vectors all vectors model binary	27M	word counts bigram counts trigram counts
		Wikipedia	top 1M vectors all vectors model binary	115M	word counts bigram counts trigram counts
		Wikipedia + OpenSubtitles	top 1M vectors all vectors model binary	143M
