jvparidon / subs2vec

Tools for training and evaluating word embeddings based on subtitles. Published as "subs2vec: Word embeddings from subtitles in 55 languages" in Behavior Research Methods.

Home Page: https://doi.org/10.3758/s13428-020-01406-3


subs2vec

Van Paridon & Thompson (2019) introduce pretrained embeddings and precomputed word/bigram/trigram frequencies in 55 languages. The files can be downloaded from the links in the table in the Downloading datasets section. Word vectors trained on subtitles are available, as well as vectors trained on Wikipedia and vectors trained on a combination of subtitles and Wikipedia (for best predictive performance).

This repository contains the subs2vec module, a number of Python 3.7 scripts and command line tools to evaluate a set of word vectors on semantic similarity, semantic and syntactic analogy, and lexical norm prediction tasks. In addition, the subs2vec.py script will take an OpenSubtitles archive or Wikipedia and go through all the steps to train a fastText model and produce word vectors as used in the paper associated with this repository.

Psycholinguists may be especially interested in the norms script, which evaluates the lexical norm prediction performance of a set of word vectors but can also be used to predict lexical norms for un-normed words. For a more detailed explanation, see the Extending lexical norms section under How to use.

The scripts in this repository require Python 3.7 and some additional libraries that are easily installed through pip. (If you want to use the subs2vec.py script to train your own word embeddings, you will also need compiled fastText and word2vec binaries.)

If you use any of the subs2vec code and/or pretrained models, please cite the preprint (Van Paridon & Thompson, 2019).

How to use

subs2vec is available through pip; installing it is as easy as running:
python3 -m pip install subs2vec
Any missing dependencies should be installed automatically.

Each submodule of subs2vec can then be run as a command line tool using the -m flag:
python3 -m subs2vec.submodule_name

Evaluating word embeddings

To evaluate word embeddings on analogies, semantic similarity, or lexical norm prediction as in Van Paridon & Thompson (2019), use:
python3 -m subs2vec.analogies fr french_word_vectors.vec
python3 -m subs2vec.similarities fr french_word_vectors.vec
python3 -m subs2vec.norms fr french_word_vectors.vec
subs2vec uses two-letter ISO 639-1 language codes, so French in the example is fr, English would be en, German would be de, etc.
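As background for the similarity task, the following is a minimal sketch of how word vectors in a text-format .vec file can be read and compared with cosine similarity. It is not the subs2vec implementation: read_vec and cosine are hypothetical helpers, the toy vectors are made up, and the sketch assumes the file starts with a "count dimension" header line followed by one space-separated word and vector per line.

```python
import io
import math


def read_vec(handle):
    """Read a text-format .vec file into a dict mapping word -> list of floats."""
    # Assumption: the first line is a header of the form "<word count> <dimension>".
    dim = int(handle.readline().split()[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:dim + 1]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy two-dimensional vectors standing in for a real .vec file.
demo = io.StringIO("3 2\nchat 1.0 0.0\nchien 1.0 1.0\nvoiture 0.0 1.0\n")
vectors = read_vec(demo)
print(round(cosine(vectors["chat"], vectors["chien"]), 3))
```

For a real file, pass an open file handle (for example open("french_word_vectors.vec", encoding="utf-8")) instead of the StringIO object.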

All datasets used for evaluation, including the lexical norms, are stored in subs2vec/evaluation/datasets/.
Results from Van Paridon & Thompson (2019) are in subs2vec/evaluation/article_results/.

Extending lexical norms

To extend lexical norms (either norms you have collected yourself, or norms provided in this repository) use:
python3 -m subs2vec.norms fr french_word_vectors.vec --extend_norms=french_norms_file.txt

The norms file should be a tab-separated text file. The first line should contain the column names, and the column containing the words should be called word. Unobserved cells should be left empty. If you are unsure how to generate this file, you can create your list in Excel and then use Save as... tab-delimited text.
For an overview of norms that come included in the repo (and their authors), see this list. For the norms datasets themselves, look inside this directory.
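If you would rather generate the norms file from a script than from Excel, a minimal sketch using Python's csv module could look like this. The write_norms helper and the valence and arousal column names are hypothetical; only the tab separator, the header line, the word column, and the empty cells for unobserved values come from the description above.

```python
import csv
import io


def write_norms(handle, rows, norm_columns):
    """Write rows (dicts) as tab-separated text with a header line and a 'word' column."""
    writer = csv.DictWriter(
        handle,
        fieldnames=["word"] + list(norm_columns),
        delimiter="\t",
        restval="",  # unobserved cells are left empty
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Made-up example values; the second word has no arousal rating.
rows = [
    {"word": "chat", "valence": 6.5, "arousal": 3.1},
    {"word": "orage", "valence": 3.2},
]
buffer = io.StringIO()
write_norms(buffer, rows, ["valence", "arousal"])
print(buffer.getvalue())
```

To write to disk, open the target with open("french_norms_file.txt", "w", encoding="utf-8", newline="") and pass that handle instead of the StringIO buffer.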

Extracting word frequencies

The subtitle corpus used to train subs2vec was also used to compile the word frequencies in SUBTLEX. That same corpus can of course be used to compile bigram and trigram frequencies as well.
To extract word, bigram, or trigram frequencies from a text file yourself, fr.txt for instance, use:
python3 -m subs2vec.frequencies fr.txt
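To illustrate what counting word, bigram, and trigram frequencies involves, here is a minimal sketch using collections.Counter. It is not the code of the frequencies module: count_ngrams is a hypothetical helper, and it assumes whitespace tokenization and that n-grams do not span line breaks.

```python
from collections import Counter


def count_ngrams(lines, n):
    """Count n-grams of length n, treating each line as a separate token sequence."""
    counts = Counter()
    for line in lines:
        tokens = line.split()  # assumption: whitespace-separated tokens
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


corpus = ["le chat dort", "le chat mange"]
unigrams = count_ngrams(corpus, 1)
bigrams = count_ngrams(corpus, 2)
trigrams = count_ngrams(corpus, 3)
print(bigrams.most_common(1))
```

Because count_ngrams accepts any iterable of lines, an open file handle can be passed directly, so the corpus never has to be held in memory as a whole.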

In general, however, we recommend downloading the precompiled frequencies files from [language archive] and looking frequencies up in those.
When looking up frequencies for specific words, bigrams, or trigrams, you may find that you cannot open the frequencies files (they can be very large). To retrieve items of interest use:
python3 -m subs2vec.lookup frequencies_file.tsv list_of_items.txt
Your list of items should be a simple text file, with each item you want to look up on its own line.
This lookup script works for looking up frequencies, but because it finds lines in any plain text file, it works for looking up word vectors in .vec files as well.
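The idea behind such a lookup can be sketched as a single streaming pass over the large file that keeps only lines whose first field is in the list of items. This is an illustration under assumptions, not the lookup module itself: the lookup function and its sep parameter are hypothetical, and matching on the first field is a guess at sensible behavior.

```python
def lookup(lines, items, sep="\t"):
    """Yield the lines whose first sep-delimited field is one of the requested items."""
    wanted = set(items)
    for line in lines:
        key = line.rstrip("\n").split(sep, 1)[0]
        if key in wanted:
            yield line


# Toy frequencies file: item, tab, count.
frequencies = ["le\t10\n", "chat\t4\n", "le chat\t2\n"]
found = list(lookup(frequencies, ["chat", "le chat"]))
print("".join(found), end="")
```

For a space-separated .vec file, the same sketch would be called with sep=" ". Streaming line by line means the file is never loaded in full, which is the point when the file is too large to open in an editor.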

Removing duplicate lines

subs2vec comes with a module that removes duplicate lines from text files. We used it to remove duplicate lines from training corpora, but it works for any text file.
To remove duplicates from fr.txt for example, use:
python3 -m subs2vec.deduplicate fr.txt
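Conceptually, line deduplication can be as simple as remembering which lines have been seen. The sketch below is one way to do it and is not taken from the deduplicate module: the function name is hypothetical, and it assumes that the first occurrence of each line is kept and that the original order is preserved.

```python
def deduplicate_lines(lines):
    """Yield each distinct line once, in order of first appearance."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


corpus = ["bonjour\n", "merci\n", "bonjour\n", "au revoir\n", "merci\n"]
unique = list(deduplicate_lines(corpus))
print("".join(unique), end="")
```

This keeps every distinct line in memory, so for very large corpora a variant that stores a hash of each line instead of the line itself would use less memory, at the cost of a small risk of hash collisions.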

Training models

If you want to reproduce the models used in Van Paridon & Thompson (2019), you can use the train_model module.
For instance, the steps to create a subtitle corpus and train a model on it are:

  1. Download a corpus:
    python3 -m subs2vec.download fr subs
  2. Clean the corpus:
    python3 -m subs2vec.clean_subs fr --strip --join
  3. Deduplicate the lines in the corpus:
    python3 -m subs2vec.deduplicate fr.txt
  4. Train a fastText model on the subtitle corpus:
    python3 -m subs2vec.train_model fr subs dedup.fr.txt
    This last step requires the binaries for fastText and word2phrase (part of word2vec) to be downloaded, built, and discoverable on your system (i.e., on your PATH).

For more detailed training options:
python3 -m subs2vec.train_model --help
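The four steps above can also be scripted. The following sketch only assembles the commands exactly as listed and checks whether the required binaries are on your PATH; it does not run anything by itself. The helper names are hypothetical, and the executable names fasttext and word2phrase are assumptions about how the binaries are named on your system.

```python
import shutil
import sys


def pipeline_commands(lang):
    """Return the four command lines from the steps above for a given language code."""
    py = [sys.executable, "-m"]  # the current interpreter, in place of python3
    return [
        py + ["subs2vec.download", lang, "subs"],
        py + ["subs2vec.clean_subs", lang, "--strip", "--join"],
        py + ["subs2vec.deduplicate", f"{lang}.txt"],
        py + ["subs2vec.train_model", lang, "subs", f"dedup.{lang}.txt"],
    ]


def missing_binaries(names=("fasttext", "word2phrase")):
    """Return the names that cannot be found on PATH (binary names are assumed)."""
    return [name for name in names if shutil.which(name) is None]


commands = pipeline_commands("fr")
for command in commands:
    print(" ".join(command[1:]))
```

To actually run the pipeline, each command can be passed to subprocess.run(command, check=True) in order, so that a failing step stops the sequence before the next one starts.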

API

For more detailed documentation of the package modules and API, see subs2vec.readthedocs.io

Downloading datasets

This table contains links to the top 1 million word vectors in each language, as well as all vectors, model binaries, and the word, bigram, and trigram frequencies in the subtitle and Wikipedia corpora. If you use these pretrained vectors/models, please cite the preprint (Van Paridon & Thompson, 2019).

For every corpus (OpenSubtitles, Wikipedia, and Wikipedia + OpenSubtitles) the downloads are: top 1M vectors, all vectors, and model binary. For the OpenSubtitles and Wikipedia corpora, word counts, bigram counts, and trigram counts are available as well. The columns below give the corpus word count for each corpus.

language     lang   OpenSubtitles   Wikipedia   Wikipedia + OpenSubtitles
Afrikaans    af     324K            17M         17M
Arabic       ar     188M            120M        308M
Bulgarian    bg     247M            53M         300M
Bengali      bn     2M              19M         21M
Breton       br     111K            8M          8M
Bosnian      bs     92M             13M         105M
Catalan      ca     3M              176M        179M
Czech        cs     249M            100M        349M
Danish       da     87M             56M         143M
German       de     139M            976M        1B
Greek        el     271M            58M         329M
English      en     751M            2B          3B
Esperanto    eo     382K            38M         38M
Spanish      es     514M            586M        1B
Estonian     et     60M             29M         90M
Basque       eu     3M              20M         24M
Farsi        fa     45M             87M         132M
Finnish      fi     117M            74M         191M
French       fr     336M            724M        1B
Galician     gl     2M              40M         42M
Hebrew       he     170M            133M        303M
Hindi        hi     660K            31M         32M
Croatian     hr     242M            43M         285M
Hungarian    hu     228M            121M        349M
Armenian     hy     24K             38M         39M
Indonesian   id     65M             69M         134M
Icelandic    is     7M              7M          15M
Italian      it     278M            476M        754M
Georgian     ka     1M              15M         16M
Kazakh       kk     13K             18M         18M
Korean       ko     7M              63M         70M
Lithuanian   lt     6M              23M         29M
Latvian      lv     2M              14M         16M
Macedonian   mk     20M             27M         47M
Malayalam    ml     2M              10M         12M
Malay        ms     12M             29M         41M
Dutch        nl     265M            249M        514M
Norwegian    no     46M             91M         136M
Polish       pl     250M            232M        483M
Portuguese   pt     258M            238M        496M
Romanian     ro     435M            65M         500M
Russian      ru     152M            391M        543M
Sinhala      si     3M              6M          9M
Slovak       sk     47M             29M         76M
Slovenian    sl     107M            32M         138M
Albanian     sq     12M             18M         30M
Serbian      sr     344M            70M         413M
Swedish      sv     101M            143M        245M
Tamil        ta     123K            17M         17M
Telugu       te     103K            15M         15M
Tagalog      tl     88K             7M          7M
Turkish      tr     240M            55M         295M
Ukrainian    uk     5M              163M        168M
Urdu         ur     196K            16M         16M
Vietnamese   vi     27M             115M        143M


License: MIT License

