| Afrikaans | af | OpenSubtitles | top 1M vectors, all vectors, model binary | 324K | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 17M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 17M | |
| Arabic | ar | OpenSubtitles | top 1M vectors, all vectors, model binary | 188M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 120M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 308M | |
| Bulgarian | bg | OpenSubtitles | top 1M vectors, all vectors, model binary | 247M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 53M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 300M | |
| Bengali | bn | OpenSubtitles | top 1M vectors, all vectors, model binary | 2M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 19M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 21M | |
| Breton | br | OpenSubtitles | top 1M vectors, all vectors, model binary | 111K | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 8M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 8M | |
| Bosnian | bs | OpenSubtitles | top 1M vectors, all vectors, model binary | 92M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 13M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 105M | |
| Catalan | ca | OpenSubtitles | top 1M vectors, all vectors, model binary | 3M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 176M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 179M | |
| Czech | cs | OpenSubtitles | top 1M vectors, all vectors, model binary | 249M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 100M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 349M | |
| Danish | da | OpenSubtitles | top 1M vectors, all vectors, model binary | 87M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 56M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 143M | |
| German | de | OpenSubtitles | top 1M vectors, all vectors, model binary | 139M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 976M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 1B | |
| Greek | el | OpenSubtitles | top 1M vectors, all vectors, model binary | 271M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 58M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 329M | |
| English | en | OpenSubtitles | top 1M vectors, all vectors, model binary | 751M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 2B | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 3B | |
| Esperanto | eo | OpenSubtitles | top 1M vectors, all vectors, model binary | 382K | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 38M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 38M | |
| Spanish | es | OpenSubtitles | top 1M vectors, all vectors, model binary | 514M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 586M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 1B | |
| Estonian | et | OpenSubtitles | top 1M vectors, all vectors, model binary | 60M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 29M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 90M | |
| Basque | eu | OpenSubtitles | top 1M vectors, all vectors, model binary | 3M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 20M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 24M | |
| Farsi | fa | OpenSubtitles | top 1M vectors, all vectors, model binary | 45M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 87M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 132M | |
| Finnish | fi | OpenSubtitles | top 1M vectors, all vectors, model binary | 117M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 74M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 191M | |
| French | fr | OpenSubtitles | top 1M vectors, all vectors, model binary | 336M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 724M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 1B | |
| Galician | gl | OpenSubtitles | top 1M vectors, all vectors, model binary | 2M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 40M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 42M | |
| Hebrew | he | OpenSubtitles | top 1M vectors, all vectors, model binary | 170M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 133M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 303M | |
| Hindi | hi | OpenSubtitles | top 1M vectors, all vectors, model binary | 660K | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 31M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 32M | |
| Croatian | hr | OpenSubtitles | top 1M vectors, all vectors, model binary | 242M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 43M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 285M | |
| Hungarian | hu | OpenSubtitles | top 1M vectors, all vectors, model binary | 228M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 121M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 349M | |
| Armenian | hy | OpenSubtitles | top 1M vectors, all vectors, model binary | 24K | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 38M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 39M | |
| Indonesian | id | OpenSubtitles | top 1M vectors, all vectors, model binary | 65M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 69M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 134M | |
| Icelandic | is | OpenSubtitles | top 1M vectors, all vectors, model binary | 7M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 7M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 15M | |
| Italian | it | OpenSubtitles | top 1M vectors, all vectors, model binary | 278M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 476M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 754M | |
| Georgian | ka | OpenSubtitles | top 1M vectors, all vectors, model binary | 1M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 15M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 16M | |
| Kazakh | kk | OpenSubtitles | top 1M vectors, all vectors, model binary | 13K | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 18M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 18M | |
| Korean | ko | OpenSubtitles | top 1M vectors, all vectors, model binary | 7M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 63M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 70M | |
| Lithuanian | lt | OpenSubtitles | top 1M vectors, all vectors, model binary | 6M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 23M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 29M | |
| Latvian | lv | OpenSubtitles | top 1M vectors, all vectors, model binary | 2M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 14M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 16M | |
| Macedonian | mk | OpenSubtitles | top 1M vectors, all vectors, model binary | 20M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 27M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 47M | |
| Malayalam | ml | OpenSubtitles | top 1M vectors, all vectors, model binary | 2M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 10M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 12M | |
| Malay | ms | OpenSubtitles | top 1M vectors, all vectors, model binary | 12M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 29M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 41M | |
| Dutch | nl | OpenSubtitles | top 1M vectors, all vectors, model binary | 265M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 249M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 514M | |
| Norwegian | no | OpenSubtitles | top 1M vectors, all vectors, model binary | 46M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 91M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 136M | |
| Polish | pl | OpenSubtitles | top 1M vectors, all vectors, model binary | 250M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 232M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 483M | |
| Portuguese | pt | OpenSubtitles | top 1M vectors, all vectors, model binary | 258M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 238M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 496M | |
| Romanian | ro | OpenSubtitles | top 1M vectors, all vectors, model binary | 435M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 65M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 500M | |
| Russian | ru | OpenSubtitles | top 1M vectors, all vectors, model binary | 152M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 391M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 543M | |
| Sinhala | si | OpenSubtitles | top 1M vectors, all vectors, model binary | 3M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 6M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 9M | |
| Slovak | sk | OpenSubtitles | top 1M vectors, all vectors, model binary | 47M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 29M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 76M | |
| Slovenian | sl | OpenSubtitles | top 1M vectors, all vectors, model binary | 107M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 32M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 138M | |
| Albanian | sq | OpenSubtitles | top 1M vectors, all vectors, model binary | 12M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 18M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 30M | |
| Serbian | sr | OpenSubtitles | top 1M vectors, all vectors, model binary | 344M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 70M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 413M | |
| Swedish | sv | OpenSubtitles | top 1M vectors, all vectors, model binary | 101M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 143M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 245M | |
| Tamil | ta | OpenSubtitles | top 1M vectors, all vectors, model binary | 123K | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 17M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 17M | |
| Telugu | te | OpenSubtitles | top 1M vectors, all vectors, model binary | 103K | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 15M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 15M | |
| Tagalog | tl | OpenSubtitles | top 1M vectors, all vectors, model binary | 88K | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 7M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 7M | |
| Turkish | tr | OpenSubtitles | top 1M vectors, all vectors, model binary | 240M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 55M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 295M | |
| Ukrainian | uk | OpenSubtitles | top 1M vectors, all vectors, model binary | 5M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 163M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 168M | |
| Urdu | ur | OpenSubtitles | top 1M vectors, all vectors, model binary | 196K | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 16M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 16M | |
| Vietnamese | vi | OpenSubtitles | top 1M vectors, all vectors, model binary | 27M | word counts, bigram counts, trigram counts |
| | | Wikipedia | top 1M vectors, all vectors, model binary | 115M | word counts, bigram counts, trigram counts |
| | | Wikipedia + OpenSubtitles | top 1M vectors, all vectors, model binary | 143M | |