severinsimmler / word-embeddings

A collection of code for embedding text in a multi-dimensional space.

Installation

This project uses Pipenv, which automatically creates and manages a virtualenv for its dependencies. Install Pipenv with pip:

$ pip install pipenv

To install the project’s dependencies:

$ pipenv install

You can spawn a shell:

$ pipenv shell

or run a command inside the virtual environment, for example:

$ pipenv run python cli.py --help

Getting started

$ python cli.py --help
usage: matrix-tool [-h] [--corpus CORPUS] [--suffix SUFFIX] [--lowercase]
                   [--mfw MFW] [--mfw_pkl MFW_PKL] [--n-mfw N_MFW]
                   [--window WINDOW] [--sentences] [--output OUTPUT]
                   [--stopwords STOPWORDS] [--term TERM] [--sublinear_tf]
                   [--tfidf TFIDF]

CLI tool to process a Wikipedia dump to a word-word matrix.

optional arguments:
  -h, --help            show this help message and exit
  --corpus CORPUS       Path to corpus directory.
  --suffix SUFFIX       Suffix of the text files.
  --lowercase           Use this parameter to lowercase all letters.
  --mfw MFW             Path to JSON file with most frequent words.
  --mfw_pkl MFW_PKL     Path to pickle file with most frequent words.
  --n-mfw N_MFW         Count tokens and use the n most frequent words.
  --window WINDOW       Context window size.
  --sentences           Use sentences instead of lines.
  --output OUTPUT       Path to output directory.
  --stopwords STOPWORDS
                        Optional external stopwords list.
  --term TERM           Get top 50 nearest neighbors for this term.
  --tfidf TFIDF         Use tf-idf weighting on the word-word matrix. Allowed
                        values are: document, global_transform.
  --sublinear_tf        Apply sublinear tf scaling, i.e. replace tf with 1 +
                        log(tf).
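
The --sublinear_tf option replaces each raw count tf with 1 + log(tf), which dampens the influence of very frequent co-occurrences. A minimal sketch of that scaling, assuming the natural logarithm and leaving zero counts at zero (the helper name sublinear_tf is hypothetical and not taken from cli.py):

```python
import math


def sublinear_tf(tf):
    """Return 1 + log(tf) for positive counts; zero counts stay zero."""
    # Assumption: natural logarithm, zero entries are left untouched.
    return 1.0 + math.log(tf) if tf > 0 else 0.0


raw_counts = [0, 1, 10, 1000]
scaled = [sublinear_tf(tf) for tf in raw_counts]
for tf, value in zip(raw_counts, scaled):
    print(f"{tf}\t{value:.4f}")
```

With this scaling a count of 1000 weighs roughly eight times as much as a count of 1, rather than a thousand times as much.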

Example

The tables below list the top 20 nearest neighbors of the term stadt (German for "city") under different parameter settings. Co-occurrence frequencies were counted by sliding a context window of size 2 over the entire corpus and were then weighted with TF-IDF. Neighbors are ranked by the cosine similarity between their vectors and the vector for stadt. The corpus comprised 1,981,189 articles from the German Wikipedia.
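
To make the procedure concrete, here is a small, self-contained sketch of window-based co-occurrence counting followed by a cosine-similarity ranking. It is an illustration under stated assumptions, not the code in cli.py: the function names, the toy sentence, and the symmetric window (2 tokens to the left and 2 to the right) are all hypothetical choices, and the TF-IDF weighting step is left out.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, window=2):
    """Count how often each word occurs within `window` positions of another."""
    counts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a token is not its own context
                counts[target][tokens[j]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(counts, term, n=20):
    """Rank all other words by cosine similarity to `term`."""
    target = counts.get(term, Counter())  # unknown terms get an empty vector
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in counts.items()
        if other != term
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]


# Toy corpus (hypothetical), already lowercased and tokenized.
tokens = "die stadt liegt am fluss das dorf liegt am see".split()
counts = cooccurrence_counts(tokens, window=2)
for word, score in nearest_neighbors(counts, "stadt", n=3):
    print(f"{word}\t{score:.6f}")
```

Storing each row as a Counter keeps the sketch dependency-free; for a corpus the size of Wikipedia a sparse matrix would be the more realistic choice.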

Context for IDF (in TF-IDF)

$ python cli.py --corpus wikipedia --suffix .txt --lowercase --mfw mfw.json --tfidf global_transform --term stadt
Rank  Term              Cosine similarity
1     gemeinde          0.477483
2     marktes           0.425700
3     kreisstadt        0.423766
4     gelegen           0.381238
5     stadtteil         0.380550
6     ortes             0.380051
7     stadtteils        0.377508
8     ansässig          0.357279
9     ort               0.349270
10    ortslage          0.345554
11    dorf              0.344673
12    kernstadt         0.336407
13    ortschaft         0.335691
14    stadtgemeinde     0.332217
15    hof               0.330435
16    ortsteils         0.326245
17    berlins           0.324712
18    marktgemeinde     0.323925
19    landkreise        0.316252
20    wohnplatz         0.315749
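
One plausible reading of using contexts for the IDF is that document frequencies are counted over the rows of the word-word matrix itself, so that a context word occurring with almost every target word is down-weighted. The sketch below follows that reading. It is an assumption about what global_transform does, not a description of the repository's code, and the function name tfidf_weight and the toy matrix are hypothetical. The smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, is the one scikit-learn uses by default.

```python
import math


def tfidf_weight(matrix):
    """Weight a dense count matrix with a smoothed IDF computed over its rows."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    # Document frequency: in how many rows is each column nonzero?
    df = [sum(1 for row in matrix if row[col] > 0) for col in range(n_cols)]
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = [math.log((1 + n_rows) / (1 + d)) + 1 for d in df]
    return [[count * weight for count, weight in zip(row, idf)] for row in matrix]


# Toy word-word counts (hypothetical): rows are target words, columns are context words.
matrix = [
    [2, 1, 0],
    [1, 1, 0],
    [0, 1, 3],
]
weighted = tfidf_weight(matrix)
for row in weighted:
    print("\t".join(f"{value:.4f}" for value in row))
```

The middle column is nonzero in every row, so its IDF is exactly 1 and its counts pass through unchanged, while the rarer last column is boosted.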

Articles for IDF (in TF-IDF)

Without sublinear TF scaling

$ python cli.py --corpus wikipedia --suffix .txt --lowercase --mfw mfw.json --tfidf document --term stadt
Rank  Term              Cosine similarity
1     industrie         0.402932
2     staat             0.401412
3     musée             0.381124
4     zeitgenössische   0.321478
5     patienten         0.310556
6     statue            0.292731
7     ersetzt           0.220614
8     verlag            0.207787
9     klassische        0.187874
10    stewart           0.187797
11    holland           0.187020
12    schriftstellerin  0.182546
13    landkreis         0.182546
14    beginnen          0.174989
15    arabische         0.162990
16    begründete        0.161565
17    gesichert         0.159022
18    bus               0.156092
19    öl                0.143031
20    wende             0.142508

With sublinear TF scaling

$ python cli.py --corpus wikipedia --suffix .txt --lowercase --mfw mfw.json --tfidf global_transform --term stadt --sublinear_tf
Rank  Term              Cosine similarity
1     ort               0.476595
2     dorf              0.471737
3     bahnhof           0.457258
4     bezirk            0.442929
5     insel             0.439883
6     allee             0.424844
7     ortschaft         0.417953
8     anlage            0.417635
9     umbenannt         0.414289
10    ebenfalls         0.411907
11    burg              0.408800
12    gelegen           0.407507
13    stadtteils        0.403053
14    straße            0.401647
15    kernstadt         0.400973
16    orts              0.398161
17    ortslage          0.395894
18    ansässig          0.388783
19    hütte             0.383289
20    stadtgemeinde     0.378792

License: MIT License

