severinsimmler/word-embeddings

Installation

This project uses Pipenv, which automatically creates and manages a virtualenv for it. Install Pipenv with pip:

$ pip install pipenv

To install the project’s dependencies:

$ pipenv install

To spawn a shell inside the virtual environment:

$ pipenv shell

Or run a single command inside the virtual environment, for example:

$ pipenv run python cli.py --help

Getting started

$ python cli.py --help
usage: matrix-tool [-h] [--corpus CORPUS] [--suffix SUFFIX] [--lowercase]
                   [--mfw MFW] [--mfw_pkl MFW_PKL] [--n-mfw N_MFW]
                   [--window WINDOW] [--sentences] [--output OUTPUT]
                   [--stopwords STOPWORDS] [--term TERM] [--sublinear_tf]
                   [--tfidf TFIDF]

CLI tool to process a Wikipedia dump to a word-word matrix.

optional arguments:
  -h, --help            show this help message and exit
  --corpus CORPUS       Path to corpus directory.
  --suffix SUFFIX       Suffix of the text files.
  --lowercase           Use this parameter to lowercase all letters.
  --mfw MFW             Path to JSON file with most frequent words.
  --mfw_pkl MFW_PKL     Path to pickle file with most frequent words.
  --n-mfw N_MFW         Count tokens and use the n most frequent words.
  --window WINDOW       Context window size.
  --sentences           Use sentences instead of lines.
  --output OUTPUT       Path to output directory.
  --stopwords STOPWORDS
                        Optional external stopwords list.
  --term TERM           Get top 50 nearest neighbors for this term.
  --tfidf TFIDF         Use tf-idf weighting on the word-word matrix. Allowed
                        values are: document, global_transform.
  --sublinear_tf        Apply sublinear tf scaling, i.e. replace tf with 1 +
                        log(tf).
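
The sublinear scaling described for --sublinear_tf can be sketched in a few lines. This is only an illustration of the formula 1 + log(tf) from the help text, not the code used in cli.py; the function name sublinear_tf, the use of the natural logarithm, and the choice to leave zero counts at zero are assumptions.

```python
import math


def sublinear_tf(tf):
    """Return 1 + log(tf) for positive counts and 0.0 for zero counts.

    Assumption: natural logarithm; zero counts stay zero so that
    absent co-occurrences do not receive any weight.
    """
    return 1.0 + math.log(tf) if tf > 0 else 0.0


# Hypothetical raw co-occurrence counts, chosen only for illustration.
raw_counts = [0, 1, 10, 1000]
scaled = [sublinear_tf(tf) for tf in raw_counts]
```

The effect is that very frequent co-occurrences are dampened: a count that is a thousand times larger ends up with a weight that is only a few times larger.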

Example

These are the top 20 nearest neighbors for the term stadt (German for "city") under varying parameters. Word co-occurrence frequencies were counted by sliding a context window of size 2 over the entire corpus and were then TF-IDF weighted. Neighbors are ranked by the cosine similarity between the resulting word vectors. The corpus consists of 1,981,189 articles from the German Wikipedia.
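
As a rough illustration of the procedure described above, the following sketch counts co-occurrences within a sliding window and ranks neighbors by cosine similarity. It is not the implementation in cli.py: the function names, the toy corpus, and the reading of "window of 2" as two tokens on each side are assumptions, and the TF-IDF weighting step is left out to keep the example short.

```python
from collections import Counter, defaultdict
import math


def cooccurrence_counts(sentences, window=2):
    """Count, for every token, the tokens within `window` positions of it.

    Assumption: the window is symmetric (`window` tokens on each side).
    """
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as Counters."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbors(counts, term, n=20):
    """Rank all other terms by cosine similarity to `term`."""
    target = counts[term]
    scores = [
        (other, cosine(target, vector))
        for other, vector in counts.items()
        if other != term
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]


# Toy corpus made up for illustration, already lowercased and tokenized.
corpus = [
    "die stadt liegt am fluss".split(),
    "das dorf liegt am fluss".split(),
    "die stadt hat einen bahnhof".split(),
]
counts = cooccurrence_counts(corpus, window=2)
neighbors = nearest_neighbors(counts, "stadt", n=3)
```

Storing each row as a Counter keeps the vectors sparse, which matters for a vocabulary of realistic size; a real run over a Wikipedia dump would more likely use a sparse matrix library.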

Context for IDF (in TF-IDF)

$ python cli.py --corpus wikipedia --suffix .txt --lowercase --mfw mfw.json --tfidf global_transform --term stadt

Rank	Term	Cosine similarity
1	gemeinde	0.477483
2	marktes	0.425700
3	kreisstadt	0.423766
4	gelegen	0.381238
5	stadtteil	0.380550
6	ortes	0.380051
7	stadtteils	0.377508
8	ansässig	0.357279
9	ort	0.349270
10	ortslage	0.345554
11	dorf	0.344673
12	kernstadt	0.336407
13	ortschaft	0.335691
14	stadtgemeinde	0.332217
15	hof	0.330435
16	ortsteils	0.326245
17	berlins	0.324712
18	marktgemeinde	0.323925
19	landkreise	0.316252
20	wohnplatz	0.315749

Articles for IDF (in TF-IDF)

Without sublinear TF scaling

$ python cli.py --corpus wikipedia --suffix .txt --lowercase --mfw mfw.json --tfidf document --term stadt

Rank	Term	Cosine similarity
1	industrie	0.402932
2	staat	0.401412
3	musée	0.381124
4	zeitgenössische	0.321478
5	patienten	0.310556
6	statue	0.292731
7	ersetzt	0.220614
8	verlag	0.207787
9	klassische	0.187874
10	stewart	0.187797
11	holland	0.187020
12	schriftstellerin	0.182546
13	landkreis	0.182546
14	beginnen	0.174989
15	arabische	0.162990
16	begründete	0.161565
17	gesichert	0.159022
18	bus	0.156092
19	öl	0.143031
20	wende	0.142508

With sublinear TF scaling

$ python cli.py --corpus wikipedia --suffix .txt --lowercase --mfw mfw.json --tfidf global_transform --term stadt --sublinear_tf

Rank	Term	Cosine similarity
1	ort	0.476595
2	dorf	0.471737
3	bahnhof	0.457258
4	bezirk	0.442929
5	insel	0.439883
6	allee	0.424844
7	ortschaft	0.417953
8	anlage	0.417635
9	umbenannt	0.414289
10	ebenfalls	0.411907
11	burg	0.408800
12	gelegen	0.407507
13	stadtteils	0.403053
14	straße	0.401647
15	kernstadt	0.400973
16	orts	0.398161
17	ortslage	0.395894
18	ansässig	0.388783
19	hütte	0.383289
20	stadtgemeinde	0.378792
