Installation
Pipenv automatically creates and manages a virtualenv for this project. Install it as usual:
$ pip install pipenv
To install the project’s dependencies:
$ pipenv install
You can spawn a shell:
$ pipenv shell
or run a command installed into the virtual environment, for example:
$ pipenv run python cli.py --help
Getting started
$ python cli.py --help
usage: matrix-tool [-h] [--corpus CORPUS] [--suffix SUFFIX] [--lowercase]
[--mfw MFW] [--mfw_pkl MFW_PKL] [--n-mfw N_MFW]
[--window WINDOW] [--sentences] [--output OUTPUT]
[--stopwords STOPWORDS] [--term TERM] [--sublinear_tf]
[--tfidf TFIDF]
CLI tool to process a Wikipedia dump to a word-word matrix.
optional arguments:
-h, --help show this help message and exit
--corpus CORPUS Path to corpus directory.
--suffix SUFFIX Suffix of the text files.
--lowercase Use this parameter to lowercase all letters.
--mfw MFW Path to JSON file with most frequent words.
--mfw_pkl MFW_PKL Path to pickle file with most frequent words.
--n-mfw N_MFW Count tokens and use the n most frequent words.
--window WINDOW Context window size.
--sentences Use sentences instead of lines.
--output OUTPUT Path to output directory.
--stopwords STOPWORDS
Optional external stopwords list.
--term TERM Get top 50 nearest neighbors for this term.
--tfidf TFIDF Use tf-idf weighting on the word-word matrix. Allowed values are: document, global_transform.
--sublinear_tf Apply sublinear tf scaling, i.e. replace tf with 1 +
log(tf).
Example
The tables below list the top 20 nearest neighbors for the term stadt under varying parameters. Word co-occurrence frequencies were counted by sliding a context window of size 2 over the entire corpus and were then TF-IDF weighted. Nearest neighbors are ranked by the cosine similarity between the resulting word vectors. The corpus comprises 1,981,189 articles from the German Wikipedia.
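To make the counting and ranking steps concrete, here is a minimal sketch of sliding-window co-occurrence counting followed by cosine similarity. It is not the code in cli.py: the function names `cooccurrences` and `cosine` are hypothetical, the window is assumed to be symmetric (2 tokens to the left and 2 to the right), and TF-IDF weighting is left out for brevity.

```python
from collections import Counter, defaultdict
import math


def cooccurrences(tokens, window=2):
    """Count, for every target word, the words seen within `window` tokens of it."""
    counts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target position itself
                counts[target][tokens[j]] += 1
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (illustrative only, not taken from the Wikipedia dump).
tokens = "die stadt liegt am fluss das dorf liegt am fluss".split()
counts = cooccurrences(tokens, window=2)
similarity = cosine(counts["stadt"], counts["dorf"])
print(f"{similarity:.6f}")
```

In this toy corpus, stadt and dorf share the context words liegt and am, so their row vectors point in similar directions even though the two words never occur next to each other.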
Context for IDF (in TF-IDF)
$ python cli.py --corpus wikipedia --suffix .txt --lowercase --mfw mfw.json --tfidf global_transform --term stadt
Rank | Term | Cosine similarity |
---|---|---|
1 | gemeinde | 0.477483 |
2 | marktes | 0.425700 |
3 | kreisstadt | 0.423766 |
4 | gelegen | 0.381238 |
5 | stadtteil | 0.380550 |
6 | ortes | 0.380051 |
7 | stadtteils | 0.377508 |
8 | ansässig | 0.357279 |
9 | ort | 0.349270 |
10 | ortslage | 0.345554 |
11 | dorf | 0.344673 |
12 | kernstadt | 0.336407 |
13 | ortschaft | 0.335691 |
14 | stadtgemeinde | 0.332217 |
15 | hof | 0.330435 |
16 | ortsteils | 0.326245 |
17 | berlins | 0.324712 |
18 | marktgemeinde | 0.323925 |
19 | landkreise | 0.316252 |
20 | wohnplatz | 0.315749 |
Articles for IDF (in TF-IDF)
Without sublinear TF scaling
$ python cli.py --corpus wikipedia --suffix .txt --lowercase --mfw mfw.json --tfidf document --term stadt
Rank | Term | Cosine similarity |
---|---|---|
1 | industrie | 0.402932 |
2 | staat | 0.401412 |
3 | musée | 0.381124 |
4 | zeitgenössische | 0.321478 |
5 | patienten | 0.310556 |
6 | statue | 0.292731 |
7 | ersetzt | 0.220614 |
8 | verlag | 0.207787 |
9 | klassische | 0.187874 |
10 | stewart | 0.187797 |
11 | holland | 0.187020 |
12 | schriftstellerin | 0.182546 |
13 | landkreis | 0.182546 |
14 | beginnen | 0.174989 |
15 | arabische | 0.162990 |
16 | begründete | 0.161565 |
17 | gesichert | 0.159022 |
18 | bus | 0.156092 |
19 | öl | 0.143031 |
20 | wende | 0.142508 |
With sublinear TF scaling
$ python cli.py --corpus wikipedia --suffix .txt --lowercase --mfw mfw.json --tfidf global_transform --term stadt --sublinear_tf
Rank | Term | Cosine similarity |
---|---|---|
1 | ort | 0.476595 |
2 | dorf | 0.471737 |
3 | bahnhof | 0.457258 |
4 | bezirk | 0.442929 |
5 | insel | 0.439883 |
6 | allee | 0.424844 |
7 | ortschaft | 0.417953 |
8 | anlage | 0.417635 |
9 | umbenannt | 0.414289 |
10 | ebenfalls | 0.411907 |
11 | burg | 0.408800 |
12 | gelegen | 0.407507 |
13 | stadtteils | 0.403053 |
14 | straße | 0.401647 |
15 | kernstadt | 0.400973 |
16 | orts | 0.398161 |
17 | ortslage | 0.395894 |
18 | ansässig | 0.388783 |
19 | hütte | 0.383289 |
20 | stadtgemeinde | 0.378792 |
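
Sublinear TF scaling, as described in the help text, replaces a raw count tf with 1 + log(tf). The sketch below shows the effect on a few counts. The helper name `sublinear_tf` is hypothetical, and leaving zero counts at zero is an assumption, since the logarithm is undefined there.

```python
import math


def sublinear_tf(tf):
    """Return 1 + log(tf) for positive counts; zero counts stay zero (assumption)."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0


raw_counts = [0, 1, 10, 1000]
scaled = [sublinear_tf(tf) for tf in raw_counts]

for tf, value in zip(raw_counts, scaled):
    print(f"{tf:>5} -> {value:.3f}")
```

The scaling compresses large counts, so very frequent context words dominate the vectors less than they do with raw frequencies.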