# About

Implementation of Continuous Bag-of-Words (CBOW) in PyTorch.

Features:

- Train a CBOW model from scratch
- Log training to TensorBoard
- Visualize embeddings with t-SNE/PCA/UMAP using TensorBoard
- Implement a `most_similar` function with the same behavior and results as the `most_similar` function from the `Gensim` library
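The CBOW model predicts a center word from the average of its context-word embeddings. A minimal sketch of such a model in PyTorch is shown below; it is for illustration only and may differ from this repo's actual architecture (e.g. in vocabulary handling or weight initialization):

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Predict a center word from the mean of its context embeddings."""

    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.linear = nn.Linear(embed_dim, vocab_size)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, 2 * window) indices of the surrounding words
        mean = self.embeddings(context).mean(dim=1)  # (batch, embed_dim)
        return self.linear(mean)                     # (batch, vocab_size) logits

model = CBOW(vocab_size=100, embed_dim=16)
logits = model(torch.randint(0, 100, (4, 6)))  # batch of 4, window of 3 on each side
print(logits.shape)  # torch.Size([4, 100])
```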
# Installation

**Note:** This project was developed using Windows 11 with `python 3.10.0`.

Clone this repo, create a new environment (recommended), and install the dependencies:

```shell
pip install -r requirements.txt
```
# Usage

## Train a CBOW model

Download the WikiText-2 or WikiText-103 dataset and move it into the `dataset` folder.

Edit `config.toml` accordingly, then:

```shell
python main.py
```

To use TensorBoard (setting scalars to show all datapoints):

```shell
tensorboard --logdir .\experiment\wikitext-2\ --samples_per_plugin scalars=300000
```
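Training scalars reach TensorBoard through `torch.utils.tensorboard.SummaryWriter`. A minimal logging sketch is below; the `experiment/demo` directory and the `train/loss` tag are illustrative, not necessarily what `main.py` uses:

```python
from torch.utils.tensorboard import SummaryWriter

# Hypothetical log directory; this repo writes under experiment/<dataset>/
writer = SummaryWriter(log_dir="experiment/demo")
for step in range(5):
    loss = 1.0 / (step + 1)  # stand-in for a real training loss
    writer.add_scalar("train/loss", loss, step)
writer.close()
```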
## Compute analogies

The `compute_analogies.py` script computes the analogies and summarizes them using `word-test.v1.txt`, the original test set file from the word2vec paper.

To run it against the original trained word2vec model (it will download the model):

```shell
python compute_analogies.py word2vec-google-news-300
```

The results of this script are shown in the Results section below.

To run it against your own trained model, pass the path to a `txt` file containing the word vectors:

```shell
python compute_analogies.py <path-to-txt-word-vectors>
```
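An analogy query "a is to b as c is to ?" is typically answered by finding the word whose vector is closest (by cosine similarity) to `v(b) - v(a) + v(c)`, excluding the three query words. A toy sketch of that scoring rule, with illustrative names that are not the script's actual API:

```python
import numpy as np

def analogy(vectors: dict, a: str, b: str, c: str, topn: int = 1):
    """Return the words closest to v(b) - v(a) + v(c) by cosine similarity."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    scores = {}
    for word, vec in vectors.items():
        if word in (a, b, c):  # query words are excluded, as in Gensim
            continue
        scores[word] = float(vec @ target / np.linalg.norm(vec))
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# Toy 2-d vectors constructed so the classic analogy holds exactly
toy = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
}
print(analogy(toy, "man", "woman", "king"))  # ['queen']
```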
## Checking the `most_similar` implementation

`most_similar` is a function from the `Gensim` library which retrieves the top-N most similar embeddings. The `most_similar_implementation_check.py` script asserts that this project's `most_similar` implementation produces the same results as Gensim's.

To run it against the original trained word2vec model (it will download the model):

```shell
python most_similar_implementation_check.py word2vec-google-news-300
```

Or pass the path to a `txt` file containing the word vectors:

```shell
python most_similar_implementation_check.py <path-to-txt-word-vectors>
```
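Gensim's `most_similar` ranks words by cosine similarity over unit-normalized vectors, excluding the query word itself. A minimal NumPy reimplementation of that idea (illustrative, not the code this repo ships):

```python
import numpy as np

def most_similar(vectors: np.ndarray, words: list, query: str, topn: int = 3):
    """Top-N nearest words to `query` by cosine similarity (query excluded)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = unit[words.index(query)]
    sims = unit @ q                      # cosine similarity to every word
    order = np.argsort(-sims)            # highest similarity first
    ranked = [(words[i], float(sims[i])) for i in order if words[i] != query]
    return ranked[:topn]

rng = np.random.default_rng(0)
words = ["cat", "dog", "car", "truck"]
vectors = rng.normal(size=(4, 8))
print(most_similar(vectors, words, "cat", topn=2))
```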
# Results

## word2vec-google-news-300

Analogy Class | OOV | not OOV | Top1 | Top5 | Total |
---|---|---|---|---|---|
capital-common-countries | 0 (0.00%) | 506 (100.00%) | 421 (83.20%) | 482 (95.26%) | 506 |
capital-world | 0 (0.00%) | 4524 (100.00%) | 3580 (79.13%) | 4124 (91.16%) | 4524 |
currency | 0 (0.00%) | 866 (100.00%) | 304 (35.10%) | 431 (49.77%) | 866 |
city-in-state | 0 (0.00%) | 2467 (100.00%) | 1749 (70.90%) | 2127 (86.22%) | 2467 |
family | 0 (0.00%) | 506 (100.00%) | 428 (84.58%) | 482 (95.26%) | 506 |
gram1-adjective-to-adverb | 0 (0.00%) | 992 (100.00%) | 283 (28.53%) | 509 (51.31%) | 992 |
gram2-opposite | 0 (0.00%) | 812 (100.00%) | 347 (42.73%) | 457 (56.28%) | 812 |
gram3-comparative | 0 (0.00%) | 1332 (100.00%) | 1210 (90.84%) | 1295 (97.22%) | 1332 |
gram4-superlative | 0 (0.00%) | 1122 (100.00%) | 980 (87.34%) | 1102 (98.22%) | 1122 |
gram5-present-participle | 0 (0.00%) | 1056 (100.00%) | 825 (78.12%) | 1004 (95.08%) | 1056 |
gram6-nationality-adjective | 0 (0.00%) | 1599 (100.00%) | 1438 (89.93%) | 1527 (95.50%) | 1599 |
gram7-past-tense | 0 (0.00%) | 1560 (100.00%) | 1029 (65.96%) | 1459 (93.53%) | 1560 |
gram8-plural | 0 (0.00%) | 1332 (100.00%) | 1197 (89.86%) | 1275 (95.72%) | 1332 |
gram9-plural-verbs | 0 (0.00%) | 870 (100.00%) | 591 (67.93%) | 785 (90.23%) | 870 |
Total | 0 (0.00%) | 19544 (100.00%) | 14382 (73.59%) | 17059 (87.29%) | 19544 |
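The Top1 and Top5 percentages are the fraction of non-OOV analogies whose answer appears in the top 1 or top 5 retrieved words. The totals row can be checked directly:

```python
# Totals row of the table above
top1, top5, total = 14382, 17059, 19544
print(f"Top1: {top1 / total:.2%}")  # 73.59%
print(f"Top5: {top5 / total:.2%}")  # 87.29%
```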