# About

Implementation of Continuous Bag-of-Words (CBOW) in PyTorch.

Features:

- Train a CBOW model from scratch
- Log training to TensorBoard
- Visualize embeddings with t-SNE/PCA/UMAP using TensorBoard
- Implement a `most_similar` function with the same behavior and results as the `most_similar` function from the `Gensim` library
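The CBOW model predicts a center word from the average of its context-word embeddings. A minimal sketch of such a model in PyTorch is shown below; it is for illustration only and may differ from this repo's actual architecture (e.g. in vocabulary handling or weight initialization):

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Predict a center word from the mean of its context embeddings."""

    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.linear = nn.Linear(embed_dim, vocab_size)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, 2 * window) indices of the surrounding words
        mean = self.embeddings(context).mean(dim=1)  # (batch, embed_dim)
        return self.linear(mean)                     # (batch, vocab_size) logits

model = CBOW(vocab_size=100, embed_dim=16)
logits = model(torch.randint(0, 100, (4, 6)))  # batch of 4, window of 3 on each side
print(logits.shape)  # torch.Size([4, 100])
```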
# Installation

**Note:** This project was developed using Windows 11 with `python 3.10.0`.

Clone this repo, create a new environment (recommended), and install the dependencies:

```shell
pip install -r requirements.txt
```
# Usage

## Train a CBOW model

Download the WikiText-2 or WikiText-103 dataset and move it into the `dataset` folder.

Edit `config.toml` accordingly, then:

```shell
python main.py
```

To use TensorBoard (setting scalars to show all datapoints):

```shell
tensorboard --logdir .\experiment\wikitext-2\ --samples_per_plugin scalars=300000
```
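Training scalars reach TensorBoard through `torch.utils.tensorboard.SummaryWriter`. A minimal logging sketch is below; the `experiment/demo` directory and the `train/loss` tag are illustrative, not necessarily what `main.py` uses:

```python
from torch.utils.tensorboard import SummaryWriter

# Hypothetical log directory; this repo writes under experiment/<dataset>/
writer = SummaryWriter(log_dir="experiment/demo")
for step in range(5):
    loss = 1.0 / (step + 1)  # stand-in for a real training loss
    writer.add_scalar("train/loss", loss, step)
writer.close()
```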
## Compute analogies

The `compute_analogies.py` script computes the analogies and summarizes them using `word-test.v1.txt`, the original test set file from the word2vec paper.

To run it against the original trained word2vec model (it will download the model):

```shell
python compute_analogies.py word2vec-google-news-300
```

The results of this script are shown in the Results section below.

To run it against your own trained model, pass the path to a `txt` file containing the word vectors:

```shell
python compute_analogies.py <path-to-txt-word-vectors>
```
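An analogy query "a is to b as c is to ?" is typically answered by finding the word whose vector is closest (by cosine similarity) to `v(b) - v(a) + v(c)`, excluding the three query words. A toy sketch of that scoring rule, with illustrative names that are not the script's actual API:

```python
import numpy as np

def analogy(vectors: dict, a: str, b: str, c: str, topn: int = 1):
    """Return the words closest to v(b) - v(a) + v(c) by cosine similarity."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    scores = {}
    for word, vec in vectors.items():
        if word in (a, b, c):  # query words are excluded, as in Gensim
            continue
        scores[word] = float(vec @ target / np.linalg.norm(vec))
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# Toy 2-d vectors constructed so the classic analogy holds exactly
toy = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
}
print(analogy(toy, "man", "woman", "king"))  # ['queen']
```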
## Checking the `most_similar` implementation

`most_similar` is a function from the `Gensim` library which retrieves the top-N most similar embeddings. The `most_similar_implementation_check.py` script asserts that this project's `most_similar` implementation produces the same results as Gensim's.

To run it against the original trained word2vec model (it will download the model):

```shell
python most_similar_implementation_check.py word2vec-google-news-300
```

Or pass the path to a `txt` file containing the word vectors:

```shell
python most_similar_implementation_check.py <path-to-txt-word-vectors>
```
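Gensim's `most_similar` ranks words by cosine similarity over unit-normalized vectors, excluding the query word itself. A minimal NumPy reimplementation of that idea (illustrative, not the code this repo ships):

```python
import numpy as np

def most_similar(vectors: np.ndarray, words: list, query: str, topn: int = 3):
    """Top-N nearest words to `query` by cosine similarity (query excluded)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = unit[words.index(query)]
    sims = unit @ q                      # cosine similarity to every word
    order = np.argsort(-sims)            # highest similarity first
    ranked = [(words[i], float(sims[i])) for i in order if words[i] != query]
    return ranked[:topn]

rng = np.random.default_rng(0)
words = ["cat", "dog", "car", "truck"]
vectors = rng.normal(size=(4, 8))
print(most_similar(vectors, words, "cat", topn=2))
```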
# Results

## word2vec-google-news-300

Analogy Class | OOV | not OOV | Top1 | Top5 | Total |
---|---|---|---|---|---|
capital-common-countries | 0 (0.00%) | 506 (100.00%) | 421 (83.20%) | 482 (95.26%) | 506 |
capital-world | 0 (0.00%) | 4524 (100.00%) | 3580 (79.13%) | 4124 (91.16%) | 4524 |
currency | 0 (0.00%) | 866 (100.00%) | 304 (35.10%) | 431 (49.77%) | 866 |
city-in-state | 0 (0.00%) | 2467 (100.00%) | 1749 (70.90%) | 2127 (86.22%) | 2467 |
family | 0 (0.00%) | 506 (100.00%) | 428 (84.58%) | 482 (95.26%) | 506 |
gram1-adjective-to-adverb | 0 (0.00%) | 992 (100.00%) | 283 (28.53%) | 509 (51.31%) | 992 |
gram2-opposite | 0 (0.00%) | 812 (100.00%) | 347 (42.73%) | 457 (56.28%) | 812 |
gram3-comparative | 0 (0.00%) | 1332 (100.00%) | 1210 (90.84%) | 1295 (97.22%) | 1332 |
gram4-superlative | 0 (0.00%) | 1122 (100.00%) | 980 (87.34%) | 1102 (98.22%) | 1122 |
gram5-present-participle | 0 (0.00%) | 1056 (100.00%) | 825 (78.12%) | 1004 (95.08%) | 1056 |
gram6-nationality-adjective | 0 (0.00%) | 1599 (100.00%) | 1438 (89.93%) | 1527 (95.50%) | 1599 |
gram7-past-tense | 0 (0.00%) | 1560 (100.00%) | 1029 (65.96%) | 1459 (93.53%) | 1560 |
gram8-plural | 0 (0.00%) | 1332 (100.00%) | 1197 (89.86%) | 1275 (95.72%) | 1332 |
gram9-plural-verbs | 0 (0.00%) | 870 (100.00%) | 591 (67.93%) | 785 (90.23%) | 870 |
Total | 0 (0.00%) | 19544 (100.00%) | 14382 (73.59%) | 17059 (87.29%) | 19544 |
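The Top1 and Top5 percentages are the fraction of non-OOV analogies whose answer appears in the top 1 or top 5 retrieved words. The totals row can be checked directly:

```python
# Totals row of the table above
top1, top5, total = 14382, 17059, 19544
print(f"Top1: {top1 / total:.2%}")  # 73.59%
print(f"Top5: {top5 / total:.2%}")  # 87.29%
```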