viniciusarruda / word2vec

Yet Another Word2Vec Implementation

Image source: The Illustrated Word2vec

About

An implementation of Continuous Bag-of-Words (CBOW) in PyTorch.

Features:

  • Train a CBOW model from scratch
  • Log training metrics to TensorBoard
  • Visualize embeddings with t-SNE/PCA/UMAP in TensorBoard
  • Implement a most_similar function that matches the behavior and results of Gensim's most_similar
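
The core of a CBOW model is small: the embeddings of the context words are averaged and fed to a linear layer that predicts the center word. A minimal sketch of that idea (the sizes below are illustrative, not the values from config.toml):

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Predict the center word from the average of its context embeddings."""

    def __init__(self, vocab_size: int, embedding_dim: int):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, 2 * window) indices of the surrounding words
        averaged = self.embeddings(context).mean(dim=1)
        return self.linear(averaged)  # logits over the vocabulary

# Illustrative sizes only; the real values come from config.toml.
model = CBOW(vocab_size=5000, embedding_dim=100)
batch = torch.randint(0, 5000, (8, 4))  # 8 examples, window size 2
logits = model(batch)
print(logits.shape)  # torch.Size([8, 5000])
```

Training then minimizes cross-entropy between these logits and the true center-word index.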

Installation

Note:

This project was developed on Windows 11 with Python 3.10.0.

Clone this repo, create a new environment (recommended), and install the dependencies:

pip install -r requirements.txt

Usage

Train a CBOW model

Download the WikiText-2 or WikiText-103 dataset here and move it into the dataset folder.

Edit config.toml accordingly, then run:

python main.py

To launch TensorBoard (setting scalars to show all data points):

tensorboard --logdir .\experiment\wikitext-2\ --samples_per_plugin scalars=300000

Compute analogies

The compute_analogies.py script computes the analogies and summarizes the results using word-test.v1.txt, the original test set file from the word2vec paper.
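
Each line of the test set has the form a : b :: c : d, and the standard evaluation predicts d by vector arithmetic over unit-normalized vectors (the 3CosAdd method), excluding the three query words from the candidates. A self-contained sketch with toy vectors (an illustration, not the project's actual code):

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    return v / np.linalg.norm(v)

def solve_analogy(vectors, a, b, c, topn=1):
    """Answer 'a is to b as c is to ?' with the 3CosAdd method."""
    words = list(vectors)
    matrix = np.stack([unit(vectors[w]) for w in words])
    target = unit(unit(vectors[b]) - unit(vectors[a]) + unit(vectors[c]))
    sims = matrix @ target
    ranked = [words[i] for i in np.argsort(-sims)]
    # The original evaluation never returns the three input words.
    return [str(w) for w in ranked if w not in {a, b, c}][:topn]

# Toy 2-D vectors where king - man + woman lands nearest to queen.
toy = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([0.2, 1.4]),
}
print(solve_analogy(toy, "man", "king", "woman"))  # ['queen']
```

An analogy counts as Top1 if d is the first prediction, and Top5 if it appears among the first five.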

To run with the original pretrained word2vec model (the script will download it):

python compute_analogies.py word2vec-google-news-300

The results from this script are shown in the Results section below.

To run with your own trained word2vec model, pass the path to a text file containing the word vectors:

python compute_analogies.py <path-to-txt-word-vectors>

Checking most_similar implementation

most_similar is a Gensim function that retrieves the top-N most similar embeddings. The most_similar_implementation_check.py script asserts that this project's most_similar implementation produces the same results as Gensim's.
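
Gensim's most_similar ranks the vocabulary by cosine similarity against the unit-normalized query vector and excludes the query word itself. A minimal re-implementation sketch of that behavior (toy vectors for illustration):

```python
import numpy as np

def most_similar(vectors, word, topn=10):
    """Top-N words by cosine similarity, query word excluded,
    mirroring the behavior of Gensim's KeyedVectors.most_similar."""
    words = list(vectors)
    matrix = np.stack([v / np.linalg.norm(v) for v in vectors.values()])
    query = vectors[word] / np.linalg.norm(vectors[word])
    sims = matrix @ query
    ranked = sorted(zip(words, sims), key=lambda pair: -pair[1])
    return [(w, float(s)) for w, s in ranked if w != word][:topn]

# Toy vectors: 'dog' points roughly the same way as 'cat', 'car' does not.
toy = {
    "cat": np.array([1.0, 0.1]),
    "dog": np.array([0.9, 0.2]),
    "car": np.array([0.1, 1.0]),
}
print(most_similar(toy, "cat", topn=2))  # 'dog' ranks first
```

The check script compares rankings like this one against Gensim's output for equality.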

To run with the original pretrained word2vec model (the script will download it):

python most_similar_implementation_check.py word2vec-google-news-300

Or pass the path to a text file containing the word vectors:

python most_similar_implementation_check.py <path-to-txt-word-vectors>
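
Word vectors in text form conventionally follow the word2vec text format: an optional header line with the vocabulary size and dimensionality, then one line per word holding the word followed by its vector components. A hedged sketch of a loader for that format (an illustration, not the script's actual parser):

```python
import numpy as np

def load_word_vectors(path):
    """Load vectors from the word2vec text format: an optional
    'count dimension' header, then 'word v1 v2 ...' per line."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        first = f.readline().split()
        if not (len(first) == 2 and all(t.isdigit() for t in first)):
            # No header: the first line is already a vector line.
            vectors[first[0]] = np.array(first[1:], dtype=np.float32)
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors
```

Loaded this way, the result is a plain dict mapping each word to a float32 NumPy vector.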

Results

word2vec-google-news-300

| Analogy Class | OOV | not OOV | Top1 | Top5 | Total |
| --- | --- | --- | --- | --- | --- |
| capital-common-countries | 0 (0.00%) | 506 (100.00%) | 421 (83.20%) | 482 (95.26%) | 506 |
| capital-world | 0 (0.00%) | 4524 (100.00%) | 3580 (79.13%) | 4124 (91.16%) | 4524 |
| currency | 0 (0.00%) | 866 (100.00%) | 304 (35.10%) | 431 (49.77%) | 866 |
| city-in-state | 0 (0.00%) | 2467 (100.00%) | 1749 (70.90%) | 2127 (86.22%) | 2467 |
| family | 0 (0.00%) | 506 (100.00%) | 428 (84.58%) | 482 (95.26%) | 506 |
| gram1-adjective-to-adverb | 0 (0.00%) | 992 (100.00%) | 283 (28.53%) | 509 (51.31%) | 992 |
| gram2-opposite | 0 (0.00%) | 812 (100.00%) | 347 (42.73%) | 457 (56.28%) | 812 |
| gram3-comparative | 0 (0.00%) | 1332 (100.00%) | 1210 (90.84%) | 1295 (97.22%) | 1332 |
| gram4-superlative | 0 (0.00%) | 1122 (100.00%) | 980 (87.34%) | 1102 (98.22%) | 1122 |
| gram5-present-participle | 0 (0.00%) | 1056 (100.00%) | 825 (78.12%) | 1004 (95.08%) | 1056 |
| gram6-nationality-adjective | 0 (0.00%) | 1599 (100.00%) | 1438 (89.93%) | 1527 (95.50%) | 1599 |
| gram7-past-tense | 0 (0.00%) | 1560 (100.00%) | 1029 (65.96%) | 1459 (93.53%) | 1560 |
| gram8-plural | 0 (0.00%) | 1332 (100.00%) | 1197 (89.86%) | 1275 (95.72%) | 1332 |
| gram9-plural-verbs | 0 (0.00%) | 870 (100.00%) | 591 (67.93%) | 785 (90.23%) | 870 |
| Total | 0 (0.00%) | 19544 (100.00%) | 14382 (73.59%) | 17059 (87.29%) | 19544 |

Resources

  • Links regarding the most_similar and analogy computation: 1, 2, 3, 4.
  • TensorBoard: 1, 2, 3.
