wenhaofang / Word2vec

Some demo word2vec models implemented with pytorch, including Continuous-Bag-Of-Words / Skip-Gram with Hierarchical-Softmax / Negative-Sampling.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Word To Vector Demo

This repository includes some demo word2vec models.

Note: The project refers to 动手学深度学习

Datasets:

  • dataset1: text8
  • dataset2: ptb

Models:

  • model1 (DONE): Continuous-Bag-Of-Words with Hierarchical-Softmax
  • model2 (DONE): Continuous-Bag-Of-Words with Negative-Sampling
  • model3 (DONE): Skip-Gram with Hierarchical-Softmax
  • model4 (DONE): Skip-Gram with Negative-Sampling
  • model5 (TODO): FastText
  • model6 (TODO): Glove

Data Process

# download dataset text8
PYTHONPATH=. python dataprocess/process.py --dataset_name text8
# download dataset ptb
PYTHONPATH=. python dataprocess/process.py --dataset_name ptb

Unit Test

  • for loader
# CBOW_HS_Loader
PYTHONPATH=. python loaders/CBOW_HS_Loader.py
# CBOW_HS_Loader: load data from cache
PYTHONPATH=. python loaders/CBOW_HS_Loader.py --use_cache

# CBOW_NS_Loader
PYTHONPATH=. python loaders/CBOW_NS_Loader.py
# CBOW_NS_Loader: load data from cache
PYTHONPATH=. python loaders/CBOW_NS_Loader.py --use_cache

# SG_HS_Loader
PYTHONPATH=. python loaders/SG_HS_Loader.py
# SG_HS_Loader: load data from cache
PYTHONPATH=. python loaders/SG_HS_Loader.py --use_cache

# SG_NS_Loader
PYTHONPATH=. python loaders/SG_NS_Loader.py
# SG_NS_Loader: load data from cache
PYTHONPATH=. python loaders/SG_NS_Loader.py --use_cache
  • for module
# CBOW_HS_Module
PYTHONPATH=. python modules/CBOW_HS_Module.py

# CBOW_NS_Module
PYTHONPATH=. python modules/CBOW_NS_Module.py

# SG_HS_Module
PYTHONPATH=. python modules/SG_HS_Module.py

# SG_NS_Module
PYTHONPATH=. python modules/SG_NS_Module.py

Main Process

  • for train
PYTHONPATH=. python main.py --mode train
  • for predict
PYTHONPATH=. python main.py --mode predict

You can change the config either in the command line or in the file utils/parser.py

Here are the examples for each module:

# CBOW_HS model
PYTHONPATH=. python main.py --module_type CBOW_HS --dataset_name text8
PYTHONPATH=. python main.py --module_type CBOW_HS --dataset_name ptb
# CBOW_NS model
PYTHONPATH=. python main.py --module_type CBOW_NS --dataset_name text8
PYTHONPATH=. python main.py --module_type CBOW_NS --dataset_name ptb
# SG_HS model
PYTHONPATH=. python main.py --module_type SG_HS --dataset_name text8
PYTHONPATH=. python main.py --module_type SG_HS --dataset_name ptb
# SG_NS model
PYTHONPATH=. python main.py --module_type SG_NS --dataset_name text8
PYTHONPATH=. python main.py --module_type SG_NS --dataset_name ptb

About

Some demo word2vec models implemented with pytorch, including Continuous-Bag-Of-Words / Skip-Gram with Hierarchical-Softmax / Negative-Sampling.


Languages

Language:Python 100.0%