jimdi / word2vec

A word2vec port for Windows with CMake support

word2vec
========

- The makefile and some source files have been modified for Windows compilation
- The memory patch for word2vec has been applied: https://code.google.com/p/word2vec/issues/detail?id=2
- CMake support has been added

I used the original word2vec, forked from Google Code to GitHub, together with the word2vec Windows patches from https://github.com/zhangyafeikimi/word2vec-win32

Tools for computing distributed representation of words
-------------------------------------------------------

We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram model (SG), as well as several demo scripts.

Given a text corpus, the word2vec tool learns a vector for every word in the vocabulary using the Continuous
Bag-of-Words or the Skip-Gram neural network architectures. The user should specify the following:
 - desired vector dimensionality
 - the size of the context window for either the Skip-Gram or the Continuous Bag-of-Words model
 - training algorithm: hierarchical softmax and / or negative sampling
 - threshold for downsampling the frequent words 
 - number of threads to use
 - the format of the output word vector file (text or binary)

Usually, the other hyper-parameters such as the learning rate do not need to be tuned for different training sets. 
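
The options listed above correspond to command-line flags of the `word2vec` binary. The following is a minimal sketch of a training run, assuming this port keeps the flag names of the upstream tool (`-train`, `-output`, `-cbow`, `-size`, `-window`, `-negative`, `-hs`, `-sample`, `-threads`, `-binary`). The corpus name `text8`, the binary path `./word2vec`, the `W2V_*` variable names, and all values are illustrative assumptions; on Windows the executable name and location depend on how the project was built.

```shell
# Illustrative settings, one per option from the list above.
CORPUS="text8"        # assumed name of a plain-text training corpus
OUTPUT="vectors.bin"  # file the trained word vectors are written to
SIZE=200              # desired vector dimensionality
WINDOW=8              # size of the context window
CBOW=1                # 1 = Continuous Bag-of-Words, 0 = Skip-Gram
HS=0                  # 1 = train with hierarchical softmax
NEGATIVE=25           # number of negative samples (0 disables negative sampling)
SAMPLE=1e-4           # threshold for downsampling frequent words
THREADS=4             # number of threads to use
BINARY=1              # 1 = binary output file, 0 = text output file

# Assemble the argument list from the settings.
W2V_ARGS="-train $CORPUS -output $OUTPUT -cbow $CBOW -size $SIZE -window $WINDOW -negative $NEGATIVE -hs $HS -sample $SAMPLE -threads $THREADS -binary $BINARY"

# Hypothetical binary location; override with W2V_BIN=... if it differs.
W2V_BIN="${W2V_BIN:-./word2vec}"

if [ -x "$W2V_BIN" ] && [ -f "$CORPUS" ]; then
  # Word splitting of $W2V_ARGS is intentional here.
  "$W2V_BIN" $W2V_ARGS
else
  # Binary or corpus missing: only show the command that would be run.
  echo "would run: $W2V_BIN $W2V_ARGS"
fi
```

Setting `HS=1` together with `NEGATIVE=0` would select hierarchical softmax alone; both can also be enabled at once, matching the "and / or" in the list.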

The script demo-word.sh downloads a small (100 MB) text corpus from the web and trains a small word vector model. Once training
has finished, the user can interactively explore the similarity of the words.
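
To make "similarity of the words" concrete, here is a small sketch that computes the cosine similarity between two words, assuming vectors saved in the text output format (a header line with vocabulary size and dimensionality, then one word per line followed by its components). The toy vectors, the temporary file, and the helper name `cosine` are made up for illustration and are not part of the repository.

```shell
# Toy vectors in word2vec text format: "vocab_size dim" header, then one word per line.
VEC_FILE="$(mktemp)"
cat > "$VEC_FILE" <<'EOF'
3 2
king 0.6 0.8
queen 0.8 0.6
apple 0.0 1.0
EOF

# cosine WORD_A WORD_B FILE
# Prints the cosine similarity of the two words' vectors with three decimals.
cosine() {
  awk -v a="$1" -v b="$2" '
    NR == 1 { next }    # skip the header line
    $1 == a { for (i = 2; i <= NF; i++) va[i] = $i; na = NF }
    $1 == b { for (i = 2; i <= NF; i++) vb[i] = $i; nb = NF }
    END {
      if (!na || !nb) { print "word not found" > "/dev/stderr"; exit 1 }
      for (i = 2; i <= na; i++) {
        dot += va[i] * vb[i]   # dot product
        la  += va[i] * va[i]   # squared length of the first vector
        lb  += vb[i] * vb[i]   # squared length of the second vector
      }
      printf "%.3f\n", dot / (sqrt(la) * sqrt(lb))
    }' "$3"
}

cosine king queen "$VEC_FILE"
cosine king apple "$VEC_FILE"
```

Values closer to 1 indicate that the two vectors point in nearly the same direction. For a real model, train with text output (`-binary 0` in the upstream tool) and pass the resulting file instead of the toy one.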

More information about the scripts is provided at https://code.google.com/p/word2vec/

License: Apache License 2.0