heshizhu/word2vec_torch

Word2Vec in Torch 
Yoon Kim
yhk255@nyu.edu

Implements only the skip-gram architecture with negative sampling. See https://code.google.com/p/word2vec/ for more details.
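
As an illustration of what "skip-gram" means, here is a minimal Python sketch (not this repository's Lua code) that pairs each word with the neighbours inside a window. It assumes a fixed window size for simplicity; the name skipgram_pairs is hypothetical and does not come from main.lua.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every word and its neighbours.

    Assumption: a fixed window on both sides of the center word. A real
    trainer may shrink the window per position.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


if __name__ == "__main__":
    for pair in skipgram_pairs(["the", "cat", "sat"], window=1):
        print(pair)
```

With negative sampling, each such pair is treated as a positive example and is contrasted against a few randomly drawn words that act as negatives.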

Note: This is considerably slower than the word2vec toolkit and gensim implementations.

The input is a text file in which each line represents one sentence (see corpus.txt for an example).
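
A rough Python sketch of reading a corpus in this format, offered only as an illustration. It assumes whitespace tokenization and UTF-8 text, neither of which is confirmed by this README, and the helper names iter_sentences and build_vocab are hypothetical. Reading lazily from disk is loosely analogous to the -stream 1 option described below, and the frequency cutoff mirrors -minfreq.

```python
from collections import Counter


def iter_sentences(path):
    """Yield one token list per non-empty line, reading lazily from disk."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()  # assumption: whitespace-separated tokens
            if tokens:
                yield tokens


def build_vocab(sentences, minfreq):
    """Count words and keep those that occur at least minfreq times."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {word: c for word, c in counts.items() if c >= minfreq}
```

Calling list(iter_sentences(path)) instead would hold the whole corpus in memory, which trades RAM for not re-reading the file on every epoch.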

Arguments are mostly self-explanatory (see main.lua for default arguments)

-corpus : text file with the corpus
-window : max window size
-dim : dimensionality of word embeddings
-alpha : exponent used to smooth the unigram distribution from which negative samples are drawn
-table_size : unigram table size. If you have plenty of RAM, increase this to 10^8
-neg_samples : number of negative samples for each valid word-context pair
-minfreq : minimum frequency for a word to be included in the vocabulary
-lr : starting learning rate
-min_lr : minimum learning rate; lr decays linearly to this value
-epochs : number of epochs to run
-stream : whether to stream text data from disk or store it in memory (1 = stream, 0 = store in memory)
-gpu : whether to use the GPU (1 = use GPU, 0 = do not)
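
To show how -alpha, -table_size, -neg_samples, -lr and -min_lr might fit together, here is a hedged Python sketch of the unigram-table technique used by the original word2vec toolkit, plus a linear learning-rate schedule. It is not a port of main.lua. All function names are hypothetical, and driving the decay by "fraction of training completed" is an assumption about how progress is measured.

```python
import random


def build_unigram_table(counts, alpha, table_size):
    """Fill a table so word i occupies about counts[i]**alpha / Z of the slots."""
    weights = [c ** alpha for c in counts]
    total = sum(weights)
    table = []
    cumulative = 0.0
    for idx, w in enumerate(weights):
        cumulative += w / total
        # Append this word until its cumulative share of the table is reached.
        while len(table) < table_size and len(table) / table_size < cumulative:
            table.append(idx)
    # Guard against floating-point shortfall: pad with the last word id.
    while len(table) < table_size:
        table.append(len(counts) - 1)
    return table


def sample_negatives(table, target, k, rng):
    """Draw k word ids from the table, skipping the target word itself.

    Assumption: the table contains at least one id other than target,
    otherwise this loop would not terminate.
    """
    negatives = []
    while len(negatives) < k:
        cand = table[rng.randrange(len(table))]
        if cand != target:
            negatives.append(cand)
    return negatives


def linear_lr(lr, min_lr, progress):
    """Interpolate from lr (progress = 0) down to min_lr (progress = 1)."""
    progress = min(max(progress, 0.0), 1.0)
    return lr + (min_lr - lr) * progress


if __name__ == "__main__":
    table = build_unigram_table([100, 10, 1], alpha=0.75, table_size=1000)
    print(sample_negatives(table, target=0, k=5, rng=random.Random(0)))
    print(linear_lr(0.025, 0.001, 0.5))
```

A larger table approximates the smoothed distribution more finely, which is why a bigger -table_size costs more RAM. An exponent below 1 makes rare words appear among the negatives more often than their raw frequency would suggest; the original word2vec toolkit uses 0.75.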

For example:

CPU:
th main.lua -corpus corpus.txt -window 3 -dim 100 -minfreq 10 -stream 1 -gpu 0 

GPU:
th main.lua -corpus corpus.txt -window 3 -dim 100 -minfreq 10 -stream 0 -gpu 1