njuzrs / task1-word2vec

python, numpy

Task1: word2vec

Tags (space-separated): python numpy

1. Dataset

The PTB dataset is used, available at http://www.fit.vutbr.cz/~imikolov/rnnlm/. Paper: Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality[C]//Advances in neural information processing systems. 2013: 3111-3119.

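2. Main features implemented

Two training modes: CBOW and skip-gram
Two output structures: hierarchical softmax and negative sampling
Subsampling of frequent words
phrase2vec: converts the training data into phrase-based input
Dimensionality-reduction visualization with TensorBoard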
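
As an illustration of two of the features above, here is a minimal NumPy sketch of frequent-word subsampling and of the noise distribution used for negative sampling, following the formulas in the cited paper. The function names, the toy counts, and the mapping of subsample_rate to the threshold `t` are assumptions made for this example; word2vec.py may compute these differently.

```python
import numpy as np

def keep_probabilities(counts, t=5e-3):
    """Probability of keeping each word under subsampling.

    Uses the paper's rule P_discard(w) = 1 - sqrt(t / f(w)), where f(w) is
    the relative frequency of w. Words with f(w) <= t are always kept.
    Assumes every count is positive.
    """
    counts = np.asarray(counts, dtype=np.float64)
    freq = counts / counts.sum()
    return np.minimum(1.0, np.sqrt(t / freq))

def noise_distribution(counts, power=0.75):
    """Unigram distribution raised to `power`, for drawing negative samples."""
    weights = np.asarray(counts, dtype=np.float64) ** power
    return weights / weights.sum()

rng = np.random.default_rng(0)
counts = np.array([900, 90, 9, 1])               # toy counts for word ids 0..3
keep = keep_probabilities(counts)
noise = noise_distribution(counts)

tokens = rng.integers(0, len(counts), size=20)   # toy stream of word ids
kept = tokens[rng.random(tokens.size) < keep[tokens]]   # subsampled stream
negatives = rng.choice(len(counts), size=5, p=noise)    # 5 negative samples
```

Raising the counts to the power 0.75 flattens the distribution, so rare words are drawn as negatives more often than their raw frequency would suggest.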

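3. Implementation details

Implemented in Python with NumPy
Sigmoid values are precomputed to speed up training
Words whose frequency is below min_count are removed
Training runs for multiple epochs, and the training set is shuffled at random in each epoch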
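
The details above could be sketched roughly as follows. This is not the code from word2vec.py: the table size, the clipping range MAX_EXP, and the helper names fast_sigmoid, build_vocab, and epoch_orders are assumptions chosen for illustration.

```python
import numpy as np
from collections import Counter

MAX_EXP = 6          # assumed clipping range: inputs outside [-6, 6] saturate
TABLE_SIZE = 1000    # assumed number of precomputed entries

# Compute sigmoid once on an evenly spaced grid, before training starts.
_grid = np.linspace(-MAX_EXP, MAX_EXP, TABLE_SIZE)
SIGMOID_TABLE = 1.0 / (1.0 + np.exp(-_grid))

def fast_sigmoid(x):
    """Approximate sigmoid(x) with a table lookup instead of calling exp."""
    if x >= MAX_EXP:
        return 1.0
    if x <= -MAX_EXP:
        return 0.0
    idx = int((x + MAX_EXP) / (2 * MAX_EXP) * (TABLE_SIZE - 1))
    return SIGMOID_TABLE[idx]

def build_vocab(tokens, min_count=5):
    """Map each word seen at least min_count times to an integer id."""
    counts = Counter(tokens)
    kept = [w for w, c in counts.most_common() if c >= min_count]
    return {w: i for i, w in enumerate(kept)}

def epoch_orders(num_sentences, epochs, seed=0):
    """Yield a fresh random ordering of sentence indices for every epoch."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        yield rng.permutation(num_sentences)
```

The lookup trades a small approximation error for speed, since the inner training loop evaluates the sigmoid once per (word, target) pair.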

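4. Code files

word2vec.py trains word2vec and implements its main functionality.
word2phrase.py implements the phrase-based training proposed in the paper; its main job is to convert the original training data file into a phrase-based file.
visualize.py visualizes the training results using TensorBoard.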
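
To make the conversion done by word2phrase.py more concrete, here is a minimal sketch of the bigram scoring rule from the cited paper, score(a, b) = (count(a b) - delta) / (count(a) * count(b)), with pairs scoring above a threshold merged into a single token. The function names, the default delta and threshold values, and the "_" joiner are assumptions for this example, not taken from word2phrase.py.

```python
from collections import Counter

def find_phrases(sentences, delta=5, threshold=1e-4):
    """Return the set of word pairs whose score exceeds `threshold`.

    `delta` discounts rare pairs so that two infrequent words appearing
    together once or twice are not treated as a phrase.
    """
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    return {
        pair for pair, n in bigrams.items()
        if (n - delta) / (unigrams[pair[0]] * unigrams[pair[1]]) > threshold
    }

def merge_phrases(sentence, phrases):
    """Rewrite one sentence, joining each detected pair into one token."""
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in phrases:
            out.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out
```

Applying merge_phrases to every sentence produces a phrase-based training file; running the two steps again on that output would allow longer phrases to form.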

5. Experimental results

CBOW, hierarchical softmax, subsample_rate=5e-3 (chosen according to the training set)
skip-gram, hierarchical softmax, subsample_rate=5e-3
CBOW, negative sampling, subsample_rate=5e-3
skip-gram, negative sampling, subsample_rate=5e-3
CBOW, hierarchical softmax, no subsampling
(phrase2vec) CBOW, hierarchical softmax, subsample_rate=5e-3
