word2vec-on-wikipedia

A pipeline for training word embeddings with word2vec on a Wikipedia corpus.

How to use

Just run sudo sh run.sh, which will:

  • Download the latest English Wikipedia dump
  • Extract and clean the text from the downloaded dump
  • Pre-process the Wikipedia corpus
  • Train a word2vec model on the processed corpus to produce the word embeddings

Details of each step are discussed below.

Wikipedia dump

The latest English Wikipedia content can be downloaded as a Wikipedia database dump. Here I downloaded all the article pages:

curl -L -O "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
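If you prefer to drive the download from Python instead of curl, here is a minimal sketch (the local file name is just an illustration):

import urllib.request

DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
            "enwiki-latest-pages-articles.xml.bz2")

# The compressed dump is tens of gigabytes, so this can take a long time.
urllib.request.urlretrieve(DUMP_URL, "enwiki-latest-pages-articles.xml.bz2")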

Wikipedia dump extraction

The original Wikipedia dump is in XML format with a fairly complex structure, so we need an extractor tool to parse it. The one I used is from the wikiextractor repository. Only the file WikiExtractor.py is needed, and the descriptions of its parameters can be found in the repository's readme file. The output contains each article's id and title, followed by its content in plain text.

python WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 -b 1G -o extracted --no-template --processes 24
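For reference, a minimal sketch of how downstream scripts can iterate over the extracted articles, assuming the usual wikiextractor output format in which each article is wrapped in <doc id="..." url="..." title="..."> ... </doc> tags (paths and function names are illustrative):

import os
import re

DOC_OPEN = re.compile(r'<doc id="(\d+)" url="[^"]*" title="([^"]*)">')

def iter_articles(extract_dir):
    """Yield (article_id, title, text) from wikiextractor output files."""
    for root, _, files in os.walk(extract_dir):
        for name in sorted(files):
            with open(os.path.join(root, name), encoding="utf-8") as f:
                article_id, title, lines = None, None, []
                for line in f:
                    m = DOC_OPEN.match(line)
                    if m:
                        article_id, title, lines = m.group(1), m.group(2), []
                    elif line.startswith("</doc>"):
                        yield article_id, title, "".join(lines)
                    else:
                        lines.append(line)

for article_id, title, text in iter_articles("extracted"):
    print(article_id, title, len(text))
    break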

Text pre-processing

Before the word2vec training, the corpus needs to be pre-processed. This basically includes: sentence splitting, sentence tokenization, removing sentences that contain fewer than 20 characters or fewer than 5 tokens, and converting all numerals to 0. For example, "1993" would be converted into "0000".

python wiki-corpus-prepare.py extracted/wiki processed/wiki
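The length filters and numeral conversion described above amount to something like the following sketch (the thresholds come from the description; the function names are illustrative, not the ones used in wiki-corpus-prepare.py):

import re

MIN_CHARS = 20   # drop sentences shorter than 20 characters
MIN_TOKENS = 5   # drop sentences with fewer than 5 tokens

def keep_sentence(tokens):
    """Return True if a tokenized sentence passes the length filters."""
    sentence = " ".join(tokens)
    return len(sentence) >= MIN_CHARS and len(tokens) >= MIN_TOKENS

def normalize_numerals(token):
    """Replace every digit with 0, e.g. "1993" -> "0000"."""
    return re.sub(r"\d", "0", token)

tokens = ["The", "film", "was", "released", "in", "1993", "."]
if keep_sentence(tokens):
    print(" ".join(normalize_numerals(t) for t in tokens))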

Here I used Stanford CoreNLP toolkit 3.8.0 for sentence tokenization. To use it, we need to set up a local server within the downloaded toolkit folder:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

In the script wiki-corpus-prepare.py, I used a Python wrapper for the Stanford CoreNLP server so that the Java server can be called from Python.
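The wrapper is not named here; as one possibility, the pycorenlp package can talk to the server started above. A minimal sketch (the annotator settings are my assumption, not necessarily what wiki-corpus-prepare.py uses):

from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP("http://localhost:9000")

text = "Word2vec was created by a team at Google. It was released in 2013."
# With outputFormat=json, annotate returns the parsed response as a dict.
ann = nlp.annotate(text, properties={
    "annotators": "tokenize,ssplit",
    "outputFormat": "json",
})

# Each sentence comes back as a list of token dictionaries.
for sentence in ann["sentences"]:
    tokens = [tok["word"] for tok in sentence["tokens"]]
    print(" ".join(tokens))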

Word2vec training

Once the processed Wikipedia corpus is ready, we can start the word2vec training. Here I used the Google word2vec tool, which is standard and efficient. The tool is already in this repository, but in case you want to download the original one, you can find it here.

./word2vec -train ../processed/wiki -output ../results/enwiki.skip.size300.win10.neg15.sample1e-5.min10.bin -cbow 0 -size 300 -window 10 -negative 15 -hs 0 -sample 1e-5 -threads 24 -binary 1 -min-count 10
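For quick inspection, the resulting binary file can be loaded with, for example, gensim (not part of this repository; a sketch):

from gensim.models import KeyedVectors

# Load the binary vectors produced by the word2vec tool above.
vectors = KeyedVectors.load_word2vec_format(
    "../results/enwiki.skip.size300.win10.neg15.sample1e-5.min10.bin",
    binary=True)

print(vectors.most_similar("king", topn=5))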

Evaluation of word embeddings

After the word embeddings are trained, we want to evaluate their quality. Here I used the word relation (analogy) test set described in Efficient Estimation of Word Representations in Vector Space.

./compute-accuracy ../results/enwiki.skip.size300.win10.neg15.sample1e-5.min10.bin < questions-words.txt
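The same analogy test can also be run from Python with gensim's built-in evaluator (a sketch; the numbers reported below come from compute-accuracy, not from this snippet):

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "../results/enwiki.skip.size300.win10.neg15.sample1e-5.min10.bin",
    binary=True)

# questions-words.txt is the analogy test set shipped with word2vec.
score, sections = vectors.evaluate_word_analogies("questions-words.txt")
print("overall analogy accuracy: {:.2%}".format(score))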

In my experiments, the vocabulary size of the obtained word embeddings is 833,976 and the corpus contains 2,333,367,969 tokens. I generated several word embedding files with different vector sizes: 50, 100, 200, 300 and 400. The word relation test performance of each file is listed in the following table:

vector size   Word relation test performance (%)
50            47.33
100           54.94
200           69.41
300           71.29
400           71.80

As you can see, the vector size influences the word relation test performance, and within a certain range, the larger the vector size, the better the performance.


License: MIT License

