
Word2Vec_for_Chinese_Corpus

Train character vector representations for a Chinese corpus.

Processing Outline

1. Extract Chinese corpus files from the Wikipedia dump

Use the dataset: zhwiki-20170220-pages-articles1.xml.bz2

Data Download Site

Run WikiExtractor.py to produce the files wiki_00~wiki_07.

Use

cat wiki_* > processed_zhwiki.txt

to merge these files into one.
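
If the shell is not available, a pure-Python equivalent of the cat command above could look like this (file names follow the outline; this helper is an assumption, not part of the repository):

```python
import glob

# Pure-Python equivalent of `cat wiki_* > processed_zhwiki.txt`:
# concatenate the WikiExtractor output files into a single corpus file.
with open('processed_zhwiki.txt', 'w', encoding='utf-8') as merged:
    for path in sorted(glob.glob('wiki_*')):
        with open(path, encoding='utf-8') as part:
            merged.write(part.read())
```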

2. Chinese Conversion

Convert Traditional Chinese to Simplified Chinese using OpenCC.

A common conversion command (the paths correspond to a Homebrew installation of OpenCC 1.0.4):

/usr/local/Cellar/opencc/1.0.4/bin/opencc   -i processed_zhwiki.txt  -o transformed_zh_wiki -c /usr/local/Cellar/opencc/1.0.4/share/opencc/t2s.json 
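
The same conversion can also be done from Python with the opencc package; the sketch below assumes the opencc-python-reimplemented package and is not part of the original pipeline:

```python
from opencc import OpenCC  # pip install opencc-python-reimplemented (assumption)

cc = OpenCC('t2s')  # Traditional Chinese -> Simplified Chinese

# Convert the extracted corpus line by line, mirroring the opencc CLI call above.
with open('processed_zhwiki.txt', encoding='utf-8') as src, \
        open('transformed_zh_wiki', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(cc.convert(line))
```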

3. Delete empty brackets

Delete the empty brackets left behind by WikiExtractor.py (e.g. parentheses whose contents, such as links or templates, were stripped during extraction).
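
A minimal sketch of this cleanup step (the exact bracket set and the output file name are assumptions; the original cleanup script is not shown in the outline):

```python
import re

# Full-width and half-width bracket pairs that may be left empty
# after WikiExtractor strips links and templates (assumed set).
EMPTY_BRACKETS = re.compile(r'（\s*）|\(\s*\)|「\s*」|《\s*》|\[\s*\]')

with open('transformed_zh_wiki', encoding='utf-8') as src, \
        open('cleaned_zh_wiki.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(EMPTY_BRACKETS.sub('', line))
```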

4. Segmentation

Run Tokenization.py to segment the corpus with Jieba (a sketch of this step follows the comparison table below).

Common Methods of segmentation:

| Method | Chinese Segmentation Algorithm | Related Link |
| --- | --- | --- |
| Jieba | Based on a prefix dictionary structure to achieve efficient word graph scanning. Builds a directed acyclic graph (DAG) for all possible word combinations, uses dynamic programming to find the most probable combination based on word frequency, and for unknown words applies an HMM-based model with the Viterbi algorithm. | Github |
| THULAC (THU Lexical Analyzer for Chinese) | Based on a structured perceptron | Github, paper (2009) |
| StanfordSegmenter | Based on CRF | Github, Tutorials, paper (2005), paper (2008) |
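
A minimal sketch of what Tokenization.py might do with Jieba (the input/output file names follow the earlier steps and are assumptions; the actual script may differ):

```python
import jieba

# Segment the cleaned simplified-Chinese corpus with Jieba and write
# space-separated tokens, one sentence per line, for gensim to consume.
with open('cleaned_zh_wiki.txt', encoding='utf-8') as src, \
        open('segmented_zh_wiki.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        tokens = jieba.cut(line.strip())
        dst.write(' '.join(tokens) + '\n')
```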

5. Word2Vec (Skip-gram) in Gensim

Run Word2Vec_train.py to train character vectors for the Chinese corpus (a training sketch follows the parameter list below).

Parameter Set:

  • sg = 1 # use skip-gram
  • hs = 0 and negative=5 # use negative sampling rather than hierarchical softmax
  • size = 100 # the dimensionality of the feature vectors
  • alpha = 0.025 # learning rate
  • window = 5 # context window
  • min_count=5 # ignore all words with total frequency lower than 5
  • sample = 0.001 # threshold for configuring which higher-frequency words are randomly downsampled; default is 1e-3, useful range is (0, 1e-5).
  • batch_words = 10000 # target size for batches of examples passed to worker threads
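
A minimal training sketch with these parameters (file names are assumptions; the size argument is the pre-4.0 gensim name, renamed vector_size in gensim >= 4.0):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One pre-segmented sentence per line, tokens separated by spaces (output of step 4).
sentences = LineSentence('segmented_zh_wiki.txt')

model = Word2Vec(
    sentences,
    sg=1,               # skip-gram
    hs=0,               # no hierarchical softmax ...
    negative=5,         # ... use negative sampling with 5 noise words instead
    size=100,           # dimensionality of the vectors (vector_size in gensim >= 4.0)
    alpha=0.025,        # initial learning rate
    window=5,           # context window
    min_count=5,        # ignore tokens with total frequency lower than 5
    sample=0.001,       # down-sampling threshold for frequent tokens
    batch_words=10000,  # target batch size passed to worker threads
    workers=4,          # number of training threads (assumption)
)

model.save('zhwiki_word2vec.model')
```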

API document of Word2Vec in gensim

You can review the results here.
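
For example, nearest neighbours of a token can be inspected from the trained model (the model file name and the query word are placeholders):

```python
from gensim.models import Word2Vec

model = Word2Vec.load('zhwiki_word2vec.model')

# Print the 10 tokens closest to the query by cosine similarity.
for token, similarity in model.wv.most_similar('北京', topn=10):
    print(token, similarity)
```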
