BiKCCA

Code for our paper "Improving Vector Space Word Representations Via Kernel Canonical Correlation Analysis" in TALLIP [pdf]

Setup

This software runs on Python 3.6 with the following libraries:

  • numpy 1.16.2
  • scikit-learn 0.20.2

Getting started

  1. Preparing monolingual word embeddings and dictionaries.
    $word2vec/word2vec -train $corpus_en -window 5 -iter 10 -size 200 -threads 16 -output embeddings_size200.en 
    $word2vec/word2vec -train $corpus_zh -window 5 -iter 10 -size 200 -threads 16 -output embeddings_size200.zh 
  2. Generating bilingual word embeddings with our method (BiKCCA).
    python train.py -slang $src_lang -tlang $tgt_lang -semb $src_path -temb $tgt_path -d $dict_path -reg 0.3  -g1 0.001  -g2 0.001
The `reg`, `g1` and `g2` flags are the KCCA hyperparameters; they can be tuned on a validation set (see the illustrative sketch after this list).
  3. The resulting bilingual word embeddings will be stored in the directory output/src_lang-tgt_lang/.

  4. To evaluate the bilingual word embeddings, please refer to the code of this work; a similarly illustrative evaluation sketch is also given below.
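
For intuition, here is a minimal, illustrative sketch of regularized kernel CCA in the spirit of what `train.py` computes. It is not the repository's implementation: the `kcca` helper, the use of RBF kernels, and the exact roles of `reg`, `g1` and `g2` are assumptions made for the example, using only the numpy and scikit-learn dependencies listed above.

    # kcca_sketch.py -- illustrative regularized KCCA, not the paper's exact code.
    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.preprocessing import KernelCenterer

    def kcca(X, Y, reg=0.3, g1=0.001, g2=0.001, n_components=50):
        """Projection coefficients (alpha, beta) for two embedding views.

        X, Y : (n, d) source/target embeddings of n seed-dictionary pairs.
        reg  : regularization added to the kernel matrices (the -reg flag).
        g1/g2: RBF kernel widths for the two views (how -g1/-g2 are used is assumed).
        """
        n = X.shape[0]
        Kx = KernelCenterer().fit_transform(rbf_kernel(X, gamma=g1))
        Ky = KernelCenterer().fit_transform(rbf_kernel(Y, gamma=g2))
        I = np.eye(n)
        # Hardoon-style simplification of the KCCA eigenproblem:
        #   (Kx + reg*I)^-1 Ky (Ky + reg*I)^-1 Kx alpha = rho^2 alpha
        M = np.linalg.solve(Kx + reg * I, Ky) @ np.linalg.solve(Ky + reg * I, Kx)
        vals, vecs = np.linalg.eig(M)
        top = np.argsort(-vals.real)[:n_components]
        alpha = vecs[:, top].real
        # beta is recovered from alpha through the coupled constraint.
        beta = np.linalg.solve(Ky + reg * I, Kx @ alpha)
        return alpha, beta

Projecting held-out words then amounts to evaluating the kernels against the dictionary words and multiplying by `alpha` (or `beta`).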

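As a companion, below is a hedged sketch of one common way to evaluate bilingual embeddings, word translation retrieval by cosine nearest neighbour; the evaluation code referred to in step 4 may follow a different protocol.

    # eval_sketch.py -- illustrative precision@1 for word translation retrieval.
    import numpy as np

    def precision_at_1(src_vecs, tgt_vecs, src_vocab, tgt_vocab, test_pairs):
        """src_vecs/tgt_vecs: (V, d) L2-normalized bilingual embeddings.
        test_pairs: (source word, gold translation) pairs from a held-out dictionary;
        all test words are assumed to be in-vocabulary."""
        src_index = {w: i for i, w in enumerate(src_vocab)}
        hits = 0
        for s, t in test_pairs:
            sims = tgt_vecs @ src_vecs[src_index[s]]   # cosine similarity on unit vectors
            if tgt_vocab[int(np.argmax(sims))] == t:
                hits += 1
        return hits / len(test_pairs)
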
References

Please cite "Improving Vector Space Word Representations Via Kernel Canonical Correlation Analysis" if you find the resources in this repository useful.

  @article{BaiCZ-18-tallip,
   author = {Bai, Xuefeng and Cao, Hailong and Zhao, Tiejun},
   title = {Improving Vector Space Word Representations Via Kernel Canonical Correlation Analysis},
   journal = {ACM Transactions on Asian and Low-Resource Language Information Processing},
   issue_date = {August 2018},
   publisher = {ACM},
   address = {New York, NY, USA}
  } 


License

MIT License

