BiKCCA

Code for our paper "Improving Vector Space Word Representations Via Kernel Canonical Correlation Analysis" in TALLIP [pdf]

Setup

This software runs on Python 3.6 with the following libraries:

  • numpy 1.16.2
  • scikit-learn 0.20.2

Getting started

  1. Preparing monolingual word embeddings and dictionaries.
    $word2vec/word2vec -train $corpus_en -window 5 -iter 10 -size 200 -threads 16 -output embeddings_size200.en 
    $word2vec/word2vec -train $corpus_zh -window 5 -iter 10 -size 200 -threads 16 -output embeddings_size200.zh 
  2. Generating bilingual word embeddings with our method (BiKCCA).
    python train.py -slang $src_lang -tlang $tgt_lang -semb $src_path -temb $tgt_path -d $dict_path -reg 0.3  -g1 0.001  -g2 0.001
The `reg`, `g1` and `g2` flags are the KCCA hyperparameters; they can be tuned on a validation set (see the illustrative sketch after this list).
  3. The resulting bilingual word embeddings will be stored in the directory output/src_lang-tgt_lang/.

  4. To evaluate the bilingual word embeddings, please refer to the code of this work; a similarly illustrative evaluation sketch is also given below.
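
For intuition, here is a minimal, illustrative sketch of regularized kernel CCA in the spirit of what `train.py` computes. It is not the repository's implementation: the `kcca` helper, the use of RBF kernels, and the exact roles of `reg`, `g1` and `g2` are assumptions made for the example, using only the numpy and scikit-learn dependencies listed above.

    # kcca_sketch.py -- illustrative regularized KCCA, not the paper's exact code.
    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.preprocessing import KernelCenterer

    def kcca(X, Y, reg=0.3, g1=0.001, g2=0.001, n_components=50):
        """Projection coefficients (alpha, beta) for two embedding views.

        X, Y : (n, d) source/target embeddings of n seed-dictionary pairs.
        reg  : regularization added to the kernel matrices (the -reg flag).
        g1/g2: RBF kernel widths for the two views (how -g1/-g2 are used is assumed).
        """
        n = X.shape[0]
        Kx = KernelCenterer().fit_transform(rbf_kernel(X, gamma=g1))
        Ky = KernelCenterer().fit_transform(rbf_kernel(Y, gamma=g2))
        I = np.eye(n)
        # Hardoon-style simplification of the KCCA eigenproblem:
        #   (Kx + reg*I)^-1 Ky (Ky + reg*I)^-1 Kx alpha = rho^2 alpha
        M = np.linalg.solve(Kx + reg * I, Ky) @ np.linalg.solve(Ky + reg * I, Kx)
        vals, vecs = np.linalg.eig(M)
        top = np.argsort(-vals.real)[:n_components]
        alpha = vecs[:, top].real
        # beta is recovered from alpha through the coupled constraint.
        beta = np.linalg.solve(Ky + reg * I, Kx @ alpha)
        return alpha, beta

Projecting held-out words then amounts to evaluating the kernels against the dictionary words and multiplying by `alpha` (or `beta`).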

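As a companion, below is a hedged sketch of one common way to evaluate bilingual embeddings, word translation retrieval by cosine nearest neighbour; the evaluation code referred to in step 4 may follow a different protocol.

    # eval_sketch.py -- illustrative precision@1 for word translation retrieval.
    import numpy as np

    def precision_at_1(src_vecs, tgt_vecs, src_vocab, tgt_vocab, test_pairs):
        """src_vecs/tgt_vecs: (V, d) L2-normalized bilingual embeddings.
        test_pairs: (source word, gold translation) pairs from a held-out dictionary;
        all test words are assumed to be in-vocabulary."""
        src_index = {w: i for i, w in enumerate(src_vocab)}
        hits = 0
        for s, t in test_pairs:
            sims = tgt_vecs @ src_vecs[src_index[s]]   # cosine similarity on unit vectors
            if tgt_vocab[int(np.argmax(sims))] == t:
                hits += 1
        return hits / len(test_pairs)
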
References

Please cite "Improving Vector Space Word Representations Via Kernel Canonical Correlation Analysis" if you find the resources in this repository useful.

  @article{BaiCZ-18-tallip,
   author = {Bai, Xuefeng and Cao, Hailong and Zhao, Tiejun},
   title = {Improving Vector Space Word Representations Via Kernel Canonical Correlation Analysis},
   journal = {ACM Transactions on Asian and Low-Resource Language Information Processing},
   issue_date = {August 2018},
   publisher = {ACM},
   address = {New York, NY, USA}
  } 


License

MIT License

