vyraun / dlp

Code for "On Dimensional Linguistic Properties of the Word Embedding Space".

Home Page: https://www.aclweb.org/anthology/2020.repl4nlp-1.19/

This is the code for our ACL RepL4NLP 2020 publication, "On Dimensional Linguistic Properties of the Word Embedding Space".

Word embeddings have become a staple of several natural language processing tasks, yet much remains to be understood about their properties. In this work, we analyze word embeddings in terms of their principal components and arrive at a number of novel and counterintuitive observations. In particular, we characterize the utility of variance explained by the principal components as a proxy for downstream performance. Furthermore, through syntactic probing of the principal embedding space, we show that the syntactic information captured by a principal component does not correlate with the amount of variance it explains. Consequently, we investigate the limitations of variance-based embedding post-processing algorithms and demonstrate that such post-processing is counter-productive in sentence classification and machine translation tasks. Finally, we offer a few precautionary guidelines on applying variance-based embedding post-processing and explain why non-isotropic geometry might be integral to word embedding performance.
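
For readers new to the terminology, the two quantities the abstract refers to (variance explained by each principal component, and variance-based post-processing that removes the top components) can be illustrated with a minimal NumPy sketch. This is not the code from this repository: the function names `pca_variance_ratio` and `remove_top_components`, the random toy matrix, and the choice of removing two components are all assumptions made for illustration only.

```python
import numpy as np


def pca_variance_ratio(embeddings):
    """Return principal directions and the fraction of variance each explains."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the centered matrix: the rows of vt are the principal directions,
    # ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variances = singular_values ** 2
    return vt, variances / variances.sum()


def remove_top_components(embeddings, num_components):
    """Illustrative variance-based post-processing (hypothetical helper).

    Centers the embeddings and projects out the `num_components` principal
    directions that explain the most variance.
    """
    centered = embeddings - embeddings.mean(axis=0)
    vt, _ = pca_variance_ratio(embeddings)
    top = vt[:num_components]  # shape: (num_components, dim)
    return centered - centered @ top.T @ top


# Toy stand-in for a real embedding matrix (1000 "words", 50 dimensions).
# Real experiments would load pretrained vectors instead.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(1000, 50)) * np.linspace(3.0, 0.5, 50)

directions, ratio = pca_variance_ratio(toy_embeddings)
processed = remove_top_components(toy_embeddings, num_components=2)

print("variance explained by the first three components:", ratio[:3])
print("shape after post-processing:", processed.shape)
```

The sketch only shows how the variance ratio and the projection are computed; whether such post-processing helps or hurts a downstream task is the empirical question the paper studies.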

If you find our code useful, please cite our paper:

@inproceedings{raunak-etal-2020-dimensional,
    title = "On Dimensional Linguistic Properties of the Word Embedding Space",
    author = "Raunak, Vikas  and
      Kumar, Vaibhav  and
      Gupta, Vivek  and
      Metze, Florian",
    booktitle = "Proceedings of the 5th Workshop on Representation Learning for NLP",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.repl4nlp-1.19",
    pages = "156--165"
}

Steps to Replicate the Experiments

Please read the READMEs in the individual directories for the corresponding experiments. Start with word_evaluation, then use the generated embeddings in sentence_evaluation and mt.

Languages

Python 89.1%, Shell 7.7%, Perl 2.2%, sed 1.0%