language-modeling machine-learning word-embeddings

lexpart

Companion code for "Toward a Thermodynamics of Meaning," CHR 2020
Official: http://ceur-ws.org/Vol-2723/short40.pdf
arXiv: https://arxiv.org/abs/2009.11963

This repository contains a simple reference implementation of a linguistic partition function as described in the paper, with some limited documentation.

Installation

The repository is pip-installable:

pip install git+https://github.com/senderle/lexpart#egg=lexpart

Usage Example

To train an embedding based on the included test dataset (enwiki8), run the following commands:

python -m lexpart vocab vocab.npz -
python -m lexpart corpus corpus.npz vocab.npz -
python -m lexpart embed embed.npz corpus.npz
python -m lexpart wordsim embed.npz paris

This will print out a list of words in the corpus that are similar to "paris."
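As a rough illustration of what a word-similarity lookup over an embedding involves, here is a minimal sketch. It assumes cosine similarity, which is a common choice but is not confirmed by this README as the measure that wordsim uses. The function names and the toy vectors are hypothetical and are not taken from lexpart.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm


def most_similar(query, vectors, top_n=3):
    """Rank every other word by cosine similarity to the query word."""
    scores = {
        word: cosine(vectors[query], vec)
        for word, vec in vectors.items()
        if word != query
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy vectors invented for illustration only (assumption, not lexpart output).
toy_vectors = {
    "paris": [0.9, 0.1, 0.3],
    "london": [0.8, 0.2, 0.4],
    "banana": [0.1, 0.9, 0.0],
}

print(most_similar("paris", toy_vectors))
```

With a real embedding, the vectors would come from the trained model rather than a hand-written dictionary.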

To train an embedding based on your own corpus, replace the - in the above commands with the path to a folder containing plain text files.

Mathematical Fine Print

The model described in the paper is based on the grand canonical partition function for multiple species in its standard form:

Z = \sum_i e^{\beta(\mu_1 N_{1,i} + \mu_2 N_{2,i} + \dots + \mu_k N_{k,i} - E_i)}

For computational purposes, however, it's convenient to represent the partition function in another form. Substituting u_j for e^{βμ_j} for each species j = 1, ..., k, we can rewrite the above like so:

Z = \sum_i u_1^{N_{1,i}} u_2^{N_{2,i}} \cdots u_k^{N_{k,i}} e^{-\beta E_i}
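The substitution can be spelled out step by step. The index j, which runs over the k species, is introduced here only to keep the derivation compact:

```latex
\begin{align*}
Z &= \sum_i e^{\beta(\mu_1 N_{1,i} + \dots + \mu_k N_{k,i} - E_i)} \\
  &= \sum_i \Big( \prod_{j=1}^{k} e^{\beta \mu_j N_{j,i}} \Big) e^{-\beta E_i} \\
  &= \sum_i \Big( \prod_{j=1}^{k} \big(e^{\beta \mu_j}\big)^{N_{j,i}} \Big) e^{-\beta E_i} \\
  &= \sum_i \Big( \prod_{j=1}^{k} u_j^{N_{j,i}} \Big) e^{-\beta E_i},
  \qquad u_j = e^{\beta \mu_j}.
\end{align*}
```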

If we cheat a bit by treating the energy term (e^−βE_i) as a constant for all i, we can treat the partition function as one huge polynomial. Each term in the polynomial represents a sentence as a bag of words, where the exponent is the word count. Since counts for sentences are sparse, and differentiation is a linear operator, we can calculate values for the Jacobian and Hessian very efficiently. The code that performs this calculation is in sparsehess.py.
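To make the polynomial view concrete, here is a minimal sketch of the idea. It is not the code in sparsehess.py. It assumes the energy factor is the constant 1, represents each sentence as a list of tokens, and stores the first and second derivatives in plain dictionaries so that only words that actually co-occur in some sentence get an entry. The function name and data layout are hypothetical.

```python
from collections import Counter, defaultdict
from math import prod


def partition_derivatives(sentences, u):
    """Return (Z, gradient, hessian) for Z(u) = sum_i prod_w u[w] ** count_i(w).

    Assumes every u[w] is nonzero and the energy factor is 1 for all sentences.
    """
    z = 0.0
    grad = defaultdict(float)
    hess = defaultdict(float)
    for sentence in sentences:
        counts = Counter(sentence)  # bag of words: word -> exponent
        term = prod(u[w] ** n for w, n in counts.items())
        z += term
        for a, n_a in counts.items():
            # d/du_a of u_a ** n_a is n_a * u_a ** (n_a - 1)
            grad[a] += n_a * term / u[a]
            for b, n_b in counts.items():
                # Diagonal entries differentiate the same factor twice.
                coeff = n_a * (n_a - 1) if a == b else n_a * n_b
                if coeff:
                    hess[a, b] += coeff * term / (u[a] * u[b])
    return z, dict(grad), dict(hess)


sentences = [["the", "cat", "sat"], ["the", "cat", "the"]]
u = {"the": 1.0, "cat": 1.0, "sat": 1.0}
z, grad, hess = partition_derivatives(sentences, u)
```

Because differentiation is linear, each sentence contributes independently, and a sentence only touches the entries for the words it contains. That is what keeps the derivative computation sparse in this sketch.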

There are some interesting connections between this way of thinking about sentences and contexts in natural language and the way of thinking about data types described in Conor McBride's "The Derivative of a Regular Type is its Type of One-Hole Contexts."


MIT License
