TimotheeMickus / mf-correl

Repository for "What Meaning-Form Correlation Has to Compose With"


What Meaning-Form Correlation Has to Compose With

This is the repository for the COLING 2020 paper "What Meaning-Form Correlation Has to Compose With".

Installing

An installation script, install.sh, is available to set up all external dependencies required by the code. The process assumes a UNIX environment with working installs of Python 3.7 and Python 2, as well as the virtualenv tool. Running the code also requires a working Java environment (tested with OpenJDK 11.0.6 2020-01-14).
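Before running install.sh, it can help to confirm that the prerequisites listed above are reachable from the PATH. The following is a minimal sketch and not part of the repository: the executable names (python3.7, python2, virtualenv, java) are assumptions and may differ on your system, and the helper missing_tools is a hypothetical name.

```python
import shutil

# Executable names are assumptions; adjust them to match your system
# (for example "python3" instead of "python3.7").
REQUIRED_TOOLS = ("python3.7", "python2", "virtualenv", "java")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All prerequisites found on the PATH.")
```

This only checks that the executables exist; it does not verify their versions.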

NB: The installation script will download all dependencies, which might require significant space.

Structure

Code is stored in src/. The src/shared/ directory contains code shared by some or all experiments. Scripts whose names start with src/exp1_, src/exp2_ and src/exp3_ correspond to the artificial language experiments, the definition experiments and the sentence experiments, respectively. The subdirectory src/exp3_embs/ contains the scripts that compute or retrieve sentence embeddings.

Data is available under data/; its subdirectories correspond to the different experiments.

"Push-button" scripts are available to reproduce experiments: exp1.sh, exp2.sh, exp3.sh.

NB: Temporary files produced for experiments 2 and 3 are very large (more than 150 GB). Consider running only part of the experiments, or make sure you have enough dedicated free space.
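A quick way to check the available space before launching exp2.sh or exp3.sh is sketched below. It is not part of the repository: the helper has_enough_space is a hypothetical name, and reading the figure from the note above as 150 * 10**9 bytes is an assumption. Run it from the directory where the temporary files will be written.

```python
import shutil

# The threshold comes from the note above; interpreting it as
# 150 * 10**9 bytes is an assumption.
REQUIRED_BYTES = 150 * 10**9


def has_enough_space(path=".", required=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has at least `required` free bytes."""
    return shutil.disk_usage(path).free >= required


if __name__ == "__main__":
    free_gb = shutil.disk_usage(".").free / 10**9
    print(f"Free space: {free_gb:.1f} GB")
    if not has_enough_space("."):
        print("Below the threshold: consider running only part of the experiments.")
```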

Acknowledgments

We used the Mantel test implementation from J. W. Carr's GitHub repository.
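For readers unfamiliar with the technique, a Mantel test measures the correlation between two distance matrices defined over the same items, and estimates significance by repeatedly permuting the items of one matrix. The following is a minimal standard-library sketch for illustration only. It is not J. W. Carr's implementation and not the code used in this repository; the function names, the use of Pearson correlation, and the one-sided Monte Carlo p-value convention are all assumptions.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def upper_triangle(matrix, order):
    """Flatten the strict upper triangle of `matrix`, with items reordered by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel_test(dist_a, dist_b, permutations=999, seed=0):
    """Return (observed correlation, one-sided Monte Carlo p-value)."""
    rng = random.Random(seed)
    identity = list(range(len(dist_a)))
    flat_a = upper_triangle(dist_a, identity)
    observed = pearson(flat_a, upper_triangle(dist_b, identity))
    at_least_as_high = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # relabel the items of the second matrix
        if pearson(flat_a, upper_triangle(dist_b, order)) >= observed:
            at_least_as_high += 1
    # Count the observed arrangement itself among the permutations.
    p_value = (at_least_as_high + 1) / (permutations + 1)
    return observed, p_value


if __name__ == "__main__":
    # Toy data: the second matrix is a rescaled copy of the first.
    points = [0.0, 1.0, 3.0, 6.0, 10.0]
    dist_a = [[abs(p - q) for q in points] for p in points]
    dist_b = [[2 * d for d in row] for row in dist_a]
    r, p = mantel_test(dist_a, dist_b)
    print(f"r = {r:.3f}, p = {p:.3f}")
```

Permuting item labels, rather than shuffling individual matrix cells, preserves the dependency between distances that share an item, which is why a Mantel test is used instead of a plain correlation test on the flattened matrices.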

The implementation of APTED by Pawlik & Augsten is from their original repository (written in Java). The JAR we provide has been hacked to accept a file of pairs of trees at once, instead of a single pair. The single edited Java file is available for reference in the directory shared/apted/.

Pre-trained word embeddings are available from their original repositories, or at these links: Word2Vec, GloVe 6B and 840B, FastText.

Sentence encoders are from their original repositories: SkipThoughts (written in Python 2) and InferSent. See also the original Google Hub for USE DAN and USE Transformer.

Lastly, the two datasets used to evaluate embeddings, MEN by Bruni et al. and SICK by Baroni et al., can be retrieved from their respective homepages.
