WUGs

Scripts to process Word Usage Graphs (WUGs).

If you use this software for academic research, please cite these papers:

Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
Dominik Schlechtweg. 2023. Human and Computational Measurement of Lexical Semantic Change. PhD thesis. University of Stuttgart.

Find WUG data sets on the WUGsite.

Usage

Under scripts/ we provide a pipeline that creates and clusters graphs and extracts data from them (e.g. change scores). On a UNIX-based system, first make the scripts executable with

chmod 755 scripts/*.sh

Then run one of the following commands for Usage-Usage Graphs (UUGs) and Usage-Sense Graphs (USGs) respectively:

bash -e scripts/run_uug.sh
bash -e scripts/run_usg.sh

For the alternative pipeline with multiple possible clustering algorithms (Correlation Clustering, Weighted Stochastic Block Model, Chinese Whispers, Louvain method) and custom plotting functionalities, instead run:

bash -e scripts/run_uug2.sh

There are two scripts for external use with the DURel annotation tool that allow you to specify the input directory and other parameters from the command line (find usage examples in test.sh):

bash -e scripts/run_system.sh $dir ...
bash -e scripts/run_system2.sh $dir ...

Attention: The script modifies graphs iteratively, i.e., the current run depends on the previous run. To avoid this dependence, the script deletes previously written data. Important: The script uses simple test parameters; to improve the clustering, load parameters_opt.sh in run_uug.sh or run_usg.sh.

We recommend running the scripts within a Python Anaconda environment. You have two options:

Create and activate the conda environment yourself, and then install the required packages with conda env update --file packages.yml.
Run source install_packages.sh. This will create the conda environment and install all required packages.

Both installation options were tested on Linux. You can test if your installation is working by running

bash -e test.sh

After installation, please check whether pygraphviz was installed correctly. There have been recurring errors with pygraphviz installation across operating systems. If an error occurs, you can check this page for solutions. On Linux, installing graphviz through the package manager is recommended.
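A quick way to run this check is to try importing the package inside the activated environment. The following is only a minimal sketch: the helper name pygraphviz_available is hypothetical and not part of this repository, and a successful import does not guarantee that every plotting feature works.

```python
def pygraphviz_available() -> bool:
    """Return True if pygraphviz can be imported in the active environment.

    Hypothetical helper for illustration; not part of the WUG scripts.
    """
    try:
        import pygraphviz  # noqa: F401  (imported only to test availability)
    except ImportError:
        return False
    return True


print("pygraphviz import OK" if pygraphviz_available() else "pygraphviz import FAILED")
```

If the import fails, installing graphviz first and then reinstalling pygraphviz is the usual starting point.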

Description

data2join.py: joins annotated data
data2annotators.py: extracts mapping from users to (anonymized) annotators
data2agr.py: computes agreement on full data
use2graph.py: adds uses to graph
sense2graph.py: adds senses to graph, for usage-sense graphs
sense2node.py: adds sense annotation data to nodes, if available
judgments2graph.py: adds judgments to graph
graph2cluster.py: clusters graph
extract_clusters.py: extracts clusters from graph
graph2stats.py: extracts statistics from graph, including change scores
graph2plot.py: plots interactive graph in 2D

Please find the parameters for the current optimized WUG versions in parameters_opt.sh. Note that the parameters for the SemEval versions in parameters_semeval.sh will only roughly reproduce the published versions, because of non-deterministic clustering and small changes in the cleaning as well as clustering procedure.

For annotating and plotting your own graphs we recommend using the DURel Tool.

Additional scripts and data

misc/usim2data.sh: downloads USim data and converts it to WUG format
misc/make_release.sh: creates data for publication from pipeline output (compare to format of published data sets on WUGsite)

Input

For usage-usage graphs:

uses: find examples at test_uug/data/*/uses.csv
judgments: find examples at test_uug/data/*/judgments.csv

For usage-sense graphs:

uses: find examples at test_usg/data/*/uses.csv
senses: find examples at test_usg/data/*/senses.csv
judgments: find examples at test_usg/data/*/judgments.csv

Note: The column 'identifier' in each uses.csv should identify each word usage uniquely across all words.
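The uniqueness requirement can be checked before running the pipeline. The sketch below assumes tab-separated files with a header row, as described under Input Format; the function name find_duplicate_identifiers is hypothetical and not part of the provided scripts.

```python
import csv
from collections import Counter
from pathlib import Path


def find_duplicate_identifiers(paths):
    """Return identifiers that occur more than once across the given uses.csv files.

    Hypothetical helper; assumes tab-separated files with an 'identifier' column.
    """
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8", newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            for row in reader:
                counts[row["identifier"]] += 1
    return sorted(ident for ident, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Location of the example files named above; adjust to your own data.
    files = sorted(Path("test_uug/data").glob("*/uses.csv"))
    print(find_duplicate_identifiers(files))
```

An empty list means that no identifier is shared between uses, also across different words.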

Input Format

The uses.csv files must contain one use per line with the following fields specified as header and separated by tabs:

<lemma>\t<pos>\t<date>\t<grouping>\t<identifier>\t<description>\t<context>\t<indexes_target_token>\t<indexes_target_sentence>\n

The CSV files should include one empty line at the end. You can use this example as a guide (ignore its additional columns). The files can contain additional columns with further information such as language, lemmatization, etc.

Find information on the individual fields below:

lemma: the lemma form of the target word in the respective word use
pos: the POS tag if available (else put space character)
date: the date of the use if available (else put space character)
grouping: any string assigning uses to groups (e.g. time-periods, corpora or dialects)
identifier: an identifier unique to each use across lemmas. We recommend the format filename-sentenceno-tokenno
description: any additional information on the use if available (else put space character)
context: the text of the use. This will be shown to annotators.
indexes_target_token: the character indexes of the target token in context, written as a Python slice range start:end with an exclusive end (e.g. 17:25)
indexes_target_sentence: The character indexes of the target sentence (containing the target token) in context (e.g. 0:30 if context contains only one sentence, or 10:45 if it contains additional surrounding sentences). The part of the context beyond the specified character range will be marked as background in gray.
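The two index fields can be read as Python slices over the context string. The following sketch illustrates this with a made-up context; the helper name parse_index_range and the example sentence are assumptions for illustration and are not taken from the data sets.

```python
def parse_index_range(spec: str) -> slice:
    """Turn a 'start:end' string such as '19:28' into a Python slice."""
    start, end = spec.split(":")
    return slice(int(start), int(end))


# Made-up example use: the context holds the target sentence plus one more sentence.
context = "In the morning the aeroplane took off. It was loud."
indexes_target_token = "19:28"
indexes_target_sentence = "0:38"

print(context[parse_index_range(indexes_target_token)])     # aeroplane
print(context[parse_index_range(indexes_target_sentence)])  # In the morning the aeroplane took off.
```

Slicing the context in this way is a simple sanity check that the indexes in a uses.csv row point at the intended token and sentence.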

The judgments.csv files must contain one use pair judgment per line with the following fields specified as header and separated by tabs:

<identifier1>\t<identifier2>\t<annotator>\t<judgment>\t<comment>\t<lemma>\n

Find information on the individual fields below:

identifier1: identifier of the first use in the use pair (must correspond to identifier in uses.csv)
identifier2: identifier of the second use in the use pair
annotator: annotator name
judgment: annotator judgment on graded scale (e.g. 1 for unrelated, 4 for identical)
comment: the annotator's comment (if any)
lemma: the lemma form of the target word in both uses
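Before running the pipeline it can help to check that a judgments file has the expected header and that its identifiers occur in uses.csv. The sketch below uses hypothetical helper names (read_judgments, unknown_identifiers) and made-up sample rows; it also assumes that identifier2, like identifier1, must match an identifier in uses.csv.

```python
import csv
import io

JUDGMENT_FIELDS = ["identifier1", "identifier2", "annotator", "judgment", "comment", "lemma"]


def read_judgments(handle):
    """Read tab-separated judgments and verify that all expected columns are present."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [f for f in JUDGMENT_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def unknown_identifiers(judgments, use_identifiers):
    """Return identifiers used in judgments that do not occur among the known uses."""
    known = set(use_identifiers)
    unknown = set()
    for row in judgments:
        for key in ("identifier1", "identifier2"):
            if row[key] not in known:
                unknown.add(row[key])
    return sorted(unknown)


# Made-up sample data in the format described above.
sample = (
    "identifier1\tidentifier2\tannotator\tjudgment\tcomment\tlemma\n"
    "doc1-3-7\tdoc2-1-4\tannotator1\t4\t \tplane\n"
    "doc1-3-7\tdoc9-9-9\tannotator1\t1\t \tplane\n"
)
rows = read_judgments(io.StringIO(sample))
print(unknown_identifiers(rows, ["doc1-3-7", "doc2-1-4"]))  # ['doc9-9-9']
```

For real data, pass an open file handle for judgments.csv and the identifier column of the matching uses.csv.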

BibTeX

@inproceedings{Schlechtweg2021dwug,
 title = {{DWUG}: A large Resource of Diachronic Word Usage Graphs in Four Languages},
 author = {Schlechtweg, Dominik  and Tahmasebi, Nina  and Hengchen, Simon  and Dubossarsky, Haim  and McGillivray, Barbara},
 booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
 publisher = {Association for Computational Linguistics},
 address = {Online and Punta Cana, Dominican Republic},
 pages = {7079--7091},
 url = {https://aclanthology.org/2021.emnlp-main.567},
 year = {2021}
}

@phdthesis{Schlechtweg2023measurement,
  author  = "Schlechtweg, Dominik",
  title   = "Human and Computational Measurement of Lexical Semantic Change",
  school  = "University of Stuttgart",
  address = "Stuttgart, Germany",
  url = {http://dx.doi.org/10.18419/opus-12833},
  year    = 2023
}
