Scripts to process Word Usage Graphs (WUGs).
If you use this software for academic research, please cite these papers:
- Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages.
- Dominik Schlechtweg and Sabine Schulte im Walde. submitted. Clustering Word Usage Graphs: A Flexible Framework to Measure Changes in Contextual Word Meaning.
Find WUG data sets on the WUGsite.
Under scripts/ we provide a pipeline that creates and clusters graphs and extracts data from them (e.g. change scores). Assuming you are working on a UNIX-based system, first make the scripts executable with
chmod 755 scripts/*.sh
Then run one of the following commands, for Usage-Usage Graphs (UUGs) and Usage-Sense Graphs (USGs) respectively:
bash -e scripts/run_uug.sh
bash -e scripts/run_usg.sh
Attention: the pipeline modifies graphs iteratively, i.e., the current run depends on the previous run. The script therefore deletes previously written data to avoid this dependence.
We recommend running the scripts within a virtual environment with Python 3.9.5. Install the required packages with
pip install -r requirements.txt
For usage-usage graphs:
- uses: find examples at test_uug/data/*/uses.csv
- judgments: find examples at test_uug/data/*/judgments.csv
For usage-sense graphs:
- uses: find examples at test_usg/data/*/uses.csv
- senses: find examples at test_usg/data/*/senses.csv
- judgments: find examples at test_usg/data/*/judgments.csv
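Before running the pipeline on your own data, it can help to check that every word directory contains the input files listed above. The following is a minimal sketch, not part of the repository: the function name `missing_inputs` and the `REQUIRED` table are hypothetical, and the assumption that all listed files are mandatory for a graph type is ours.

```python
from pathlib import Path

# File names are taken from the example paths above. Treating all of them
# as mandatory for a graph type is an assumption of this sketch.
REQUIRED = {
    "uug": ("uses.csv", "judgments.csv"),
    "usg": ("uses.csv", "senses.csv", "judgments.csv"),
}


def missing_inputs(data_dir, graph_type):
    """Return {word directory name: [missing file names]} for incomplete words."""
    missing = {}
    for word_dir in sorted(Path(data_dir).iterdir()):
        if not word_dir.is_dir():
            continue  # skip stray files next to the word directories
        absent = [
            name
            for name in REQUIRED[graph_type]
            if not (word_dir / name).is_file()
        ]
        if absent:
            missing[word_dir.name] = absent
    return missing
```

For example, `missing_inputs("test_usg/data", "usg")` should return an empty dictionary if every word directory is complete.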
Note: The column 'identifier' in each uses.csv should identify each word usage uniquely across all words.
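The uniqueness requirement can be checked with a few lines of Python. This is a sketch under assumptions rather than a script shipped with the repository: the function name `duplicated_identifiers` is hypothetical, and the tab delimiter and disabled quoting are assumptions about the file format that you may need to adapt. Only the column name 'identifier' is taken from the note above.

```python
import csv
from collections import defaultdict
from pathlib import Path


def duplicated_identifiers(data_dir, delimiter="\t"):
    """Map each identifier that occurs more than once to the files containing it."""
    seen = defaultdict(list)
    for uses_file in sorted(Path(data_dir).glob("*/uses.csv")):
        with open(uses_file, newline="", encoding="utf-8") as handle:
            # Assumption: tab-separated values without quoting, so that quote
            # characters inside usage contexts are read literally.
            reader = csv.DictReader(
                handle, delimiter=delimiter, quoting=csv.QUOTE_NONE
            )
            for row in reader:
                seen[row["identifier"]].append(str(uses_file))
    return {ident: files for ident, files in seen.items() if len(files) > 1}
```

An empty result from `duplicated_identifiers("test_uug/data")` would mean that no identifier is reused, neither within one word nor across words.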
- data2join.py: joins annotated data
- data2annotators.py: extracts mapping from users to (anonymized) annotators
- data2agr.py: computes agreement on full data
- use2graph.py: adds uses to graph
- sense2graph.py: adds senses to graph, for usage-sense graphs
- sense2node.py: adds sense annotation data to nodes, if available
- judgments2graph.py: adds judgments to graph
- exclude_nodes.py: excludes nodes with many invalid judgments, removes invalid edges
- graph2cluster.py: clusters graph
- extract_clusters.py: extracts clusters from graph
- graph2stats.py: extracts statistics from graph, including change scores
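To give an idea of what turning judgments into graph edges can involve, here is a minimal standard-library sketch. It is not the code in judgments2graph.py or exclude_nodes.py. It assumes that judgments.csv is tab-separated with columns named `identifier1`, `identifier2` and `judgment`, that a judgment of 0 marks an invalid decision, and that several judgments of the same pair are combined by their median. All of these, including the function name `aggregate_judgments`, are assumptions made for illustration.

```python
import csv
from collections import defaultdict
from statistics import median


def aggregate_judgments(path, delimiter="\t", invalid=0.0):
    """Return {(use_a, use_b): median judgment}, skipping invalid judgments."""
    per_edge = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        for row in reader:
            value = float(row["judgment"])  # column name is an assumption
            if value == invalid:
                continue  # assumed convention: 0 means "cannot decide"
            # Sort the pair so that (a, b) and (b, a) refer to the same edge.
            edge = tuple(sorted((row["identifier1"], row["identifier2"])))
            per_edge[edge].append(value)
    return {edge: median(values) for edge, values in per_edge.items()}
```

The resulting dictionary can then be loaded into a graph library of your choice as a list of weighted, undirected edges.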
Please find the parameters for the current optimized WUG versions in parameters_opt.sh. Note that the parameters for the SemEval versions in parameters_semeval.sh will only roughly reproduce the published versions, because the clustering is non-deterministic and because of small changes in the cleaning and clustering procedures.
For annotating and plotting your own graphs we recommend using the DURel Tool.
- usim2data.sh: downloads USim data and converts it to WUG format
@article{Schlechtweg2021dwug,
title = {{DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages}},
author = {Schlechtweg, Dominik and Tahmasebi, Nina and Hengchen, Simon and Dubossarsky, Haim and McGillivray, Barbara},
year = {2021},
journal = {CoRR},
volume = {abs/2104.08540},
archivePrefix = {arXiv},
eprint = {2104.08540},
url = {https://arxiv.org/abs/2104.08540}
}
@inproceedings{Schlechtweg2021wugs,
title = {{Clustering Word Usage Graphs: A Flexible Framework to Measure Changes in Contextual Word Meaning}},
author = {Schlechtweg, Dominik and {Schulte im Walde}, Sabine},
year = {submitted}
}