akutuzov / WUGs

Code to process Word Usage Graphs

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

WUGs

Scripts to process Word Usage Graphs (WUGs).

If you use this software for academic research, please cite these papers:

Find WUG data sets on the WUGsite.

Usage

Under scripts/ we provide a pipeline creating and clustering graphs and extracting data from them (e.g. change scores). Assuming you are working on a UNIX-based system, first make the scripts executable with

chmod 755 scripts/*.sh

Then run one of the following commands for Usage-Usage Graphs (UUGs) and Usage-Sense Graphs (USGs) respectively:

bash -e scripts/run_uug.sh
bash -e scripts/run_usg.sh

Attention: modifies graphs iteratively, i.e., current run is dependent on previous run. Script deletes previously written data to avoid dependence.

We recommend you to run the scripts within a virtual environment with Python 3.9.5. Install the required packages running pip install -r requirements.txt.

Input

For usage-usage graphs:

  • uses: find examples at test_uug/data/*/uses.csv
  • judgments: find examples at test_uug/data/*/judgments.csv

For usage-sense graphs:

  • uses: find examples at test_usg/data/*/uses.csv
  • senses: find examples at test_usg/data/*/senses.csv
  • judgments: find examples at test_usg/data/*/judgments.csv

Note: The column 'identifier' in each uses.csv should identify each word to usage uniquely across all words.

Description

  • data2join.py: joins annotated data
  • data2annotators.py: extracts mapping from users to (anonymized) annotators
  • data2agr.py: computes agreement on full data
  • use2graph.py: adds uses to graph
  • sense2graph.py: adds senses to graph, for usage-sense graphs
  • sense2node.py: adds sense annotation data to nodes, if available
  • judgments2graph.py: adds judgments to graph
  • exclude_nodes.py: excludes nodes with many invalid judgments, removes invalid edges
  • graph2cluster.py: clusters graph
  • extract_clusters.py: extract clusters from graph
  • graph2stats.py: extracts statistics from graph, including change scores

Please find the parameters for the current optimized WUG versions in parameters_opt.sh. Note that the parameters for the SemEval versions in parameters_semeval.sh will only roughly reproduce the published versions, because of non-deterministic clustering and small changes in the cleaning as well as clustering procedure.

For annotating and plotting your own graphs we recommend to use the DURel Tool.

Additional scripts

  • usim2data.sh: downloads USim data and converts it to WUG format

BibTex

@article{Schlechtweg2021dwug,
	title = {{DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages}},
	author = "Schlechtweg, Dominik and Tahmasebi, Nina and Hengchen, Simon and Dubossarsky, Haim and McGillivray, Barbara",
	year = {2021},
	journal   = {CoRR},
	volume    = {abs/2104.08540},
	archivePrefix = {arXiv},
	eprint    = {2104.08540},
	url = {https://arxiv.org/abs/2104.08540}
}
@inproceedings{Schlechtweg2021wugs,
	title = {{Clustering Word Usage Graphs: A Flexible Framework to Measure Changes in Contextual Word Meaning}},
	author = {Schlechtweg, Dominik and {Schulte im Walde}, Sabine},
	year = {submitted}
}

About

Code to process Word Usage Graphs

License:GNU General Public License v3.0


Languages

Language:Python 91.9%Language:Shell 8.1%