THUKElab / CLEME

The repository of EMNLP 2023 "CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction"


CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction

This repository contains the code and data for our EMNLP 2023 main conference paper: CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction.

CLEME is a reference-based metric that evaluates Grammatical Error Correction (GEC) systems at the chunk level, aiming to provide unbiased F0.5 scores for multi-reference GEC evaluation.

Features

  • CLEME is unbiased, enabling a more objective evaluation pipeline.
  • CLEME can visualize the evaluation process as tables.
  • CLEME currently supports English and Chinese; we plan to extend it to other languages in the future.

Requirements and Installation
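
A typical setup is to clone the repository and install its Python dependencies. This is a sketch under the assumption that a standard requirements.txt sits at the repository root; adjust it to the actual dependency list.

git clone https://github.com/THUKElab/CLEME.git
cd CLEME
pip install -r requirements.txt  # assumption: standard requirements file at the repo root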

Usage

CLI

Evaluate the AMU system

python scripts/evaluate.py --ref tests/examples/conll14.errant --hyp tests/examples/conll14-AMU.errant

{'num_sample': 1312, 'F': 0.2514, 'Acc': 0.7634, 'P': 0.2645, 'R': 0.2097, 'tp': 313.51, 'fp': 871.8, 'fn': 1181.71, 'tn': 6312.0}
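
The reported P, R, F (F0.5), and Acc follow directly from the weighted chunk counts tp, fp, fn, and tn. As a quick sanity check in plain Python (not part of the toolkit):

# Recompute the reported scores from the weighted chunk counts above.
tp, fp, fn, tn = 313.51, 871.8, 1181.71, 6312.0
p = tp / (tp + fp)                                  # 0.2645
r = tp / (tp + fn)                                  # 0.2097
f05 = (1 + 0.5 ** 2) * p * r / (0.5 ** 2 * p + r)   # 0.2514
acc = (tp + tn) / (tp + fp + fn + tn)               # 0.7634
print(round(p, 4), round(r, 4), round(f05, 4), round(acc, 4))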

Visualize the evaluation process as tables

python scripts/evaluate.py  --ref tests/examples/demo.errant  --hyp tests/examples/demo-AMU.errant  --vis

API

Evaluate the AMU system using CLEME-dependent

# Read M2 files (reader is an M2 file reader instance, constructed as in tests/test_cleme.py)
dataset_ref = reader.read("tests/examples/demo.errant")
dataset_hyp = reader.read("tests/examples/demo-AMU.errant")
print(len(dataset_ref), len(dataset_hyp))
print("Example of reference:", dataset_ref[-1])
print("Example of hypothesis:", dataset_hyp[-1])

# Evaluate using CLEME-dependent.
# Weigher config per chunk type (tp / fp / fn): alpha scales the length-based chunk
# weight, min_value / max_value clamp it, and reverse flips the weighting direction
# (used for fp); see the paper for the exact weighting formula.
config_dependent = {
    "tp": {"alpha": 2.0, "min_value": 0.75, "max_value": 1.25, "reverse": False},
    "fp": {"alpha": 2.0, "min_value": 0.75, "max_value": 1.25, "reverse": True},
    "fn": {"alpha": 2.0, "min_value": 0.75, "max_value": 1.25, "reverse": False},
}
metric_dependent = DependentChunkMetric(weigher_config=config_dependent)
score, results = metric_dependent.evaluate(dataset_hyp, dataset_ref)
print(f"==================== Evaluate Demo ====================")
print(score)

# Visualize
metric_dependent.visualize(dataset_ref, dataset_hyp)

Refer to ./tests/test_cleme.py for more details.

Adapt to Other Languages

CLEME is language-agnostic, so you can easily apply it to any language as long as you have reference and hypothesis M2 files.
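
For reference, each entry in an M2 file pairs a tokenized source sentence (an S line) with zero or more edit annotations (A lines). A minimal ERRANT-style English example (illustrative only, not taken from the repository's data):

S This are a sentence .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0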

Recommended Hyper-parameters

We searched for optimal hyper-parameters on the CoNLL-2014 reference set; they are listed in ./cleme/constant.py.

Citation

@article{ye-et-al-2023-cleme,
  title   = {CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction},
  author  = {Ye, Jingheng and Li, Yinghui and Zhou, Qingyu and Li, Yangning and Ma, Shirong and Zheng, Hai-Tao and Shen, Ying},
  journal = {arXiv preprint arXiv:2305.10819},
  year    = {2023}
}

Update Logs

v1.0 (2023.11.15)

CLEME v1.0 released.

Contact & Feedback

If you have any questions or feedback, please send an e-mail to us: yejh22@mails.tsinghua.edu.cn, liyinghu20@mails.tsinghua.edu.cn

License

Apache License 2.0

