HANNA Benchmark Repository
Resources for the paper "Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation", accepted at COLING 2022.
Authors: Cyril Chhun, Pierre Colombo, Fabian Suchanek and Chloé Clavel.
Updates
24/08/2022 - Initial commit
Data
We release in this repository HANNA, a large annotated dataset of Human-ANnotated NArratives for ASG evaluation. HANNA contains annotations for 1,056 stories generated from 96 prompts from the WritingPrompts dataset. Each story was annotated by 3 raters on 6 criteria (Relevance, Coherence, Empathy, Surprise, Engagement and Complexity), for a grand total of 19,008 annotations. Additionally, we release the scores of those 1,056 stories evaluated by 72 automatic metrics.
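The counts above are consistent with each other: 96 prompts times 11 story sources (the 10 ASG systems plus the human reference stories) give 1,056 stories, and 3 raters times 6 criteria per story give 19,008 annotations. A quick sanity check:

```python
# Sanity check of the dataset sizes quoted above.
prompts = 96
sources = 10 + 1   # 10 ASG systems plus the human reference stories
raters = 3
criteria = 6

stories = prompts * sources                 # 1,056 stories
annotations = stories * raters * criteria   # 19,008 annotations
print(stories, annotations)
```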
`hanna_stories_annotations.csv` contains the raw annotations from our experiment.
- `Story ID` is the ID of the story (from 0 to 1,055). Stories are grouped by model (0 to 95 are the Human stories, 96 to 191 are the BertGeneration stories, etc.).
- `Prompt` is the prompt.
- `Human` is the corresponding human story.
- `Story` is the generated story.
- `Model` is the model used to generate the story.
- `Relevance` is the Relevance (RE) score.
- `Coherence` is the Coherence (CH) score.
- `Empathy` is the Empathy (EM) score.
- `Surprise` is the Surprise (SU) score.
- `Engagement` is the Engagement (EG) score.
- `Complexity` is the Complexity (CX) score.
- `Worker ID` is the ID of the mTurk worker.
- `Assignment ID` is the ID of the mTurk assignment.
- `Work time in seconds` is the time the worker spent on the assignment, in seconds.
- `Name` is the name entered by the worker for the first character mentioned in the story.
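As an illustration of how these columns can be aggregated (a minimal pandas sketch over mock rows that follow the schema above, not the released file), the 3 ratings per story can be averaged per story and then per model:

```python
import pandas as pd

# Mock rows following the schema of hanna_stories_annotations.csv
# (a subset of columns, with made-up scores for illustration).
criteria = ["Relevance", "Coherence", "Empathy",
            "Surprise", "Engagement", "Complexity"]
rows = [
    {"Story ID": 0, "Model": "Human",
     "Relevance": 4, "Coherence": 5, "Empathy": 3,
     "Surprise": 2, "Engagement": 4, "Complexity": 3},
    {"Story ID": 0, "Model": "Human",
     "Relevance": 5, "Coherence": 4, "Empathy": 4,
     "Surprise": 3, "Engagement": 4, "Complexity": 4},
    {"Story ID": 96, "Model": "BertGeneration",
     "Relevance": 2, "Coherence": 2, "Empathy": 1,
     "Surprise": 2, "Engagement": 2, "Complexity": 2},
]
df = pd.DataFrame(rows)

# Average the per-rater scores for each story, then per model.
per_story = df.groupby(["Story ID", "Model"])[criteria].mean()
per_model = per_story.groupby("Model")[criteria].mean()
print(per_model.round(2))
```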
`hanna_metrics_scores.csv` contains, per system, the averaged human annotations and the scores of the automatic metrics for each story. For instance, row 2 contains the scores of the stories generated by the BertGeneration model: each list in that row holds the scores of stories 96 to 191 for one metric.
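The notebook uses these per-system scores to correlate automatic metrics with human criteria. As a self-contained sketch (standard library only, with made-up per-system averages rather than the released scores), Kendall's tau between one metric and one criterion can be computed like this:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall rank correlation between two equal-length score lists
    (no tie correction, for illustration only)."""
    pairs = list(combinations(range(len(xs)), 2))
    concordant = sum((xs[i] - xs[j]) * (ys[i] - ys[j]) > 0 for i, j in pairs)
    discordant = sum((xs[i] - xs[j]) * (ys[i] - ys[j]) < 0 for i, j in pairs)
    return (concordant - discordant) / len(pairs)

# Made-up per-system averages (hypothetical numbers; the real ones are in
# hanna_metrics_scores.csv): one human criterion vs. one automatic metric.
coherence = [4.1, 2.3, 3.0, 3.6, 2.8]
some_metric = [0.35, 0.12, 0.20, 0.28, 0.15]
print(kendall_tau(coherence, some_metric))  # the two rankings agree perfectly here
```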
Samples
Story ID | Prompt | Human | Story | Model | RE | CH | EM | SU | EG | CX |
---|---|---|---|---|---|---|---|---|---|---|
99 | Write a story about an elderly wizard and his young female apprentice. | His body was failing. He had taken care of it very well, but 205 years were a long time. Not a drop of alcohol all those long and lonely tavern nights, not a crumb of tobacco for the old pipe. [...] | “Follow me,” his mentor said. “I must stop this wizard.” At that, Tawthorn drew his dagger and leaned towards the woman. “If you were correct, my professor was right. You could change the world, and save the kingdom.” [...] | BertGeneration | 3 | 2 | 2 | 2 | 2 | 3 |
519 | You are a immortal during the zombie apocalypse, During the apocalypse the zombies ignore you and you try to live a normal life during the outbreak. | 50 years is a long time, enough time to go crazy and return sane. I remember before it happened, the CDC joked that they would have a cure “within a week” [...] | After a few weeks of running, you see something inside a tube/pulse generator. I woke up groggy. The day was Monday, it was Tuesday. How was my day going so fast? [...] | GPT-2 | 5 | 5 | 3 | 4 | 4 | 4 |
862 | When a new president is elected, they are given a special security briefing. In reality, this is an old tradition where various directors, military officers and current ministers present fake evidence and compete to see who can convince the president of the most ridiculous things. [...] | “Mr President I want you to know I am telling you this in full confidence .” Said the head of the Secret Service. The President looked at him. “Yes go ahead .” [...] | “Mr. President, you can see this! You know what the problem is. You see, President Obama, in the US, has been working on the latest model of the President 's campaign for over two years! [...] | Fusion | 2 | 1 | 1 | 1 | 1 | 1 |
Jupyter Notebook
We also provide the Jupyter notebook `data_visualization.ipynb` containing the code we used to generate our results. It also allows for easier visualisation of the data from the `csv` files.
The folder `rankings` contains `csv` files used in the Ranking DFs section of the notebook. They can be regenerated by setting the `RESET_RANKINGS` variable to `True`.
Setup
The code was tested with Python 3.9.0. You can install the required packages with `pip install -r requirements.txt`.
You will also need the `utils.py` file from the RankingNLPSystems repository and the `williams.py` file from the nlp-williams repository, for the Rankings DFs and Williams sections of the notebook respectively. We cannot include them in this repository for licensing reasons.
If you do not plan to run the cells of those sections, simply comment out the corresponding imports in the first cell.
Citation
@inproceedings{chhun-etal-2022-human,
title = "Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation",
author = "Chhun, Cyril and
Colombo, Pierre and
Suchanek, Fabian M. and
Clavel, Chlo{\'e}",
booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
month = oct,
year = "2022",
address = "Gyeongju, Republic of Korea",
publisher = "International Committee on Computational Linguistics",
url = "https://aclanthology.org/2022.coling-1.509",
pages = "5794--5836",
abstract = "Research on Automatic Story Generation (ASG) relies heavily on human and automatic evaluation. However, there is no consensus on which human evaluation criteria to use, and no analysis of how well automatic criteria correlate with them. In this paper, we propose to re-evaluate ASG evaluation. We introduce a set of 6 orthogonal and comprehensive human criteria, carefully motivated by the social sciences literature. We also present HANNA, an annotated dataset of 1,056 stories produced by 10 different ASG systems. HANNA allows us to quantitatively evaluate the correlations of 72 automatic metrics with human criteria. Our analysis highlights the weaknesses of current metrics for ASG and allows us to formulate practical recommendations for ASG evaluation.",
}
Acknowledgements
We thank Yejin Choi, Richard Bai, Le Fang, Jian Guan, Hannah Rashkin, David Wilmot and Eden Bensaid for answering our requests for ASG data.
Dataset
WritingPrompts (Fan et al., 2018)
Used systems
- BertGeneration (Rothe et al., 2020)
- CTRL (Keskar et al., 2019)
- GPT (Radford et al., 2018)
- GPT-2 (Radford et al., 2019)
- RoBERTa (Liu et al., 2019)
- XLNet (Yang et al., 2019)
- Fusion (Fan et al., 2018)
- HINT (Guan et al., 2021)
- TD-VAE (Wilmot et al., 2021)
Libraries
- RankingNLPSystems (Colombo et al., 2022)
- nlp-williams (Moon et al., 2019)
Funding
This work was granted access to the HPC resources of IDRIS under the allocation 2022-101838 made by GENCI and was partially funded by the grant ANR-20-CHIA-0012-01 (“NoRDF”).
Cyril, Fabian and Chloé are members of the NoRDF project.
Get Involved
Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!