HANNA Benchmark Repository
Resources for the paper "Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation", accepted at COLING 2022.
Authors: Cyril Chhun, Pierre Colombo, Fabian Suchanek and Chloé Clavel.
Updates
24/08/2022 - Initial commit
Data
We release in this repository HANNA, a large annotated dataset of Human-ANnotated NArratives for ASG evaluation. HANNA contains annotations for 1,056 stories generated from 96 prompts from the WritingPrompts dataset. Each story was annotated by 3 raters on 6 criteria (Relevance, Coherence, Empathy, Surprise, Engagement and Complexity), for a grand total of 19,008 annotations. Additionally, we release the scores of those 1,056 stories evaluated by 72 automatic metrics.
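The counts above are consistent with each other: 96 prompts times 11 story sources (the 10 ASG systems plus the human reference stories) give 1,056 stories, and 3 raters times 6 criteria per story give 19,008 annotations. A quick sanity check:

```python
# Sanity check of the dataset sizes quoted above.
prompts = 96
sources = 10 + 1   # 10 ASG systems plus the human reference stories
raters = 3
criteria = 6

stories = prompts * sources                 # 1,056 stories
annotations = stories * raters * criteria   # 19,008 annotations
print(stories, annotations)
```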
`hanna_stories_annotations.csv` contains the raw annotations from our experiment.
- `Story ID` is the ID of the story (from 0 to 1,055). Stories are grouped by model (0 to 95 are the Human stories, 96 to 191 are the BertGeneration stories, etc.).
- `Prompt` is the prompt.
- `Human` is the corresponding human story.
- `Story` is the generated story.
- `Model` is the model used to generate the story.
- `Relevance` is the Relevance (RE) score.
- `Coherence` is the Coherence (CH) score.
- `Empathy` is the Empathy (EM) score.
- `Surprise` is the Surprise (SU) score.
- `Engagement` is the Engagement (EG) score.
- `Complexity` is the Complexity (CX) score.
- `Worker ID` is the ID of the mTurk worker.
- `Assignment ID` is the ID of the mTurk assignment.
- `Work time in seconds` is the time the worker spent on the assignment, in seconds.
- `Name` is the name entered by the worker for the first character mentioned in the story.
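As an illustration of how these columns can be aggregated (a minimal pandas sketch over mock rows that follow the schema above, not the released file), the 3 ratings per story can be averaged per story and then per model:

```python
import pandas as pd

# Mock rows following the schema of hanna_stories_annotations.csv
# (a subset of columns, with made-up scores for illustration).
criteria = ["Relevance", "Coherence", "Empathy",
            "Surprise", "Engagement", "Complexity"]
rows = [
    {"Story ID": 0, "Model": "Human",
     "Relevance": 4, "Coherence": 5, "Empathy": 3,
     "Surprise": 2, "Engagement": 4, "Complexity": 3},
    {"Story ID": 0, "Model": "Human",
     "Relevance": 5, "Coherence": 4, "Empathy": 4,
     "Surprise": 3, "Engagement": 4, "Complexity": 4},
    {"Story ID": 96, "Model": "BertGeneration",
     "Relevance": 2, "Coherence": 2, "Empathy": 1,
     "Surprise": 2, "Engagement": 2, "Complexity": 2},
]
df = pd.DataFrame(rows)

# Average the per-rater scores for each story, then per model.
per_story = df.groupby(["Story ID", "Model"])[criteria].mean()
per_model = per_story.groupby("Model")[criteria].mean()
print(per_model.round(2))
```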
`hanna_metrics_scores.csv` contains, per system, the averaged human annotations and the scores of the automatic metrics for each story. For instance, row 2 contains the scores of the stories generated by the BertGeneration model: each list in that row holds the scores of stories 96 to 191 for one metric.
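The notebook uses these per-system scores to correlate automatic metrics with human criteria. As a self-contained sketch (standard library only, with made-up per-system averages rather than the released scores), Kendall's tau between one metric and one criterion can be computed like this:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall rank correlation between two equal-length score lists
    (no tie correction, for illustration only)."""
    pairs = list(combinations(range(len(xs)), 2))
    concordant = sum((xs[i] - xs[j]) * (ys[i] - ys[j]) > 0 for i, j in pairs)
    discordant = sum((xs[i] - xs[j]) * (ys[i] - ys[j]) < 0 for i, j in pairs)
    return (concordant - discordant) / len(pairs)

# Made-up per-system averages (hypothetical numbers; the real ones are in
# hanna_metrics_scores.csv): one human criterion vs. one automatic metric.
coherence = [4.1, 2.3, 3.0, 3.6, 2.8]
some_metric = [0.35, 0.12, 0.20, 0.28, 0.15]
print(kendall_tau(coherence, some_metric))  # the two rankings agree perfectly here
```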
Samples
Story ID | Prompt | Human | Story | Model | RE | CH | EM | SU | EG | CX |
---|---|---|---|---|---|---|---|---|---|---|
99 | Write a story about an elderly wizard and his young female apprentice. | His body was failing. He had taken care of it very well, but 205 years were a long time. Not a drop of alcohol all those long and lonely tavern nights, not a crumb of tobacco for the old pipe. [...] | “Follow me,” his mentor said. “I must stop this wizard.” At that, Tawthorn drew his dagger and leaned towards the woman. “If you were correct, my professor was right. You could change the world, and save the kingdom.” [...] | BertGeneration | 3 | 2 | 2 | 2 | 2 | 3 |
519 | You are a immortal during the zombie apocalypse, During the apocalypse the zombies ignore you and you try to live a normal life during the outbreak. | 50 years is a long time, enough time to go crazy and return sane. I remember before it happened, the CDC joked that they would have a cure “within a week” [...] | After a few weeks of running, you see something inside a tube/pulse generator. I woke up groggy. The day was Monday, it was Tuesday. How was my day going so fast? [...] | GPT-2 | 5 | 5 | 3 | 4 | 4 | 4 |
862 | When a new president is elected, they are given a special security briefing. In reality, this is an old tradition where various directors, military officers and current ministers present fake evidence and compete to see who can convince the president of the most ridiculous things. [...] | “Mr President I want you to know I am telling you this in full confidence .” Said the head of the Secret Service. The President looked at him. “Yes go ahead .” [...] | “Mr. President, you can see this! You know what the problem is. You see, President Obama, in the US, has been working on the latest model of the President 's campaign for over two years! [...] | Fusion | 2 | 1 | 1 | 1 | 1 | 1 |
Jupyter Notebook
We also provide the Jupyter notebook `data_visualization.ipynb` containing the code we used to generate our results. It also allows for easier visualisation of the data from the `csv` files.
The folder `rankings` contains `csv` files used in the Ranking DFs section of the notebook. They can be regenerated by setting the `RESET_RANKINGS` variable to `True`.
Setup
The code was tested with Python 3.9.0. You can install the required packages with `pip install -r requirements.txt`.
You will also need the `utils.py` file from the RankingNLPSystems repository and the `williams.py` file from the nlp-williams repository, for the Rankings DFs and Williams sections of the notebook respectively. We cannot include them in this repository for licensing reasons.
If you do not plan to run the cells of those sections, simply comment out the corresponding imports in the first cell.
Citation
@inproceedings{chhun-etal-2022-human,
title = "Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation",
author = "Chhun, Cyril and
Colombo, Pierre and
Suchanek, Fabian M. and
Clavel, Chlo{\'e}",
booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
month = oct,
year = "2022",
address = "Gyeongju, Republic of Korea",
publisher = "International Committee on Computational Linguistics",
url = "https://aclanthology.org/2022.coling-1.509",
pages = "5794--5836",
abstract = "Research on Automatic Story Generation (ASG) relies heavily on human and automatic evaluation. However, there is no consensus on which human evaluation criteria to use, and no analysis of how well automatic criteria correlate with them. In this paper, we propose to re-evaluate ASG evaluation. We introduce a set of 6 orthogonal and comprehensive human criteria, carefully motivated by the social sciences literature. We also present HANNA, an annotated dataset of 1,056 stories produced by 10 different ASG systems. HANNA allows us to quantitatively evaluate the correlations of 72 automatic metrics with human criteria. Our analysis highlights the weaknesses of current metrics for ASG and allows us to formulate practical recommendations for ASG evaluation.",
}
Acknowledgements
We thank Yejin Choi, Richard Bai, Le Fang, Jian Guan, Hannah Rashkin, David Wilmot and Eden Bensaid for answering our requests for ASG data.
Dataset
WritingPrompts (Fan et al., 2018)
Used systems
- BertGeneration (Rothe et al., 2020)
- CTRL (Keskar et al., 2019)
- GPT (Radford et al., 2018)
- GPT-2 (Radford et al., 2019)
- RoBERTa (Liu et al., 2019)
- XLNet (Yang et al., 2019)
- Fusion (Fan et al., 2018)
- HINT (Guan et al., 2021)
- TD-VAE (Wilmot et al., 2021)
Libraries
- RankingNLPSystems (Colombo et al., 2022)
- nlp-williams (Moon et al., 2019)
Funding
This work was granted access to the HPC resources of IDRIS under the allocation 2022-101838 made by GENCI and was partially funded by the grant ANR-20-CHIA-0012-01 (“NoRDF”).
Cyril, Fabian and Chloé are members of the NoRDF project.
Get Involved
Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!