This repository provides the dataset used in "ExPUNations: Augmenting puns with keywords and explanations" by Jiao Sun, Anjali Narayan-Chen, Shereen Oraby, Alessandra Cervone, Tagyoung Chung, Jing Huang, Yang Liu, and Nanyun Peng.

ExPUNations: Augmenting puns with keywords and explanations

Overview

This repository includes the annotated dataset from "ExPUNations: Augmenting puns with keywords and explanations" appearing at EMNLP 2022 (paper available on amazon.science or arXiv).

The original SemEval 2017 Task 7 dataset (Miller et al., 2017) contains puns that are either homographic (exploiting polysemy) or heterographic (exploiting phonological similarity to another word), along with examples of non-pun text. We draw 1,999 text samples from SemEval 2017 Task 7 as the basis for our humor annotation. Full details on the data collection can be found in the paper (see the Citation section).

Sample Instance

The excerpt below shows a sample data instance:

{
        "ID": "hom_362",
        "Understand the text?": [
            1,
            1,
            1,
            1,
            0
        ],
        "Offensive/Inappropriate?": [
            0,
            1,
            0,
            0,
            ""
        ],
        "Is a Joke?": [
            1,
            0,
            1,
            0,
            0
        ],
        "Funniness (1-5)": [
            2,
            0,
            1,
            0,
            0
        ],
        "Natural language explanation": [
            "Talking about being true as in being real or they will be fake/false teeth ",
            "",
            "False teeth are something people who lose their teeth may have, and being true to your teeth may be a way of saying take care of them otherwise you'll lose them. ",
            "",
            ""
        ],
        "Joke keywords": [
            [
                "true",
                "teeth",
                "false"
            ],
            [
                ""
            ],
            [
                "be true",
                "teeth",
                "false to you"
            ],
            [
                ""
            ],
            [
                ""
            ]
        ],
        "Annotator_IDs": [
            0,
            1,
            2,
            3,
            4
        ]
}

Description of Fields

  • ID: ID of the original text in the SemEval 2017 Task 7 dataset.
  • Understand the text? (AF1): Whether each annotator understood the text or not, regardless of whether they perceived it as funny (1 for understand, 0 for didn't understand).
  • Offensive/Inappropriate? (AF2): Whether each annotator found the text offensive or inappropriate (1 for offensive/inappropriate, 0 otherwise). May contain missing values as empty strings if an annotator didn't understand the text.
  • Is a Joke? (AF3): Whether each annotator thought the text was intended as a joke (1 for joke, 0 for non-joke). Defaults to 0 for any text that an annotator couldn't understand or rated as offensive/inappropriate.
  • Funniness (1-5) (AF4): Each annotator's rating of the joke's funniness on a 1-5 Likert scale, where 1 means not at all funny and 5 means very funny. Defaults to 0 if the annotator rated the text as "not a joke".
  • Natural language explanation (AF5): Each annotator's explanation in concise natural language about why this joke is funny. Represented as an empty string if an annotator rated the text as not a joke.
  • Joke keywords (AF6): Keyword phrases each annotator selected from the joke that relate to the punchline or the reason the joke is funny, each set represented as a list of strings. Defaults to a list containing an empty string if the annotator rated the text as "not a joke".
  • Annotator_IDs: Anonymized annotator IDs corresponding to each annotator.
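Given the field layout above, the per-annotator lists can be aggregated in a few lines. The sketch below shows one plausible way to do this (the helper names are illustrative, not part of the release); it skips the empty-string placeholders and the 0 defaults described above:

```python
import statistics

def majority_is_joke(instance):
    """Majority vote over the per-annotator "Is a Joke?" labels (AF3)."""
    votes = [v for v in instance["Is a Joke?"] if v in (0, 1)]
    return sum(votes) > len(votes) / 2

def mean_funniness(instance):
    """Mean funniness among annotators who rated the text a joke (AF4 > 0)."""
    ratings = [r for r in instance["Funniness (1-5)"]
               if isinstance(r, int) and r > 0]
    return statistics.mean(ratings) if ratings else None

# Applied to the sample instance above:
sample = {"Is a Joke?": [1, 0, 1, 0, 0], "Funniness (1-5)": [2, 0, 1, 0, 0]}
print(majority_is_joke(sample))  # False: only 2 of 5 annotators labeled it a joke
print(mean_funniness(sample))    # 1.5: mean of the nonzero ratings (2 and 1)
```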

Data Files

In this repository, we release the full dataset of 1,899 annotated samples (5 annotators per sample). We also release the first 100 samples, which were annotated as a calibration pilot (3 annotators per sample; these are missing AF3, which was added in later rounds of annotation).

├── data
│   ├── expunations_annotated_full.json (full dataset)
│   └── expunations_annotated_pilot_100.json (pilot)
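Assuming each file is a JSON array of instances shaped like the sample above, a minimal loading sketch (the helper name is illustrative) looks like:

```python
import json
from pathlib import Path

def load_expunations(path):
    """Load a released JSON file into a list of annotated instances.

    Assumes the file is a JSON array of objects keyed by the fields
    described above ("ID", "Is a Joke?", "Funniness (1-5)", ...).
    """
    return json.loads(Path(path).read_text(encoding="utf-8"))

# Typical usage from the repository root:
# instances = load_expunations("data/expunations_annotated_full.json")
# print(len(instances))  # number of annotated samples
```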

Security

See CONTRIBUTING for more information.

License

This library is licensed under the CC-BY-NC-4.0 License (see LICENSE).

Citation

If you use this dataset in your work, please cite the following papers:

@inproceedings{sun2022expun,
  title = {Ex{PUN}ations: Augmenting Puns with Keywords and Explanations},
  author = {Sun, Jiao and Narayan-Chen, Anjali and Oraby, Shereen and Cervone, Alessandra and Chung, Tagyoung and Huang, Jing and Liu, Yang and Peng, Nanyun},
  booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year = {2022}
}
@inproceedings{miller-etal-2017-semeval,
    title = "{S}em{E}val-2017 Task 7: Detection and Interpretation of {E}nglish Puns",
    author = "Miller, Tristan  and
      Hempelmann, Christian  and
      Gurevych, Iryna",
    booktitle = "Proceedings of the 11th International Workshop on Semantic Evaluation ({S}em{E}val-2017)",
    month = aug,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/S17-2005",
    doi = "10.18653/v1/S17-2005",
    pages = "58--68",
    abstract = "A pun is a form of wordplay in which a word suggests two or more meanings by exploiting polysemy, homonymy, or phonological similarity to another word, for an intended humorous or rhetorical effect. Though a recurrent and expected feature in many discourse types, puns stymie traditional approaches to computational lexical semantics because they violate their one-sense-per-context assumption. This paper describes the first competitive evaluation for the automatic detection, location, and interpretation of puns. We describe the motivation for these tasks, the evaluation methods, and the manually annotated data set. Finally, we present an overview and discussion of the participating systems{'} methodologies, resources, and results.",
}
