Repository for pre-processing code related to generating the training datasets used in the paper.
The repository contains 4 notebooks:
- preprocessing_ner_tagging.ipynb: This notebook tags all entities in the source and summary with FLAIR for both the CNN/DailyMail and NYT datasets (a minimal tagging sketch follows this list)
- preprocessing_coref_resolution.ipynb: This notebook takes the entities from the NER tagging step and performs coreference resolution with SpanBERT so that a single entity does not yield duplicate data points (see the coreference sketch after this list)
- preprocessing_bertsum.ipynb: This notebook uses the files generated by NER tagging and coreference resolution to build the training dataset for a BERTSum model
- preprocessing_gsum.ipynb: This notebook uses the files generated by NER tagging and coreference resolution to build the training dataset for a GSum model
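As a rough illustration of the entity tagging step, the snippet below applies FLAIR's pretrained English NER tagger to a piece of text. The model name ("ner") and the example sentence are assumptions for illustration and may differ from what the notebook actually loads.

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Load FLAIR's pretrained English NER tagger (assumed model name).
tagger = SequenceTagger.load("ner")

# Wrap a source/summary sentence in a Sentence object and predict entity spans.
sentence = Sentence("Barack Obama visited Dublin, Ireland in May 2011.")
tagger.predict(sentence)

# Print each detected entity span with its predicted label.
for span in sentence.get_spans("ner"):
    print(span)
```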
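Similarly, if the coreference step relies on the AllenNLP SpanBERT coreference model, it could look roughly like the sketch below; the model archive URL is an assumption, and the notebook may load SpanBERT differently.

```python
from allennlp.predictors.predictor import Predictor
import allennlp_models.coref  # registers the coreference model components

# Assumed SpanBERT-based coreference archive; the notebook may use a different path.
predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2021.03.10.tar.gz"
)

# Each cluster groups mentions that refer to the same entity, which is what
# allows duplicate data points for a given entity to be merged.
result = predictor.predict(
    document="Barack Obama was born in Hawaii. He was elected president in 2008."
)
print(result["clusters"])
```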
CNN/DailyMail and NYT serve as the source datasets for training: the notebooks above apply the methods described in the paper to convert them into entity-centric summarization training datasets.
The EntSUM dataset is used to evaluate the effectiveness of these trained entity-centric summarization models.
- EntSUM dataset: Zenodo | HuggingFace
The EntSUM code is distributed under the Apache License (version 2.0); see the LICENSE file at the top of the source tree for more information.
Note: To run the code and download the datasets, please obtain the appropriate license for each dataset.
@inproceedings{maddela-etal-2022-entsum,
title = "{E}nt{SUM}: A Data Set for Entity-Centric Extractive Summarization",
author = "Maddela, Mounica and
Kulkarni, Mayank and
Preotiuc-Pietro, Daniel",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.237",
pages = "3355--3366",
abstract = "Controllable summarization aims to provide summaries that take into account user-specified aspects and preferences to better assist them with their information need, as opposed to the standard summarization setup which build a single generic summary of a document.We introduce a human-annotated data set EntSUM for controllable summarization with a focus on named entities as the aspects to control.We conduct an extensive quantitative analysis to motivate the task of entity-centric summarization and show that existing methods for controllable summarization fail to generate entity-centric summaries. We propose extensions to state-of-the-art summarization approaches that achieve substantially better results on our data set. Our analysis and results show the challenging nature of this task and of the proposed data set.",
}