Repository for pre-processing code related to generating the training datasets used in the paper.
The repository contains 4 notebooks:
- preprocessing_ner_tagging.ipynb: This notebook tags all entities in the source and summary with FLAIR for both the CNN/DailyMail and NYT datasets (a minimal tagging sketch follows this list)
- preprocessing_coref_resolution.ipynb: This notebook takes the entities from the NER tagging step and performs coreference resolution with SpanBERT so that a single entity does not yield duplicate data points (see the coreference sketch after this list)
- preprocessing_bertsum.ipynb: This notebook uses the files generated by NER tagging and coreference resolution to build the training dataset for a BERTSum model
- preprocessing_gsum.ipynb: This notebook uses the files generated by NER tagging and coreference resolution to build the training dataset for a GSum model
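As a rough illustration of the entity tagging step, the snippet below applies FLAIR's pretrained English NER tagger to a piece of text. The model name ("ner") and the example sentence are assumptions for illustration and may differ from what the notebook actually loads.

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Load FLAIR's pretrained English NER tagger (assumed model name).
tagger = SequenceTagger.load("ner")

# Wrap a source/summary sentence in a Sentence object and predict entity spans.
sentence = Sentence("Barack Obama visited Dublin, Ireland in May 2011.")
tagger.predict(sentence)

# Print each detected entity span with its predicted label.
for span in sentence.get_spans("ner"):
    print(span)
```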
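Similarly, if the coreference step relies on the AllenNLP SpanBERT coreference model, it could look roughly like the sketch below; the model archive URL is an assumption, and the notebook may load SpanBERT differently.

```python
from allennlp.predictors.predictor import Predictor
import allennlp_models.coref  # registers the coreference model components

# Assumed SpanBERT-based coreference archive; the notebook may use a different path.
predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2021.03.10.tar.gz"
)

# Each cluster groups mentions that refer to the same entity, which is what
# allows duplicate data points for a given entity to be merged.
result = predictor.predict(
    document="Barack Obama was born in Hawaii. He was elected president in 2008."
)
print(result["clusters"])
```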
CNN/DailyMail and NYT serve as the source datasets for training: the notebooks above apply the methods described in the paper to convert them into entity-centric summarization training datasets.
The EntSUM dataset is used to evaluate the effectiveness of these trained entity-centric summarization models.
- EntSUM dataset: Zenodo | HuggingFace
The EntSUM code is distributed under the Apache License (version 2.0); see the LICENSE file at the top of the source tree for more information.
Note: To run the code and download the datasets, please obtain the appropriate license for each dataset.
@inproceedings{maddela-etal-2022-entsum,
title = "{E}nt{SUM}: A Data Set for Entity-Centric Extractive Summarization",
author = "Maddela, Mounica and
Kulkarni, Mayank and
Preotiuc-Pietro, Daniel",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.237",
pages = "3355--3366",
abstract = "Controllable summarization aims to provide summaries that take into account user-specified aspects and preferences to better assist them with their information need, as opposed to the standard summarization setup which build a single generic summary of a document.We introduce a human-annotated data set EntSUM for controllable summarization with a focus on named entities as the aspects to control.We conduct an extensive quantitative analysis to motivate the task of entity-centric summarization and show that existing methods for controllable summarization fail to generate entity-centric summaries. We propose extensions to state-of-the-art summarization approaches that achieve substantially better results on our data set. Our analysis and results show the challenging nature of this task and of the proposed data set.",
}