neulab / wikiasp

Code for WikiAsp: Multi-document aspect-based summarization.

WikiAsp: A Dataset for Multi-domain Aspect-based Summarization

This repository contains the dataset from the paper "WikiAsp: A Dataset for Multi-domain Aspect-based Summarization".

WikiAsp is a multi-domain, aspect-based summarization dataset in the encyclopedic domain. In this task, models are asked to summarize the cited reference documents of a Wikipedia article into aspect-based summaries. Each of the 20 domains includes 10 domain-specific, pre-defined aspects.

Dataset

Download

WikiAsp is available as 20 zipped archives, each of which corresponds to a domain. More than 28GB of storage space is needed to download and store all the domains (unzipped). The following command downloads all of them and extracts the archives:

./scripts/download_and_extract_all.sh /path/to/save_directory

Alternatively, one can download the archive for each domain individually from the table below. (Note: left-clicking the link will not open a download dialog. Open the link in a new tab, save it via your browser's context menu, or use wget.)

Domain Link Size (unzipped)
Album Download 2.3GB
Animal Download 589MB
Artist Download 2.2GB
Building Download 1.3GB
Company Download 1.9GB
EducationalInstitution Download 1.9GB
Event Download 900MB
Film Download 2.8GB
Group Download 1.2GB
HistoricPlace Download 303MB
Infrastructure Download 1.3GB
MeanOfTransportation Download 792MB
OfficeHolder Download 2.0GB
Plant Download 286MB
Single Download 1.5GB
SoccerPlayer Download 721MB
Software Download 1.3GB
TelevisionShow Download 1.1GB
Town Download 932MB
WrittenWork Download 1.8GB
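
For a single domain, the archive can also be fetched and unpacked programmatically. Below is a minimal Python sketch using only the standard library; the URL is a placeholder for the per-domain link in the table above, and the local filename and archive format are assumptions rather than details confirmed by this README:

import os
import shutil
import urllib.request

# Placeholder: substitute the per-domain "Download" link from the table above.
url = "<per-domain archive URL>"
# Keep the remote filename so the archive extension is preserved.
archive = os.path.basename(url)
urllib.request.urlretrieve(url, archive)
# unpack_archive infers the format (zip, tar.gz, tar.bz2, ...) from the extension.
shutil.unpack_archive(archive, "/path/to/save_directory")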

Format

Each domain includes three files, {train,valid,test}.jsonl, where each line represents one instance in JSON format. Each instance has the following structure:

{
    "exid": "train-1-1",
    "input": [  
        "tokenized and uncased sentence_1 from document_1",
        "tokenized and uncased sentence_2 from document_1",
        "...",
        "tokenized and uncased sentence_i from document_j",
        "..."
    ],
    "targets": [ 
        ["a_1", "tokenized and uncased aspect-based summary for a_1"],
        ["a_2", "tokenized and uncased aspect-based summary for a_2"],
        "..."
    ]
}

where:

  • exid: str
  • input: List[str]
  • targets: List[Tuple[str,str]]

Here, input contains the cited reference documents as sentences tokenized with NLTK. The targets key points to a list of aspect-based summaries, where each element is a pair of (a) the target aspect and (b) the aspect-based summary for that aspect.

Inheriting from the base corpus, this dataset exhibits the following characteristics:

  • Cited references are composed of multiple documents, but document boundaries are not preserved, so the input is expressed simply as a flat list of sentences.
  • Sentences in the cited references (input) are tokenized using NLTK.
  • The number of target summaries for each instance varies.
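
Given this format, a split can be read line by line with standard JSON tooling. The following is a minimal Python sketch, assuming a domain archive has been extracted into a local directory (the "Album" directory name below is purely illustrative):

import json
from pathlib import Path

def load_split(domain_dir, split="train"):
    """Yield WikiAsp instances from {split}.jsonl in the given domain directory."""
    path = Path(domain_dir) / f"{split}.jsonl"
    with path.open(encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Example: inspect the first training instance of an extracted domain.
instance = next(load_split("Album"))
print(instance["exid"])                               # e.g. "train-1-1"
print(len(instance["input"]))                         # number of reference sentences
print([aspect for aspect, _ in instance["targets"]])  # aspects covered by this article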

Citation

If you use the dataset, please consider citing:

@article{hayashi20tacl,
    title = {WikiAsp: A Dataset for Multi-domain Aspect-based Summarization},
    author = {Hiroaki Hayashi and Prashant Budania and Peng Wang and Chris Ackerson and Raj Neervannan and Graham Neubig},
    journal = {Transactions of the Association for Computational Linguistics (TACL)},
    url = {https://arxiv.org/abs/2011.07832},
    year = {2020}
}

LICENSE

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
