neulab / wikiasp

Code for WikiAsp: Multi-document aspect-based summarization.

WikiAsp: A Dataset for Multi-domain Aspect-based Summarization

This repository contains the dataset from the paper "WikiAsp: A Dataset for Multi-domain Aspect-based Summarization".

WikiAsp is a multi-domain, aspect-based summarization dataset in the encyclopedic domain. In this task, models are asked to summarize the cited reference documents of a Wikipedia article into aspect-based summaries. Each of the 20 domains includes 10 domain-specific, pre-defined aspects.

Dataset

Download

WikiAsp is available as 20 zipped archives, each of which corresponds to a domain. More than 28GB of storage space is needed to download and store all the domains (unzipped). The following command downloads all of them and extracts the archives:

./scripts/download_and_extract_all.sh /path/to/save_directory

Alternatively, one can download the archive for each domain individually from the table below. (Note: left-clicking the link will not open a download dialog. Open the link in a new tab, save it via your browser's context menu, or use wget.)

Domain Link Size (unzipped)
Album Download 2.3GB
Animal Download 589MB
Artist Download 2.2GB
Building Download 1.3GB
Company Download 1.9GB
EducationalInstitution Download 1.9GB
Event Download 900MB
Film Download 2.8GB
Group Download 1.2GB
HistoricPlace Download 303MB
Infrastructure Download 1.3GB
MeanOfTransportation Download 792MB
OfficeHolder Download 2.0GB
Plant Download 286MB
Single Download 1.5GB
SoccerPlayer Download 721MB
Software Download 1.3GB
TelevisionShow Download 1.1GB
Town Download 932MB
WrittenWork Download 1.8GB
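
For a single domain, the archive can also be fetched and unpacked programmatically. Below is a minimal Python sketch using only the standard library; the URL is a placeholder for the per-domain link in the table above, and the local filename and archive format are assumptions rather than details confirmed by this README:

import os
import shutil
import urllib.request

# Placeholder: substitute the per-domain "Download" link from the table above.
url = "<per-domain archive URL>"
# Keep the remote filename so the archive extension is preserved.
archive = os.path.basename(url)
urllib.request.urlretrieve(url, archive)
# unpack_archive infers the format (zip, tar.gz, tar.bz2, ...) from the extension.
shutil.unpack_archive(archive, "/path/to/save_directory")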

Format

Each domain includes three files, {train,valid,test}.jsonl, where each line represents one instance in JSON format. Each instance has the following structure:

{
    "exid": "train-1-1",
    "input": [  
        "tokenized and uncased sentence_1 from document_1",
        "tokenized and uncased sentence_2 from document_1",
        "...",
        "tokenized and uncased sentence_i from document_j",
        "..."
    ],
    "targets": [ 
        ["a_1", "tokenized and uncased aspect-based summary for a_1"],
        ["a_2", "tokenized and uncased aspect-based summary for a_2"],
        "..."
    ]
}

where:

  • exid: str
  • input: List[str]
  • targets: List[Tuple[str,str]]

Here, input contains the cited reference documents as sentences tokenized with NLTK. The targets key points to a list of aspect-based summaries, where each element is a pair of (a) the target aspect and (b) the aspect-based summary for that aspect.

Inheriting from the base corpus, this dataset exhibits the following characteristics:

  • Cited references are composed of multiple documents, but document boundaries are not preserved, so the input is expressed simply as a flat list of sentences.
  • Sentences in the cited references (input) are tokenized using NLTK.
  • The number of target summaries for each instance varies.
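
Given this format, a split can be read line by line with standard JSON tooling. The following is a minimal Python sketch, assuming a domain archive has been extracted into a local directory (the "Album" directory name below is purely illustrative):

import json
from pathlib import Path

def load_split(domain_dir, split="train"):
    """Yield WikiAsp instances from {split}.jsonl in the given domain directory."""
    path = Path(domain_dir) / f"{split}.jsonl"
    with path.open(encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Example: inspect the first training instance of an extracted domain.
instance = next(load_split("Album"))
print(instance["exid"])                               # e.g. "train-1-1"
print(len(instance["input"]))                         # number of reference sentences
print([aspect for aspect, _ in instance["targets"]])  # aspects covered by this article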

Citation

If you use the dataset, please consider citing:

@article{hayashi20tacl,
    title = {WikiAsp: A Dataset for Multi-domain Aspect-based Summarization},
    author = {Hiroaki Hayashi and Prashant Budania and Peng Wang and Chris Ackerson and Raj Neervannan and Graham Neubig},
    journal = {Transactions of the Association for Computational Linguistics (TACL)},
    url = {https://arxiv.org/abs/2011.07832},
    year = {2020}
}

LICENSE

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
