chrishokamp/wcep-mds-dataset

WCEP Dataset

Overview

The WCEP dataset for multi-document summarization (MDS) consists of short, human-written summaries about news events, obtained from the Wikipedia Current Events Portal (WCEP), each paired with a cluster of news articles associated with an event. These articles consist of sources cited by editors on WCEP, and are extended with articles automatically obtained from the Common Crawl News dataset. For more information about the dataset and experiments, check out our ACL 2020 paper: A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal.

Dataset Generation

We currently do not provide the entire dataset for download. Instead, we share the summaries from WCEP and scripts that obtain the associated news articles. Set --jobs to the number of CPUs you have available to speed up extraction. Both scripts can be interrupted and resumed by repeating the same command. To restart from scratch, add --override.

First, download the initial dataset without articles and place it, unzipped, in the data/ directory (the commands below expect data/initial_dataset.jsonl).

1) Extracting articles from WCEP

This script uses newspaper3k to extract the news articles cited on WCEP from their snapshots in the Internet Archive Wayback Machine. We previously requested snapshots of all source articles that had not yet been archived.

python extract_wcep_articles.py \
    --i data/initial_dataset.jsonl \
    --o data/wcep_articles.jsonl \
    --batchsize 200 \
    --jobs 16 \
    --repeat-failed

If any downloads fail due to timeouts, simply repeat the same command. It will only attempt to extract the missing articles.
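
The resume behaviour described above can be illustrated with a minimal sketch. This is not the code of extract_wcep_articles.py; it only shows one common way to skip work that is already done. The `id` field and the helper names `load_done_ids` and `pending` are hypothetical.

```python
import json
import os
import tempfile


def load_done_ids(path, key="id"):
    """Collect identifiers of records already written to a JSONL output file."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)[key])
            except (json.JSONDecodeError, KeyError):
                # A partially written line (e.g. after an interrupt) is ignored,
                # so the corresponding item is attempted again.
                continue
    return done


def pending(items, done_ids, key="id"):
    """Yield only the items that still need to be processed."""
    for item in items:
        if item[key] not in done_ids:
            yield item


# Tiny demonstration with made-up records (hypothetical "id" field).
tmp_dir = tempfile.mkdtemp()
out_path = os.path.join(tmp_dir, "articles.jsonl")
with open(out_path, "w") as f:
    f.write(json.dumps({"id": "a1", "text": "..."}) + "\n")
    f.write('{"id": "a2", "te')  # simulated truncated line from an interrupted run

todo = [{"id": "a1"}, {"id": "a2"}, {"id": "a3"}]
remaining = [item["id"] for item in pending(todo, load_done_ids(out_path))]
print(remaining)
```

In this sketch, rerunning after an interruption only touches `a2` and `a3`, because `a1` is already present in the output file.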

2) Extracting articles from Common Crawl

This script extracts articles from Common Crawl News, which is divided into roughly 6000 files of about 1 GB each. These files are downloaded and searched one at a time. The relevant articles are extracted from HTML in parallel using newspaper3k.

python extract_cc_articles.py \
    --storage data/cc_storage \
    --dataset data/initial_dataset.jsonl \
    --batchsize 200 \
    --max-cluster-size 100 \
    --jobs 16

This process takes a long time (a few days!). We are working on speeding it up. Setting --max-cluster-size 100 already reduces the time, because at most 100 articles per cluster are extracted. This corresponds to the dataset version used in the experiments in our paper ("WCEP-100").
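
To make the effect of the size cap concrete, here is a small sketch of truncating a cluster to a maximum size. It assumes, purely for illustration, that editor-cited WCEP articles are kept first and that Common Crawl articles fill the remaining slots; whether the scripts prioritise articles this way is an assumption, and `cap_cluster` is a hypothetical helper.

```python
def cap_cluster(wcep_articles, cc_articles, max_cluster_size=100):
    """Return at most max_cluster_size articles for one event cluster.

    Assumption: editor-cited articles come first, Common Crawl articles
    are only used to fill the remaining room.
    """
    kept = list(wcep_articles[:max_cluster_size])
    room = max_cluster_size - len(kept)
    kept.extend(cc_articles[:room])
    return kept


# Made-up example: 3 cited articles and 200 Common Crawl candidates.
cited = ["wcep-%d" % i for i in range(3)]
crawled = ["cc-%d" % i for i in range(200)]
cluster = cap_cluster(cited, crawled, max_cluster_size=100)
print(len(cluster), cluster[:4])
```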

3) Combine and split

Finally, we need to group the articles and summaries that belong together, and split the dataset into separate train/validation/test files. If --max-cluster-size was used in the previous step, pass the same value here.

python combine_and_split.py \
    --dataset data/initial_dataset.jsonl \
    --cc-articles data/cc_storage/cc_articles.jsonl \
    --wcep-articles data/wcep_articles.jsonl \
    --max-cluster-size 100 \
    --o data/wcep_dataset
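
Conceptually, this step is a group-by followed by a partition. The sketch below shows the idea on made-up records; it is not the implementation of combine_and_split.py. The field names `id`, `event_id` and `split`, and the functions `combine` and `split_by_label`, are hypothetical.

```python
from collections import defaultdict


def combine(events, articles, max_cluster_size=None):
    """Attach to each event the articles that reference it."""
    by_event = defaultdict(list)
    for article in articles:
        by_event[article["event_id"]].append(article)
    combined = []
    for event in events:
        cluster = by_event.get(event["id"], [])
        if max_cluster_size is not None:
            cluster = cluster[:max_cluster_size]
        combined.append({**event, "articles": cluster})
    return combined


def split_by_label(records, label_key="split"):
    """Partition records by a precomputed split label (assumed to exist)."""
    splits = defaultdict(list)
    for record in records:
        splits[record[label_key]].append(record)
    return dict(splits)


# Made-up input: two events and three articles.
events = [
    {"id": 1, "summary": "s1", "split": "train"},
    {"id": 2, "summary": "s2", "split": "val"},
]
articles = [
    {"event_id": 1, "text": "a"},
    {"event_id": 1, "text": "b"},
    {"event_id": 2, "text": "c"},
]
splits = split_by_label(combine(events, articles, max_cluster_size=1))
print({name: len(records) for name, records in splits.items()})
```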

Loading the dataset

We store the dataset in JSON Lines (jsonl) format, where each line corresponds to a news event with its summary, a cluster of news articles, and some metadata such as date and category. The summarization task is to reconstruct the summary from the news articles.

import json

def read_jsonl(path):
    with open(path) as f:
        for line in f:
            yield json.loads(line)

val_data = list(read_jsonl('data/wcep_dataset/val.jsonl'))
c = val_data[404]
summary = c['summary'] # human-written summary
articles = c['articles'] # cluster of articles
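
As a usage example, the following sketch computes a few basic statistics over loaded clusters. It relies only on the `summary` and `articles` keys shown above; the `cluster_stats` helper and the temporary sample file are illustrative and not part of the repository.

```python
import json
import os
import tempfile
from statistics import mean


def read_jsonl(path):
    with open(path) as f:
        for line in f:
            yield json.loads(line)


def cluster_stats(clusters):
    """Summarise cluster sizes and summary lengths (whitespace tokens)."""
    sizes = [len(c["articles"]) for c in clusters]
    summary_lengths = [len(c["summary"].split()) for c in clusters]
    return {
        "n_clusters": len(clusters),
        "mean_cluster_size": mean(sizes),
        "max_cluster_size": max(sizes),
        "mean_summary_tokens": mean(summary_lengths),
    }


# Made-up sample file standing in for e.g. data/wcep_dataset/val.jsonl.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(sample_path, "w") as f:
    f.write(json.dumps({"summary": "A short summary.", "articles": ["x"]}) + "\n")
    f.write(json.dumps({"summary": "Another one here now.", "articles": ["x", "y", "z"]}) + "\n")

stats = cluster_stats(list(read_jsonl(sample_path)))
print(stats)
```

For the real files, replace `sample_path` with the path of a split produced by the combine-and-split step.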

Citation

If you find this dataset useful, please cite:

@inproceedings{gholipour-ghalandari-etal-2020-large,
    title = "A Large-Scale Multi-Document Summarization Dataset from the {W}ikipedia Current Events Portal",
    author = "Gholipour Ghalandari, Demian  and
      Hokamp, Chris  and
      Pham, Nghia The  and
      Glover, John  and
      Ifrim, Georgiana",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.120",
    pages = "1302--1308",
}
