WikiSplit Dataset

One million English sentences, each split into two sentences that together preserve the original meaning, extracted from Wikipedia edits.

http://goo.gl/language/wiki-split

Update (3 June 2019): The source code for the evaluations in our paper has been released into a separate repository: https://github.com/google-research/google-research/tree/master/wiki_split_bleu_eval.

Description

Google's WikiSplit dataset was constructed automatically from the publicly available Wikipedia revision history. Although the dataset contains some inherent noise, it can serve as valuable training data for models that split or merge sentences.

For further details about the construction of the dataset and its use in model training, see the accompanying paper: Learning To Split and Rephrase From Wikipedia Edit History.

If you use or discuss this dataset in your work, please cite our paper:

@InProceedings{BothaEtAl2018,
  title = {{Learning To Split and Rephrase From Wikipedia Edit History}},
  author = {Botha, Jan A and Faruqui, Manaal and Alex, John and Baldridge, Jason and Das, Dipanjan},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  pages = {to appear},
  note = {arXiv preprint arXiv:1808.09468},
  year = {2018}
}

Examples

  • Due to the hurricane , Lobsterfest has been canceled , making Bob very happy about it and he decides to open Bob 's Burgers for customers who were planning on going to Lobsterfest .

    • Due to the hurricane , Lobsterfest has been canceled , making Bob ecstatic .
    • He decides to open Bob 's Burgers for customers who were planning on going to Lobsterfest .
  • Her family is rumored to be a large financial clique which controls the underworld of Japan , but rarely people know the unhappiness which she suffered for being born in such a troublesome family .

    • Her family is rumored to be a large financial clique which controls the underworld of Japan .
    • People are unaware of the unhappiness which she suffered for being born in such a troublesome family .

Data format

The dataset is released as text files formatted as tab-separated values (TSV) according to the following schema:

Column  Meaning
1       unsplit single sentence
2       split-up sentences, delimited by the string <::::>

The sentences are already tokenized, with punctuation split off as separate whitespace-delimited tokens.

Example data item

Due to the hurricane , Lobsterfest has been canceled , making Bob very happy about it and he decides to open Bob 's Burgers for customers who were planning on going to Lobsterfest .	Due to the hurricane , Lobsterfest has been canceled , making Bob ecstatic . <::::> He decides to open Bob 's Burgers for customers who were planning on going to Lobsterfest .
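A record like the one above can be parsed with a short sketch like the following. The function names and the file-reading helper are illustrative, not part of the dataset release; only the tab separator and the <::::> delimiter come from the schema.

```python
def parse_wikisplit_line(line):
    """Split one TSV record into (unsplit_sentence, [split_sentences])."""
    unsplit, split_field = line.rstrip("\n").split("\t")
    # The split sentences in column 2 are delimited by the "<::::>" marker.
    splits = [s.strip() for s in split_field.split("<::::>")]
    return unsplit, splits


def read_wikisplit(path):
    """Yield parsed (unsplit, splits) records from a WikiSplit TSV file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield parse_wikisplit_line(line)
```

For example, `parse_wikisplit_line("Foo , bar .\tFoo . <::::> Bar .")` returns the unsplit sentence together with the list `["Foo .", "Bar ."]`.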

Dataset statistics

Part            Instances   Tokens*      Vocabulary*
train.tsv         989,944   33,084,465       632,588
tune.tsv            5,000      167,456        25,871
validation.tsv      5,000      166,628        25,251
test.tsv            5,000      167,853        25,386

*counted over the unsplit sentences
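The starred counts can be reproduced with a sketch like the one below. Whitespace tokenization is an assumption here (a natural choice, since the text is pre-tokenized), and the function name is illustrative.

```python
def unsplit_sentence_stats(lines):
    """Return (instances, tokens, vocabulary_size) over column 1 of TSV lines."""
    instances = 0
    tokens = 0
    vocab = set()
    for line in lines:
        # Column 1 holds the unsplit sentence; tokens are whitespace-separated.
        unsplit = line.rstrip("\n").split("\t")[0]
        words = unsplit.split()
        instances += 1
        tokens += len(words)
        vocab.update(words)
    return instances, tokens, len(vocab)
```

Running it over the lines of train.tsv should yield figures comparable to the table's first row.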

Result on WebSplit 1.0 Benchmark

Our paper introducing the WikiSplit dataset applied it to the split-and-rephrase task. The main result is that including WikiSplit during model training leads to improved generalization and dramatically better output on the WebSplit 1.0 test set.

Corpus                                              BLEU
Source (i.e. just echoing the input sentence)       58.7
Model trained on...
     WebSplit only (Aharoni & Goldberg, 2017)       30.5
     WebSplit + WikiSplit (Botha et al., 2018)*     62.4

See the paper for details.

To allow direct comparison with this work, the evaluation source code is available at: https://github.com/google-research/google-research/tree/master/wiki_split_bleu_eval.

License

The WikiSplit dataset is a verbatim copy of certain content from the publicly available Wikipedia revision history. The dataset is therefore licensed under CC BY-SA 4.0. Any third party content or data is provided "As Is" without any warranty, express or implied.

Contact

If you have a technical question regarding the dataset or publication, please create an issue in this repository.
