gentaiscool/xquad

This directory contains the Cross-lingual Question Answering Dataset (XQuAD), described in the following paper (https://arxiv.org/abs/1910.11856):

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. (2019). On the cross-lingual transferability of monolingual representations. arXiv 1910.11856.

The dataset consists of a subset of 240 context paragraphs and 1,190 question-answer pairs from the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their translations into 10 languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Please refer to the paper for a detailed description of how the dataset was created.

The files are in JSON format and contain the same fields as the original SQuAD 1.1 dataset described in the following paper:

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

The original SQuAD 1.1 dataset can be downloaded from https://github.com/rajpurkar/SQuAD-explorer/tree/master/dataset
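
Because the files share the SQuAD v1.1 layout, they can be read with the standard JSON module. The following is a minimal sketch, not part of the official release: the names `QAExample`, `iter_examples`, and `load_examples` are hypothetical helpers introduced here, and the nesting (`data` → `paragraphs` → `qas` → `answers`, with `answer_start` as a character offset into `context`) is assumed from the SQuAD v1.1 format.

```python
import json
from typing import Iterator, List, NamedTuple


class QAExample(NamedTuple):
    """One (question, answer) record; a hypothetical container for illustration."""
    qid: str
    question: str
    context: str
    answer_text: str
    answer_start: int  # character offset of the answer within `context`


def iter_examples(squad: dict) -> Iterator[QAExample]:
    """Flatten a SQuAD v1.1-style dict into one record per answer."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield QAExample(
                        qid=qa["id"],
                        question=qa["question"],
                        context=context,
                        answer_text=answer["text"],
                        answer_start=answer["answer_start"],
                    )


def load_examples(path: str) -> List[QAExample]:
    """Read one file (for example xquad.es.json) and return its flattened records."""
    with open(path, encoding="utf-8") as f:
        return list(iter_examples(json.load(f)))
```

Opening the file with an explicit `encoding="utf-8"` matters here, since most of the languages use non-ASCII scripts.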

This directory contains the following files:

Arabic: xquad.ar.json
German: xquad.de.json
Greek: xquad.el.json
English: xquad.en.json
Spanish: xquad.es.json
Hindi: xquad.hi.json
Russian: xquad.ru.json
Thai: xquad.th.json
Turkish: xquad.tr.json
Vietnamese: xquad.vi.json
Chinese: xquad.zh.json
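
The file names above follow the pattern `xquad.<language code>.json`. As a small sketch under that assumption, the hypothetical helper `xquad_path` below maps a language code to its file; the `root` parameter is an assumed location for a local copy of this directory.

```python
from pathlib import Path

# Language codes and names as listed in this README.
XQUAD_LANGUAGES = {
    "ar": "Arabic",
    "de": "German",
    "el": "Greek",
    "en": "English",
    "es": "Spanish",
    "hi": "Hindi",
    "ru": "Russian",
    "th": "Thai",
    "tr": "Turkish",
    "vi": "Vietnamese",
    "zh": "Chinese",
}


def xquad_path(lang: str, root: str = ".") -> Path:
    """Return the path of the XQuAD file for `lang` under `root` (hypothetical helper)."""
    if lang not in XQUAD_LANGUAGES:
        raise ValueError(f"unknown XQuAD language code: {lang!r}")
    return Path(root) / f"xquad.{lang}.json"
```

For example, `xquad_path("es")` yields a path ending in `xquad.es.json`, and an unlisted code such as `"fr"` raises `ValueError` rather than pointing at a file that does not exist.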

This dataset is distributed under the CC BY-SA 4.0 license (https://creativecommons.org/licenses/by-sa/4.0/legalcode).

This is not an officially supported Google product.

If you use this dataset, please cite:

@article{Artetxe:etal:2019,
      author    = {Mikel Artetxe and Sebastian Ruder and Dani Yogatama},
      title     = {On the cross-lingual transferability of monolingual representations},
      journal   = {CoRR},
      volume    = {abs/1910.11856},
      year      = {2019},
      archivePrefix = {arXiv},
      eprint    = {1910.11856}
}
