yuanheTian / Datasets-for-Question-Answering

Will be updated continuously

Datasets for Question Answering (QA)

A collection of datasets used in question answering (QA) tasks in Natural Language Processing (NLP). Datasets are grouped by language and sorted by year of publication.

English

1. MCTEST

  • Dataset: https://mattr1.github.io/mctest/data.html
  • Publication: https://aclanthology.org/D13-1020.pdf (2013, EMNLP)
  • Abstract: MCTest is a freely available set of stories and associated questions intended for research on the machine comprehension of text. MCTest requires machines to answer multiple-choice reading comprehension questions about fictional stories, directly tackling the high-level goal of open-domain machine comprehension.

2. WikiQA

3. SQuAD (v1.0)

4. CNN/Daily Mail

  • Dataset: https://github.com/abisee/cnn-dailymail
  • Publication: https://arxiv.org/pdf/1602.06023v5.pdf (2016, CONLL)
  • Abstract: CNN/Daily Mail is a dataset for text summarization. The corpus has 286,817 training pairs, 13,368 validation pairs and 11,487 test pairs, as defined by their scripts. The source documents in the training set average 766 words spanning 29.74 sentences, while the summaries average 53 words and 3.72 sentences.

5. CHILDREN’S BOOK TEST (CBT)

6. BOOK TEST (BT)

7. TriviaQA

8. RACE

  • Dataset: https://www.cs.cmu.edu/~glai1/data/race/
  • Publication: https://arxiv.org/pdf/1704.04683v5.pdf (2017, EMNLP)
  • Abstract: Consists of 27,933 passages and 97,867 questions from English exams for Chinese students aged 12-18. RACE consists of two subsets, RACE-M and RACE-H, drawn from middle school and high school exams, respectively. RACE-M has 28,293 questions and RACE-H has 69,574. Each question is associated with 4 candidate answers, one of which is correct.

9. NewsQA

10. SearchQA

11. NarrativeQA

12. SQuAD (v2.0)

13. AI2 Reasoning Challenge (ARC)

  • Dataset: https://allenai.org/data/arc
  • Publication: https://arxiv.org/pdf/1803.05457v1.pdf (2018, arXiv)
  • Abstract: A multiple-choice question-answering dataset, containing questions from science exams from grade 3 to grade 9. The dataset is split in two partitions: Easy and Challenge. Most of the questions have 4 answer choices, with <1% of all the questions having either 3 or 5 answer choices. ARC includes a supporting KB of 14.3M unstructured text passages.

14. Natural Questions

15. MS Marco

  • Dataset: https://microsoft.github.io/msmarco/
  • Publication: https://arxiv.org/pdf/1611.09268v3.pdf (2019, NeurIPS)
  • Abstract: The first dataset was a question answering dataset featuring 100,000 real Bing questions with human-generated answers. Over time the collection was extended with a 1,000,000-question dataset, a natural language generation dataset, a passage ranking dataset, a keyphrase extraction dataset, a crawling dataset, and a conversational search task.

16. CoQA

  • Dataset: https://stanfordnlp.github.io/coqa/
  • Publication: https://arxiv.org/pdf/1808.07042v2.pdf (2019, TACL)
  • Abstract: CoQA is a dataset for Conversational Question Answering systems, containing 127,000+ questions with answers collected from 8,000+ conversations. Each conversation is collected by pairing two crowdworkers to chat about a passage in the form of questions and answers.

Chinese

1. DRCD

2. MATINF

  • Dataset: https://github.com/WHUIR/MATINF
  • Publication: https://arxiv.org/pdf/2004.12302v2.pdf (2020, ACL)
  • Abstract: A jointly labeled dataset for classification, question answering and summarization in the domain of maternity and baby care in Chinese. An entry in the dataset includes four fields: question (Q), description (D), class (C) and answer (A).
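The four-field entry structure (Q, D, C, A) maps naturally onto a small record type. The sketch below assumes one JSON object per entry with illustrative field names (`question`, `description`, `class`, `answer`); check the MATINF repository for the actual schema.

```python
from dataclasses import dataclass
import json

@dataclass
class MatinfEntry:
    question: str     # Q: the user's question
    description: str  # D: a longer description accompanying the question
    category: str     # C: the class label ("class" is a Python keyword, so renamed)
    answer: str       # A: the reference answer

def parse_entry(line: str) -> MatinfEntry:
    # Field names are assumptions for illustration, not the confirmed MATINF schema.
    record = json.loads(line)
    return MatinfEntry(
        question=record["question"],
        description=record["description"],
        category=record["class"],
        answer=record["answer"],
    )

sample = json.dumps({
    "question": "Q text",
    "description": "D text",
    "class": "C label",
    "answer": "A text",
})
entry = parse_entry(sample)
```

Holding all four fields in one record makes it easy to reuse the same file for the classification (Q, D → C), QA (Q, D → A), and summarization (D → Q) tasks the dataset is jointly labeled for.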

3. LiveQA

  • Dataset: https://github.com/PKU-TANGENT/LiveQA
  • Publication: https://arxiv.org/pdf/2010.00526.pdf (2020, CCL)
  • Abstract: It contains 117k multiple-choice questions written by human commentators for over 1,670 NBA games, which are collected from the Chinese Hupu website. In LiveQA, the questions require understanding the timeline, tracking events or doing mathematical computations.

Other Languages

1. JaQuAD

2. GermanQuAD

3. DaNetQA

4. MuSeRC

5. RuCoS

6. HeadQA

7. FQuAD

  • Dataset: https://fquad.illuin.tech/
  • Publication: https://arxiv.org/pdf/2002.06071v2.pdf (2020, EMNLP)
  • Coverage: French
  • Abstract: A French Native Reading Comprehension dataset of questions and answers on a set of Wikipedia articles that consists of 25,000+ samples for the 1.0 version and 60,000+ samples for the 1.1 version.

8. KLEJ

9. MilkQA

10. PersianQA

  • Dataset: https://github.com/sajjjadayobi/PersianQA
  • Publication: None
  • Coverage: Persian
  • Abstract: Based on Persian Wikipedia, the crowd-sourced dataset consists of more than 9,000 entries. Each entry is either an impossible-to-answer question or a question with one or more answers spanning the passage (the context) from which the question was proposed.

Multi-lingual

1. FM-IQA

2. SuperGLUE

3. XQA

4. XQuAD

  • Dataset: https://github.com/deepmind/xquad
  • Publication: https://arxiv.org/pdf/1910.11856v3.pdf (2020, ACL)
  • Coverage: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi.
  • Abstract: XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering performance. The dataset consists of a subset of 240 paragraphs and 1,190 question-answer pairs from the development set of SQuAD v1.1, together with their professional translations into ten languages.
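Because XQuAD (like MLQA below) distributes its data in SQuAD format, a single reader works across all of its languages. A minimal sketch, assuming the standard SQuAD v1.1 JSON schema (`data` → `paragraphs` → `qas`); the inline sample stands in for a downloaded XQuAD file.

```python
def iter_qa_pairs(squad_json: dict):
    """Yield (context, question, answer_texts) triples from SQuAD-format data."""
    for article in squad_json["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answers = [a["text"] for a in qa["answers"]]
                yield context, qa["question"], answers

# Tiny hand-made sample in SQuAD v1.1 layout, used here in place of a real file.
sample = {
    "data": [{
        "paragraphs": [{
            "context": "Paris is the capital of France.",
            "qas": [{
                "id": "q1",
                "question": "What is the capital of France?",
                "answers": [{"text": "Paris", "answer_start": 0}],
            }],
        }],
    }]
}
pairs = list(iter_qa_pairs(sample))
```

Since every language's file shares this layout, evaluating a model cross-lingually only requires pointing the same loop at a different file.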

5. MLQA

  • Dataset: https://github.com/facebookresearch/mlqa
  • Publication: https://arxiv.org/pdf/1910.07475v3.pdf (2020, ACL)
  • Coverage: English, Arabic, German, Spanish, Hindi, Vietnamese, Chinese.
  • Abstract: MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages. MLQA is highly parallel, with QA instances parallel between 4 different languages on average.

6. RELX

7. MKQA

  • Dataset: https://github.com/apple/ml-mkqa/
  • Publication: https://arxiv.org/pdf/2007.15207v2.pdf (2020, arXiv)
  • Coverage: 26 languages
  • Abstract: Multilingual Knowledge Questions and Answers (MKQA) is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total).
