Awesome ML (Machine Learning)

Datasets

Over 39 million published research papers in computer science, neuroscience, and biomedicine.

SQuAD (2016)

Stanford Question Answering Dataset (SQuAD) is a new reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets.
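To make the "answer is a span of the passage" idea concrete, here is a minimal sketch. It assumes a flattened record whose field names (`context`, `question`, `answers`, `text`, `answer_start`) follow the SQuAD v1.1 JSON layout. The passage and question are invented for illustration, and `extract_span` is a hypothetical helper, not part of any official tooling.

```python
# Toy record in the SQuAD v1.1 style. The passage and question are invented;
# only the field names are meant to mirror the real dataset.
context = "The Amazon rainforest covers much of the Amazon basin of South America."
answer_text = "the Amazon basin"

record = {
    "context": context,
    "question": "What does the Amazon rainforest cover much of?",
    "answers": [
        # answer_start is a character offset into the passage
        {"text": answer_text, "answer_start": context.index(answer_text)},
    ],
}


def extract_span(record, answer_index=0):
    """Recover an answer by slicing the passage at the stored character offset."""
    answer = record["answers"][answer_index]
    start = answer["answer_start"]
    end = start + len(answer["text"])
    return record["context"][start:end]


span = extract_span(record)
print(span)
```

Because every answer is stored as an offset into the passage, a model can be trained to predict start and end positions rather than to generate free text.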

The Chinese Text Project is an online open-access digital library that makes pre-modern Chinese texts available to readers and researchers all around the world. The site attempts to make use of the digital medium to explore new ways of interacting with these texts that are not possible in print. With over thirty thousand titles and more than five billion characters, the Chinese Text Project is also the largest database of pre-modern Chinese texts in existence.

Collected on 47,300 COCO images.
In total, it has 327,939 QA pairs, together with 1,311,756 human-generated multiple-choice options and 561,459 object groundings from 36,579 categories.

A dataset for sequential vision-to-language, built to explore how such data may be used for the task of visual storytelling.
The dataset includes 81,743 unique photos in 20,211 sequences, aligned to descriptive and story language.

Its English portion, PPDB:Eng, contains over 220 million paraphrase pairs, consisting of 73 million phrasal and 8 million lexical paraphrases, as well as 140 million paraphrase patterns, which capture many meaning-preserving syntactic transformations.

Wordbank contains data from 63,386 children and 71,003 CDI administrations across 23 languages and 44 instruments.

MIT License