pnpnpn / awesome-ml

Awesome ML (Machine Learning)

Datasets

Google's dataset search

Semantic Scholar corpus

  • Over 39 million published research papers in computer science, neuroscience, and biomedicine.

SQuAD (2016)

  • Stanford Question Answering Dataset (SQuAD) is a new reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets.
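The span-based answer format described above can be illustrated with a small sketch. This is only an illustration under stated assumptions: the `SpanAnswer` type and the `is_valid_span` helper are hypothetical names introduced here, not part of any SQuAD tooling, and the passage is a toy sentence rather than a record from the dataset.

```python
from typing import NamedTuple


class SpanAnswer(NamedTuple):
    """Hypothetical container: an answer stored as a span of the passage."""
    start: int  # character offset of the answer within the passage
    text: str   # the answer text, copied verbatim from the passage


def is_valid_span(passage: str, answer: SpanAnswer) -> bool:
    """Check that the answer text really occurs at the stated offset."""
    end = answer.start + len(answer.text)
    return passage[answer.start:end] == answer.text


# Toy example (not taken from the dataset itself).
passage = "SQuAD questions were posed by crowdworkers on a set of Wikipedia articles."
question = "Who posed the questions?"
answer = SpanAnswer(start=passage.index("crowdworkers"), text="crowdworkers")
```

Because every answer is a span, a model only has to predict a start and end position in the passage rather than generate free text, and a check like `is_valid_span(passage, answer)` can confirm that an annotation is consistent with its passage.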

Chinese Text Project

  • The Chinese Text Project is an online open-access digital library that makes pre-modern Chinese texts available to readers and researchers all around the world. The site attempts to make use of the digital medium to explore new ways of interacting with these texts that are not possible in print. With over thirty thousand titles and more than five billion characters, the Chinese Text Project is also the largest database of pre-modern Chinese texts in existence.

OpenSubtitles (2016)

  • translated movie subtitles from http://www.opensubtitles.org/
  • 65 languages, 1,850 bitexts
  • total number of files: 2,793,243
  • total number of tokens: 17.09G
  • total number of sentence fragments: 2.60G

Visual7W visual question answering dataset (2016)

  • collected on 47,300 COCO images
  • In total, it has 327,939 QA pairs, together with 1,311,756 human-generated multiple-choice options and 561,459 object groundings from 36,579 categories

Microsoft Sequential Image Narrative Dataset (SIND) (2016)

  • A dataset for sequential vision-to-language, released to explore how such data may be used for the task of visual storytelling.
  • The dataset includes 81,743 unique photos in 20,211 sequences, aligned to descriptive and story language.

1 Billion Word Language Model Benchmark (2013)

  • Training/held-out data was produced from the WMT 2011 News Crawl data
  • unpruned Katz (1.1B n-grams)
  • pruned Katz (~15M n-grams)
  • unpruned Interpolated Kneser-Ney (1.1B n-grams)
  • pruned Interpolated Kneser-Ney (~15M n-grams)

Cornell Movie-Dialogs Corpus (2011)

  • 220,579 conversational exchanges between 10,292 pairs of movie characters
  • involves 9,035 characters from 617 movies
  • in total 304,713 utterances

PPDB: The Paraphrase Database (2013)

  • Its English portion, PPDB:Eng, contains over 220 million paraphrase pairs, consisting of 73 million phrasal and 8 million lexical paraphrases, as well as 140 million paraphrase patterns, which capture many meaning-preserving syntactic transformations.

Wordbank: An open database of children's vocabulary development (2015)

  • Wordbank contains data from 63,386 children and 71,003 CDI administrations, across 23 languages and 44 instruments

Projects

Deep Text Corrector (2016)

Courses

Talks