I have found these datasets in research papers.
-
Coil-20
http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
-
MS COCO
-
NVIDIA food Image classification
-
CIFAR-10, CIFAR-100
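The python version of the CIFAR archives stores each batch as a pickled dict whose `b'data'` entry is an N x 3072 uint8 array (channel-planar R/G/B) with a matching `b'labels'` list. A minimal loader sketch, exercised here on a synthetic in-memory batch rather than a real `data_batch_1` file:

```python
import io
import pickle
import numpy as np

def load_cifar_batch(fp):
    # CIFAR-10 python batches: pickled dict with b'data' (N x 3072 uint8,
    # row-major R/G/B planes) and b'labels' (list of ints).
    d = pickle.load(fp, encoding="bytes")
    images = d[b"data"].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    return images, np.asarray(d[b"labels"])

# Synthetic stand-in for a real batch file:
fake = {b"data": np.zeros((2, 3072), dtype=np.uint8), b"labels": [3, 7]}
imgs, labels = load_cifar_batch(io.BytesIO(pickle.dumps(fake)))
print(imgs.shape)  # (2, 32, 32, 3)
```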
-
Large-scale CelebFaces Attributes (CelebA) Dataset
-
Street View House Numbers (SVHN)
-
MNIST
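MNIST is distributed as IDX binary files (`train-images-idx3-ubyte` and friends): a big-endian header with magic number 0x00000803 and the dimensions, followed by raw uint8 pixels. A parser sketch, tested on a tiny synthetic buffer instead of the real download:

```python
import struct
import numpy as np

def parse_idx_images(buf: bytes) -> np.ndarray:
    # IDX image files: magic 0x00000803, image count, rows, cols (all
    # big-endian uint32), then the pixel bytes.
    magic, n, rows, cols = struct.unpack(">IIII", buf[:16])
    assert magic == 0x00000803, "not an IDX image file"
    data = np.frombuffer(buf, dtype=np.uint8, offset=16)
    return data.reshape(n, rows, cols)

# Synthetic 2-image, 2x2 buffer standing in for train-images-idx3-ubyte:
fake = struct.pack(">IIII", 0x00000803, 2, 2, 2) + bytes(range(8))
imgs = parse_idx_images(fake)
print(imgs.shape)  # (2, 2, 2)
```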
-
Facial Database
-
Simple Vector Drawing Datasets
-
Places2 (scene photos and associated information)
-
Yelp dataset (restaurant information and photos)
-
DeepFashion
-
Image to LaTeX (a dataset that maps images of mathematical formulas to LaTeX code)
-
NIST Dataset (Fingerprint, Mugshot, OCR)
-
Biometrics Ideal Test dataset (iris, fingerprint, face, palmprint, handwriting, etc.; login required!)
-
Lung cancer dataset
-
Brain tumor dataset
-
Breast cancer dataset (kaggle)
-
The Cancer Imaging Archive
-
Mammography dataset
-
Bio Image Dataset @ IIIT Delhi
-
CAMELYON16 - metastasis detection in lymph nodes
-
CAMELYON17 Dataset https://camelyon17.grand-challenge.org/
-
YouTube-BoundingBoxes Dataset
-
YouTube-8M Dataset
-
The Kinetics Human Action Video Dataset
https://deepmind.com/research/open-source/open-source-datasets/kinetics/
-
StatMT (datasets for tasks such as machine translation and summarization, provided as language-pair corpora)
http://www.statmt.org/wmt14/translation-task.html
http://www.statmt.org/wmt15/translation-task.html
-
UN parallel Corpus
-
IWSLT Dataset (including TED Translation)
-
The Stacks Project (pairs of the rendered text of an algebraic geometry book and its LaTeX source?)
-
Google sentence compression (a dataset of sentences that Google has put into a standardized, compressed form)
http://storage.googleapis.com/sentencecomp/compression-data.json
-
The Annals of the Joseon Dynasty (Korean/Classical Chinese translations)
-
20 Newsgroups
-
Reuters-21578 dataset
https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection
-
Tweet data, a subset of TREC 2011 microblog track
-
Title data, including news titles with class labels from some news websites
-
bAbI dataset (Facebook Question Answering)
-
Question/Answering (cloze-style blank-filling) pairs using CNN/Daily Mail articles
-
Stanford Question Answering Dataset
-
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
-
WikiReading dataset
-
Datasets used for Word2Vec (Wikipedia, WMT11, etc.) https://code.google.com/archive/p/word2vec/
-
fastText pre-trained vector set
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
-
Stanford Sentiment Treebank(SST)
-
Nottingham music dataset
-
A large-scale dataset of manually annotated audio events (Google research)
-
Freebase
-
Wordnet
-
Microsoft Concept Graph
-
DBPedia Dataset
The DBpedia dataset uses a large multi-domain ontology derived from Wikipedia; localized versions of DBpedia exist in more than 100 languages.
http://wiki.dbpedia.org/services-resources/datasets/dbpedia-datasets
-
Yago
YAGO3 is a huge semantic knowledge base derived from Wikipedia, WordNet, and GeoNames.
-
Google Knowledge graph API
-
AMiner - Datasets for social network Analysis
-
Netflix Prize Data Set
http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a
-
Paper bibliography datasets, Author Citation Networks
-
Politics subreddit
-
Amazon dataset
-
Twitter Spammer network
-
Twitter tweets
-
Online reviews
-
Word2Vec
-
GloVe
-
FastText
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
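The pre-trained fastText `.vec` files are plain text: a `count dim` header line followed by one `word v1 ... v_dim` line per word (GloVe text files use the same layout minus the header). A minimal reader sketch, run here on an in-memory sample instead of a real download:

```python
import io
import numpy as np

def load_vec(fp, limit=None):
    """Parse the fastText .vec text format: a 'count dim' header line,
    then one 'word v1 ... v_dim' line per word."""
    n, dim = map(int, fp.readline().split())
    vecs = {}
    for i, line in enumerate(fp):
        if limit is not None and i >= limit:
            break  # large files: stop after the most frequent `limit` words
        parts = line.rstrip().split(" ")
        vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vecs

sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
vecs = load_vec(sample)
print(sorted(vecs))  # ['king', 'queen']
```

For the real `wiki.en.vec` (millions of rows), the `limit` parameter keeps memory bounded, since the files are sorted by word frequency.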
-
SKT Bigdata hub
-
Titanic survivors dataset
-
Obama’s political speeches
-
Yahoo Finance dataset
-
Linux code
-
NYC Taxi dataset
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
-
US Census dataset
https://www.census.gov/topics/income-poverty/income/data/datasets.html