I have found these datasets in research papers.
-
Coil-20
http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
-
MS COCO
-
NVIDIA food Image classification
-
CIFAR-10, CIFAR-100
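The python version of the CIFAR archives stores each batch as a pickled dict whose `b'data'` entry is an N x 3072 uint8 array (channel-planar R/G/B) with a matching `b'labels'` list. A minimal loader sketch, exercised here on a synthetic in-memory batch rather than a real `data_batch_1` file:

```python
import io
import pickle
import numpy as np

def load_cifar_batch(fp):
    # CIFAR-10 python batches: pickled dict with b'data' (N x 3072 uint8,
    # row-major R/G/B planes) and b'labels' (list of ints).
    d = pickle.load(fp, encoding="bytes")
    images = d[b"data"].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    return images, np.asarray(d[b"labels"])

# Synthetic stand-in for a real batch file:
fake = {b"data": np.zeros((2, 3072), dtype=np.uint8), b"labels": [3, 7]}
imgs, labels = load_cifar_batch(io.BytesIO(pickle.dumps(fake)))
print(imgs.shape)  # (2, 32, 32, 3)
```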
-
Large-scale CelebFaces Attributes (CelebA) Dataset
-
Street View House Numbers (SVHN)
-
MNIST
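MNIST is distributed as IDX binary files (`train-images-idx3-ubyte` and friends): a big-endian header with magic number 0x00000803 and the dimensions, followed by raw uint8 pixels. A parser sketch, tested on a tiny synthetic buffer instead of the real download:

```python
import struct
import numpy as np

def parse_idx_images(buf: bytes) -> np.ndarray:
    # IDX image files: magic 0x00000803, image count, rows, cols (all
    # big-endian uint32), then the pixel bytes.
    magic, n, rows, cols = struct.unpack(">IIII", buf[:16])
    assert magic == 0x00000803, "not an IDX image file"
    data = np.frombuffer(buf, dtype=np.uint8, offset=16)
    return data.reshape(n, rows, cols)

# Synthetic 2-image, 2x2 buffer standing in for train-images-idx3-ubyte:
fake = struct.pack(">IIII", 0x00000803, 2, 2, 2) + bytes(range(8))
imgs = parse_idx_images(fake)
print(imgs.shape)  # (2, 2, 2)
```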
-
Facial Database
-
Simple Vector Drawing Datasets
-
Places2 (scene photos and associated information)
-
Yelp dataset (restaurant information and photos)
-
DeepFashion
-
Image to LaTeX (a dataset that maps images of mathematical formulas to LaTeX code)
-
NIST Dataset (Fingerprint, Mugshot, OCR)
-
Biometrics Ideal Test dataset (iris, fingerprint, face, palmprint, handwriting, etc.; login required!)
-
Lung cancer dataset
-
Brain tumor dataset
-
Breast cancer dataset (kaggle)
-
The Cancer Imaging Archive
-
Mammography dataset
-
Bio Image Dataset @ IIIT Delhi
-
CAMELYON16 - metastasis detection in lymph nodes
-
CAMELYON17 Dataset https://camelyon17.grand-challenge.org/
-
YouTube-BoundingBoxes Dataset
-
YouTube-8M Dataset
-
The Kinetics Human Action Video Dataset
https://deepmind.com/research/open-source/open-source-datasets/kinetics/
-
StatMT (datasets for tasks such as machine translation and summarization, provided as language-pair corpora)
http://www.statmt.org/wmt14/translation-task.html
http://www.statmt.org/wmt15/translation-task.html
-
UN parallel Corpus
-
IWSLT Dataset (including TED Translation)
-
The Stacks Project (pairs of the rendered text of an algebraic geometry book and its LaTeX source?)
-
Google sentence compression (a dataset of sentences that Google has put into a standardized, compressed form)
http://storage.googleapis.com/sentencecomp/compression-data.json
-
The Annals of the Joseon Dynasty (Korean/Classical Chinese translations)
-
20 Newsgroups
-
Reuters-21578 dataset
https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection
-
Tweet data, a subset of TREC 2011 microblog track
-
Title data, including news titles with class labels from some news websites
-
bAbI dataset (Facebook Question Answering)
-
Question/Answering (cloze-style blank-filling) pairs using CNN/Daily Mail articles
-
Stanford Question Answering Dataset
-
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
-
WikiReading dataset
-
Datasets used for Word2Vec (Wikipedia, WMT11, etc.) https://code.google.com/archive/p/word2vec/
-
fastText pre-trained vector set
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
-
Stanford Sentiment Treebank(SST)
-
Nottingham music dataset
-
A large-scale dataset of manually annotated audio events (Google research)
-
Freebase
-
Wordnet
-
Microsoft Concept Graph
-
DBPedia Dataset
The DBpedia dataset uses a large multi-domain ontology derived from Wikipedia; localized versions of DBpedia exist in more than 100 languages.
http://wiki.dbpedia.org/services-resources/datasets/dbpedia-datasets
-
Yago
YAGO3 is a huge semantic knowledge base derived from Wikipedia, WordNet, and GeoNames.
-
Google Knowledge graph API
-
AMiner - Datasets for social network Analysis
-
Netflix Prize Data Set
http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a
-
Paper bibliography datasets, Author Citation Networks
-
Politics subreddit
-
Amazon dataset
-
Twitter Spammer network
-
Twitter tweets
-
Online reviews
-
Word2Vec
-
GloVe
-
FastText
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
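The pre-trained fastText `.vec` files are plain text: a `count dim` header line followed by one `word v1 ... v_dim` line per word (GloVe text files use the same layout minus the header). A minimal reader sketch, run here on an in-memory sample instead of a real download:

```python
import io
import numpy as np

def load_vec(fp, limit=None):
    """Parse the fastText .vec text format: a 'count dim' header line,
    then one 'word v1 ... v_dim' line per word."""
    n, dim = map(int, fp.readline().split())
    vecs = {}
    for i, line in enumerate(fp):
        if limit is not None and i >= limit:
            break  # large files: stop after the most frequent `limit` words
        parts = line.rstrip().split(" ")
        vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vecs

sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
vecs = load_vec(sample)
print(sorted(vecs))  # ['king', 'queen']
```

For the real `wiki.en.vec` (millions of rows), the `limit` parameter keeps memory bounded, since the files are sorted by word frequency.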
-
SKT Bigdata hub
-
Titanic survivors dataset
-
Obama’s political speeches
-
Yahoo Finance dataset
-
Linux code
-
NYC Taxi dataset
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
-
US Census dataset
https://www.census.gov/topics/income-poverty/income/data/datasets.html