MLDatasets.jl

MLDatasets provides access to common machine learning datasets for Julia. Currently, Julia 0.5 is supported.

The datasets are automatically downloaded to the specified directory. The default directory is MLDatasets/datasets.

Installation

julia> Pkg.clone("https://github.com/JuliaML/MLDatasets.jl.git")

Basic Usage

using MLDatasets

train_x, train_y = MNIST.traindata()
test_x, test_y = MNIST.testdata()

Pass a directory, as in traindata(<directory>) and testdata(<directory>), to store the data somewhere other than the default directory.

Available Datasets

Image Classification

CIFAR-10

The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes.

CIFAR-100

The CIFAR-100 dataset consists of 60000 32x32 color images in 100 classes, with 600 images per class. The 100 classes are grouped into 20 superclasses, so each image has a fine label (its class) and a coarse label (its superclass).

MNIST

The MNIST dataset consists of 70000 28x28 images of handwritten digits: 60000 for training and 10000 for testing.

See the MNIST sub-module for more information.

Fashion-MNIST

The Fashion-MNIST dataset consists of 70000 28x28 images of fashion products: 60000 for training and 10000 for testing. It was designed to be a drop-in replacement for the MNIST dataset.

See the FashionMNIST sub-module for more information.

Language Modeling

PTBLM

The PTBLM dataset consists of Penn Treebank sentences for language modeling, available from tomsercu/lstm. Unknown words are replaced with <unk> so that the total vocabulary size is 10000.

This is the first sentence of the PTBLM dataset.

x, y = PTBLM.traindata()

x[1]
> ["no", "it", "was", "n't", "black", "monday"]
y[1]
> ["it", "was", "n't", "black", "monday", "<eos>"]

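Here y is the next-word target for x: each sentence in y is the corresponding sentence in x shifted by one word, and MLDatasets appends the special word <eos> to the end of it.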
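The relationship between x[1] and y[1] in the example above can be sketched in a few lines. This is only an illustration inferred from that example, not the code MLDatasets uses internally; the helper name next_word_target is hypothetical.

```julia
# Build the next-word target for one tokenized sentence:
# drop the first word and append the end-of-sentence marker.
# (Hypothetical helper; not part of MLDatasets.)
function next_word_target(sentence::Vector{String})
    return vcat(sentence[2:end], ["<eos>"])
end

# The first PTBLM training sentence, copied from the example above.
x1 = ["no", "it", "was", "n't", "black", "monday"]
y1 = next_word_target(x1)
println(y1)
```

Each position i of the input is thus paired with the word at position i + 1, and the last word is paired with <eos>.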

Text Analysis (POS-Tagging, Parsing)

UD English

The UD_English dataset is a corpus annotated with morphological features, POS tags, and syntactic trees. The dataset follows the CoNLL-style format.

traindata = UD_English.traindata()
devdata = UD_English.devdata()
testdata = UD_English.testdata()
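For readers unfamiliar with the format: CoNLL-U, the format used by Universal Dependencies treebanks, stores one token per line as ten tab-separated fields. The sketch below splits such a line into named fields. It assumes raw CoNLL-U text as input; whether the values returned by UD_English.traindata() have this layout is not stated here, so treat that as an assumption. The function parse_conllu_line and the sample line are hypothetical and written by hand for illustration.

```julia
# Names of the ten tab-separated CoNLL-U columns, in order.
const CONLLU_FIELDS = (:id, :form, :lemma, :upos, :xpos,
                       :feats, :head, :deprel, :deps, :misc)

# Split one token line into a Dict mapping field name to its string value.
# (Hypothetical helper; not part of MLDatasets.)
function parse_conllu_line(line::AbstractString)
    cols = split(chomp(line), '\t')
    length(cols) == length(CONLLU_FIELDS) || error("expected 10 tab-separated fields")
    return Dict(zip(CONLLU_FIELDS, cols))
end

# Hand-written sample line, not read from the dataset.
token = parse_conllu_line("1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_")
println(token[:form], " / ", token[:upos], " / head = ", token[:head])
```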

Data Size

Dataset	Type	Train x	Train y	Test x	Test y
CIFAR-10	image	32x32x3x50000	50000	32x32x3x10000	10000
CIFAR-100	image	32x32x3x50000	2x50000	32x32x3x10000	2x10000
MNIST	image	28x28x60000	60000	28x28x10000	10000
FashionMNIST	image	28x28x60000	60000	28x28x10000	10000
PTBLM	text	42068	42068	3761	3761
UD_English	text	12543	-	2077	-
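As the sizes in the table show, image arrays keep the observation index in the last dimension. Many models expect one flattened feature vector per observation instead. Below is a minimal sketch of that reshaping, using a random array as a stand-in for the MNIST layout (28x28xN) so that nothing has to be downloaded. The helper name flatten_images is hypothetical and not part of MLDatasets.

```julia
# Reshape a height x width x N image array into a (height*width) x N matrix,
# so that each column holds one flattened image.
# (Hypothetical helper; not part of MLDatasets.)
function flatten_images(x::AbstractArray)
    ndims(x) == 3 || error("expected a height x width x N array")
    h, w, n = size(x)
    return reshape(x, h * w, n)
end

# Synthetic stand-in with the MNIST layout from the table; not real data.
fake_x = rand(Float32, 28, 28, 5)
flat_x = flatten_images(fake_x)
println(size(flat_x))
```

Because Julia arrays are column-major, column i of the result contains the pixels of image i in the same order as vec(fake_x[:, :, i]). The same idea extends to the four-dimensional CIFAR arrays by folding the first three dimensions together.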
