MLDatasets provides access to common machine learning datasets for Julia. Currently, Julia 0.5 is supported.
The datasets are automatically downloaded to the specified directory. The default directory is MLDatasets/datasets.
julia> Pkg.clone("https://github.com/JuliaML/MLDatasets.jl.git")
using MLDatasets
train_x, train_y = MNIST.traindata()
test_x, test_y = MNIST.testdata()
Use traindata(<directory>) and testdata(<directory>) to change the default directory.
The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes.
The CIFAR-100 dataset consists of 60000 32x32 color images in 100 classes, with 600 images per class. The 100 classes are grouped into 20 superclasses, so each image carries both a fine and a coarse label.
The MNIST dataset consists of 60000 training and 10000 test 28x28 images of handwritten digits.
Take a look at the sub-module for more information.
The Fashion-MNIST dataset consists of 60000 training and 10000 test 28x28 images of fashion products. It was designed to be a drop-in replacement for the MNIST dataset.
Take a look at the sub-module for more information.
The PTBLM dataset consists of Penn Treebank sentences for language modeling, available from tomsercu/lstm. Unknown words are replaced with <unk> so that the total vocabulary size becomes 10000.
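The `<unk>` replacement described above is already applied to the data you load. Purely as an illustration of the idea, here is a minimal sketch. The helper `replace_unknown` and the three-word vocabulary are hypothetical and not part of MLDatasets.

```julia
# Replace every word that is not in the vocabulary with "<unk>".
# `vocab` is a tiny made-up vocabulary, not the real 10000-word PTB one.
replace_unknown(words, vocab) = [w in vocab ? w : "<unk>" for w in words]

vocab = Set(["the", "market", "fell"])
sentence = ["the", "market", "plummeted"]

cleaned = replace_unknown(sentence, vocab)
@assert cleaned == ["the", "market", "<unk>"]
```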
The first sentence of the PTBLM dataset looks like this:
x, y = PTBLM.traindata()
x[1]
> ["no", "it", "was", "n't", "black", "monday"]
y[1]
> ["it", "was", "n't", "black", "monday", "<eos>"]
Here y[1] is x[1] shifted by one word, and MLDatasets adds the special word <eos> to the end of y.
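The relationship between `x` and `y` can be reproduced without downloading anything. The sketch below rebuilds the target from the input sentence shown above. `shift_target` is a hypothetical helper written for this example, not a function exported by MLDatasets.

```julia
# Build a language-modeling target from an input sentence:
# drop the first word and append the end-of-sentence marker "<eos>".
shift_target(x::Vector{String}) = vcat(x[2:end], "<eos>")

x1 = ["no", "it", "was", "n't", "black", "monday"]
y1 = shift_target(x1)

# y1 matches the y[1] shown in the README example.
@assert y1 == ["it", "was", "n't", "black", "monday", "<eos>"]
```

At each position the model sees `x1[i]` and is expected to predict `y1[i]`, so both vectors have the same length.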
The UD_English dataset is an annotated corpus of morphological features, POS-tags and syntactic trees. The dataset follows CoNLL-style format.
traindata = UD_English.traindata()
devdata = UD_English.devdata()
testdata = UD_English.testdata()
Dataset | Type | Train x | Train y | Test x | Test y
---|---|---|---|---|---
CIFAR-10 | image | 32x32x3x50000 | 50000 | 32x32x3x10000 | 10000 |
CIFAR-100 | image | 32x32x3x50000 | 2x50000 | 32x32x3x10000 | 2x10000 |
MNIST | image | 28x28x60000 | 60000 | 28x28x10000 | 10000 |
FashionMNIST | image | 28x28x60000 | 60000 | 28x28x10000 | 10000 |
PTBLM | text | 42068 | 42068 | 3761 | 3761 |
UD_English | text | 12543 | - | 2077 | - |
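
The image arrays in the table keep the sample index in the last dimension. As a minimal sketch of working with that layout, the snippet below uses a small random array as a stand-in for the MNIST training data (an assumption, so that it runs without a download) and flattens each image into a column.

```julia
# Stand-in for `MNIST.traindata()`: random values with the same layout as the
# table (height x width x samples), but only 100 samples to keep it small.
x = rand(28, 28, 100)
y = rand(0:9, 100)

# The last dimension indexes samples, so x[:, :, i] is the i-th image.
first_image = x[:, :, 1]
@assert size(first_image) == (28, 28)

# Flatten each 28x28 image into one 784-element column.
X = reshape(x, 28 * 28, size(x, 3))
@assert size(X) == (784, 100)

# Column i of X holds the same pixels as image i, in column-major order.
@assert X[:, 1] == vec(first_image)
```

The same pattern applies to the CIFAR arrays, where the color channel adds a third dimension before the sample index.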