dfdx / MLDatasets.jl

Machine Learning Datasets for Julia

MLDatasets.jl

MLDatasets provides access to common machine learning datasets for Julia. Currently, Julia 0.5 is supported.

The datasets are automatically downloaded to the specified directory. The default directory is MLDatasets/datasets.

Installation

julia> Pkg.clone("https://github.com/JuliaML/MLDatasets.jl.git")

Basic Usage

using MLDatasets

train_x, train_y = MNIST.traindata()
test_x, test_y = MNIST.testdata()

Pass a directory, as in traindata(<directory>) and testdata(<directory>), to store the data somewhere other than the default directory.

Available Datasets

Image Classification

CIFAR-10

The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes.

CIFAR-100

The CIFAR-100 dataset consists of 60000 32x32 color images in 100 classes, with 600 images per class. The 100 classes are grouped into 20 superclasses, so each image has a fine and a coarse label.

MNIST

The MNIST dataset consists of 28x28 images of handwritten digits: 60000 for training and 10000 for testing.

Take a look at the sub-module for more information.

Fashion-MNIST

The Fashion-MNIST dataset consists of 28x28 images of fashion products: 60000 for training and 10000 for testing. It was designed to be a drop-in replacement for the MNIST dataset.

Take a look at the sub-module for more information.

Language Modeling

PTBLM

The PTBLM dataset consists of Penn Treebank sentences for language modeling, available from tomsercu/lstm. Unknown words are replaced with <unk> so that the total vocabulary size is 10000.

This is the first sentence of the PTBLM dataset:

x, y = PTBLM.traindata()

x[1]
> ["no", "it", "was", "n't", "black", "monday"]
y[1]
> ["it", "was", "n't", "black", "monday", "<eos>"]

Here MLDatasets appends the special word <eos> to the end of each target sequence in y.
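The relationship between x[1] and y[1] above can be reproduced in a few lines of plain Julia. This is only a sketch of the idea (each target is the next word of the input, with <eos> closing the sequence). The helper name next_word_targets is hypothetical and not part of MLDatasets, and the package's own implementation may differ.

```julia
# Hypothetical helper (not part of MLDatasets): build next-word targets
# by dropping the first word and appending an end-of-sentence marker.
next_word_targets(x; eos = "<eos>") = vcat(x[2:end], eos)

# The first PTBLM sentence, copied from the example above.
x1 = ["no", "it", "was", "n't", "black", "monday"]
y1 = next_word_targets(x1)

# Each input word is paired with the word the model should predict next.
for (input, target) in zip(x1, y1)
    println(input, " -> ", target)
end
```

Because x and y have the same length, they can be zipped position by position when building training pairs for a language model.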

Text Analysis (POS-Tagging, Parsing)

UD English

The UD_English dataset is a corpus annotated with morphological features, POS tags and syntactic trees. The data follows a CoNLL-style format.

traindata = UD_English.traindata()
devdata = UD_English.devdata()
testdata = UD_English.testdata()
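To make "CoNLL-style" concrete, the sketch below parses raw text in the CoNLL-U layout used by Universal Dependencies: one token per line, with ten tab-separated columns. It does not use MLDatasets, and it makes no claim about the structure that UD_English.traindata() returns. The function name parse_conllu_sentence and the sample sentence are hand-written for illustration and are not taken from the dataset.

```julia
# The ten columns defined by the CoNLL-U format, in order.
const CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                       "feats", "head", "deprel", "deps", "misc"]

# Hypothetical helper: turn one CoNLL-U sentence block into a list of
# token dictionaries, skipping blank lines and "#" comment lines.
function parse_conllu_sentence(text)
    tokens = Vector{Dict{String,String}}()
    for line in split(text, '\n')
        line = strip(line)
        (isempty(line) || startswith(line, "#")) && continue
        cols = split(line, '\t')
        length(cols) == length(CONLLU_FIELDS) ||
            error("expected 10 columns, got ", length(cols))
        push!(tokens, Dict(zip(CONLLU_FIELDS, map(String, cols))))
    end
    return tokens
end

# Hand-written sample (NOT from UD_English).
sample = "# text = The dog barks\n" *
         "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_\n" *
         "2\tdog\tdog\tNOUN\tNN\tNumber=Sing\t3\tnsubj\t_\t_\n" *
         "3\tbarks\tbark\tVERB\tVBZ\tNumber=Sing|Person=3\t0\troot\t_\t_\n"

tokens = parse_conllu_sentence(sample)
for t in tokens
    println(t["form"], "\t", t["upos"], "\thead=", t["head"], "\t", t["deprel"])
end
```

The head column holds the id of the governing token (0 for the root), which is how the syntactic tree is encoded in a flat file.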

Data Size

Dataset       Type   Train x         Train y  Test x          Test y
CIFAR-10      image  32x32x3x50000   50000    32x32x3x10000   10000
CIFAR-100     image  32x32x3x50000   2x50000  32x32x3x10000   2x10000
MNIST         image  28x28x60000     60000    28x28x10000     10000
FashionMNIST  image  28x28x60000     60000    28x28x10000     10000
PTBLM         text   42068           42068    3761            3761
UD_English    text   12543           -        2077            -
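The image shapes in the table put the observation index in the last dimension. The sketch below shows how arrays with that layout can be indexed and flattened. It uses small random arrays as stand-ins (100 and 50 samples, not real data) so that it runs without downloading anything. The variable names are illustrative, and the element type of the arrays returned by the package is not assumed here.

```julia
# Stand-ins with the same layout as the table (random values, NOT real data):
# MNIST-like 28x28xN and CIFAR-10-like 32x32x3xN, with small N.
mnist_x = rand(Float64, 28, 28, 100)
mnist_y = rand(0:9, 100)
cifar_x = rand(Float64, 32, 32, 3, 50)

# The number of observations is the size of the last dimension.
nobs = size(mnist_x, ndims(mnist_x))

# One grayscale image is a 28x28 slice; one color image is 32x32x3.
first_image = mnist_x[:, :, 1]
first_color_image = cifar_x[:, :, :, 1]

# Flatten every image into a column: 784 features x nobs observations.
features = reshape(mnist_x, div(length(mnist_x), nobs), nobs)

println(size(first_image))
println(size(first_color_image))
println(size(features))
```

Since Julia arrays are column-major, reshape keeps each image contiguous, so column i of features holds exactly the pixels of image i.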

License: MIT License

