A Swiss army knife for machine learning.
This package does not implement specific machine learning algorithms. Instead, it provides a collection of useful tools to support machine learning programs, including:
- Data manipulation
- Standardization
- Score-based classification
- Cross validation
- Performance evaluation (e.g., computing ROC)
- Computing deviations between arrays
The following functions are provided for data manipulation (a usage sketch follows the list):

- `repeach(a, n)`

  Repeat each element in vector `a` for `n` times. Here, `n` can be either a scalar or a vector of the same length as `a`.

  ```julia
  using MLBase

  repeach(1:3, 2)        # --> [1, 1, 2, 2, 3, 3]
  repeach(1:3, [3,2,1])  # --> [1, 1, 1, 2, 2, 3]
  ```

- `repeachcol(a, n)`

  Repeat each column in matrix `a` for `n` times. Here, `n` can be either a scalar or a vector with `length(n) == size(a,2)`.

- `repeachrow(a, n)`

  Repeat each row in matrix `a` for `n` times. Here, `n` can be either a scalar or a vector with `length(n) == size(a,1)`.

- `counteq(a, b)`

  Count the number of occurrences of `a[i] == b[i]`.

- `countne(a, b)`

  Count the number of occurrences of `a[i] != b[i]`.
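The sketch below illustrates the remaining functions; the annotated outputs are what the descriptions above imply, not captured REPL output.

```julia
using MLBase

A = [1 2;
     3 4]

repeachcol(A, 2)        # --> [1 1 2 2; 3 3 4 4]
repeachrow(A, [2, 1])   # --> [1 2; 1 2; 3 4]

counteq([1, 2, 3, 4], [1, 2, 0, 4])   # --> 3
countne([1, 2, 3, 4], [1, 2, 0, 4])   # --> 1
```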
Sometimes, it might be desirable to standardize a set of data before feeding it to a machine learning task (e.g. PCA), in order to balance the contributions of different components.
The package provides a `Standardize` type to capture the standardization transform, which is defined as follows:

```julia
immutable Standardize
    dim::Int
    mean::Vector{Float64}
    scale::Vector{Float64}
end
```
Applying a standardization transform `t` to a vector `x` is defined as:

```julia
y[i] = t.scale[i] * (x[i] - t.mean[i])
```

Here, `t.scale[i]` is the inverse of the standard deviation of the i-th variable. After standardization, each component has zero mean and unit standard deviation.
Note that either the `mean` field or the `scale` field is allowed to be empty, which indicates that the corresponding step (shifting the mean or scaling the components) is not applied.
- `estimate(Standardize, X[; center=true, scale=true])`

  Estimate a standardization transform from a given data set `X`. This package follows the convention that each column of `X` is an observation and each row is a component/variable.

- `standardize(X[; center=true, scale=true])`

  Estimate a standardization transform from `X` and apply it to `X`. It returns a pair `(Y, t)`, where `Y` is the transformed data matrix and `t` is an instance of `Standardize` that represents the estimated transform.

- `standardize!(X[; center=true, scale=true])`

  Similar to `standardize`, except that the transformation of `X` happens in place.

- `transform(t, X)`

  Apply a standardization transform `t` to `X` and return the transformed vector/matrix.

- `transform!(t, X)`

  Apply a standardization transform `t` to `X` in place and return `X`. (A usage sketch of these functions is given below.)
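The following is a minimal sketch of these functions; `X` is hypothetical data, with each column treated as an observation as stated above.

```julia
using MLBase

X = rand(3, 100)                 # 3 variables, 100 observations

(Y, t) = standardize(X)          # estimate the transform and apply it to X
# each row of Y now has (approximately) zero mean and unit standard deviation

t2 = estimate(Standardize, X)    # estimate the transform without applying it
Z  = transform(t2, rand(3, 10))  # apply it to new data over the same variables
```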
In machine learning, we often need to first attach an integer label to each class. This package provides a type `LabelMap` that captures the association between discrete values (e.g., a finite set of strings) and integer labels.
Together with `LabelMap`, the package also provides a function `labelmap` to construct the map from a sequence of discrete values, and a function `labelencode` to map discrete values to integer labels.
```julia
julia> lm = labelmap(["a", "a", "b", "b", "c"])
LabelMap (with 3 labels):
[1] a
[2] b
[3] c

julia> labelencode(lm, "b")
2

julia> labelencode(lm, ["a", "c", "b"])
3-element Array{Int64,1}:
 1
 3
 2
```
Note that `labelencode` can be applied to either a single value or an array.

The package also provides a function `groupindices` to group indices based on associated labels.
```julia
julia> groupindices(3, [1, 1, 1, 2, 2, 3, 2])
3-element Array{Array{Int64,1},1}:
 [1,2,3]
 [4,5,7]
 [6]

# using lm as constructed above
julia> groupindices(lm, ["a", "a", "c", "b", "b"])
3-element Array{Array{Int64,1},1}:
 [1,2]
 [4,5]
 [3]
```
No matter how sophisticated a classification framework is, the entire classification task generally consists of two steps: (1) assign a score/distance to each class, and (2) choose the class that yields the highest score/lowest distance.
This package provides a function `classify` and its friends to accomplish the second step.
- `classify(x[, ord])`

  Classify based on scores given in `x` and the order of scores specified in `ord`.

  Generally, `ord` can be any instance of type `Ordering`. However, it is usually enough to use either `Forward` or `Reverse`:

  - `ord = Forward`: a higher value indicates a better match (e.g., similarity)
  - `ord = Reverse`: a lower value indicates a better match (e.g., distance)

  When `ord` is omitted, it defaults to `Forward`.

  When `x` is a vector, it produces an integer label. When `x` is a matrix, it produces a vector of integers, one for each column of `x`.

  ```julia
  classify([0.2, 0.5, 0.3])           # --> 2
  classify([0.2, 0.5, 0.3], Forward)  # --> 2
  classify([0.2, 0.5, 0.3], Reverse)  # --> 1

  classify([0.2 0.5 0.3; 0.7 0.6 0.2]')           # --> [2, 1]
  classify([0.2 0.5 0.3; 0.7 0.6 0.2]', Forward)  # --> [2, 1]
  classify([0.2 0.5 0.3; 0.7 0.6 0.2]', Reverse)  # --> [1, 3]
  ```
- `classify!(r, x[, ord])`

  Write the predicted labels to `r`.

- `classify_withscore(x[, ord])`

  Return a pair `(label, score)`, where `score` is the input score corresponding to the predicted label (see the sketch after this list).

- `classify_withscores(x[, ord])`

  This function applies to a matrix `x` comprised of multiple samples (each being a column). It returns a pair `(labels, scores)`.

- `classify_withscores!(r, s, x[, ord])`

  Write the predicted labels to `r` and the corresponding scores to `s`.
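The sketch below illustrates the score-returning variants, reusing the inputs from the `classify` examples above; the annotated outputs are what the descriptions above imply rather than captured REPL output.

```julia
using MLBase

classify_withscore([0.2, 0.5, 0.3])           # --> (2, 0.5)
classify_withscore([0.2, 0.5, 0.3], Reverse)  # --> (1, 0.2)

classify_withscores([0.2 0.5 0.3; 0.7 0.6 0.2]')  # --> ([2, 1], [0.5, 0.7])
```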
This package implements several cross validation schemes: `Kfold`, `LOOCV`, and `RandomSub`. Each scheme is an iterable object, of which each element is a vector of indices (the indices of samples selected for training).
- `Kfold(n, k)`

  `k`-fold cross validation over a set of `n` samples, which are randomly partitioned into `k` disjoint subsets of nearly the same size.

  ```julia
  julia> collect(Kfold(10, 3))
  3-element Array{Any,1}:
   [1,2,7]
   [4,5,8,9]
   [3,6,10]
  ```

- `LOOCV(n)`

  Leave-one-out cross validation over a set of `n` samples.

  ```julia
  julia> collect(LOOCV(4))
  4-element Array{Any,1}:
   [2,3,4]
   [1,3,4]
   [1,2,4]
   [1,2,3]
  ```

- `RandomSub(n, sn, k)`

  Repeated random subsampling. In particular, this generates `k` subsets of length `sn` from a data set with `n` samples.

  ```julia
  julia> collect(RandomSub(10, 5, 3))
  3-element Array{Any,1}:
   [1,2,5,8,9]
   [2,5,7,8,10]
   [1,3,5,6,7]
  ```
The package also provides a function `cross_validate`, described below, to run a cross validation procedure.
- `cross_validate(estfun, evalfun, n, gen, ord)`

  Run a cross validation procedure.

  - `estfun`: the estimation function, which takes a vector of training indices as input and returns a learned model, as `model = estfun(train_inds)`.
  - `evalfun`: the evaluation function, which takes a model and a vector of testing indices as input and returns a score that indicates the goodness of the model, as `score = evalfun(model, test_inds)`.
  - `n`: the total number of samples.
  - `gen`: an iterable object that provides training indices, e.g., one of the cross validation schemes listed above.
  - `ord`: the ordering of the evaluated score. `ord = Forward` means that a higher score indicates a better model; `ord = Reverse` means that a lower score indicates a better model.

  This function returns a tuple `(best_model, best_score, best_indices)`.

  Here is a full example:
  ```julia
  # A simple example to demonstrate the use of cross validation
  #
  # Here, we consider a simple model: using a mean vector to represent
  # a set of samples. The goodness of the model is assessed in terms
  # of the RMSE (root-mean-square-error) evaluated on the testing set.
  #
  using MLBase

  # functions
  compute_center(X::Matrix{Float64}) = vec(mean(X, 2))

  compute_rmse(c::Vector{Float64}, X::Matrix{Float64}) =
      sqrt(mean(sum(abs2(X .- c), 1)))

  # data
  const n = 200
  const data = [2., 3.] .+ randn(2, n)

  # cross validation
  (c, v, inds) = cross_validate(
      inds -> compute_center(data[:, inds]),        # training function
      (c, inds) -> compute_rmse(c, data[:, inds]),  # evaluation function
      n,              # total number of samples
      Kfold(n, 5),    # cross validation plan: 5-fold
      Reverse)        # smaller score indicates better model
  ```
  Please refer to `examples/crossval.jl` for the entire script.
This package provides tools to assess the performance of a machine learning algorithm.
- `correctrate(gt, pred)`

  Compute the correct rate of predictions given by `pred` w.r.t. the ground truths given in `gt`.

- `errorrate(gt, pred)`

  Compute the error rate of predictions given by `pred` w.r.t. the ground truths given in `gt`.
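A minimal sketch of these two functions with hypothetical labels; the annotated values follow from the definitions above.

```julia
using MLBase

gt   = [1, 1, 2, 2, 3, 3]
pred = [1, 1, 2, 3, 3, 3]

correctrate(gt, pred)   # --> 5/6 ≈ 0.8333
errorrate(gt, pred)     # --> 1/6 ≈ 0.1667
```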
- `hitrate(gt, ranklist, k)`

  Compute the hit rate at rank `k` for a ranked list of predictions given by `ranklist` w.r.t. the ground truths given in `gt`.

  In particular, if `gt[i]` is contained in `ranklist[1:k, i]`, then the prediction for the `i`-th sample is said to be hit within rank `k`. The hit rate at rank `k` is the fraction of predictions that hit within rank `k`.

- `hitrates(gt, ranklist, ks)`

  Compute the hit rates at multiple ranks (as given by a vector `ks`). It returns a vector of hit rates `r`, where `r[i]` corresponds to the rank `ks[i]`.

  Note that computing hit rates for multiple ranks jointly is more efficient than computing them separately.
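The following sketch uses a hypothetical ranked list; the annotated results follow from the definition above.

```julia
using MLBase

gt = [2, 3]          # ground-truth labels for two samples

ranklist = [2 1;     # each column is a ranked list of predicted labels
            1 3;     # for the corresponding sample (best guess first)
            3 2]

hitrate(gt, ranklist, 1)        # --> 0.5  (only the first sample is hit at rank 1)
hitrates(gt, ranklist, [1, 2])  # --> [0.5, 1.0]
```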
ROC (Receiver Operating Characteristics) is often used to measure the performance of a detector, thresholded classifier, or a verification algorithm.
This package uses an immutable type `ROCNums`, defined below, to capture the ROC of an experiment:

```julia
immutable ROCNums{T<:Real}
    p::T     # positive in ground-truth
    n::T     # negative in ground-truth
    tp::T    # correct positive prediction
    tn::T    # correct negative prediction
    fp::T    # (incorrect) positive prediction when ground-truth is negative
    fn::T    # (incorrect) negative prediction when ground-truth is positive
end
```
One can compute a variety of performance measurements from an instance of `ROCNums` (say `r`):
- `true_positive(r)`

  The number of true positives (`r.tp`).

- `true_negative(r)`

  The number of true negatives (`r.tn`).

- `false_positive(r)`

  The number of false positives (`r.fp`).

- `false_negative(r)`

  The number of false negatives (`r.fn`).

- `true_positive_rate(r)`

  The fraction of positive samples correctly predicted as positive, defined as `r.tp / r.p`.

- `true_negative_rate(r)`

  The fraction of negative samples correctly predicted as negative, defined as `r.tn / r.n`.

- `false_positive_rate(r)`

  The fraction of negative samples incorrectly predicted as positive, defined as `r.fp / r.n`.

- `false_negative_rate(r)`

  The fraction of positive samples incorrectly predicted as negative, defined as `r.fn / r.p`.

- `recall(r)`

  Equivalent to `true_positive_rate(r)`.

- `precision(r)`

  The fraction of positive predictions that are correct, defined as `r.tp / (r.tp + r.fp)`.

- `f1score(r)`

  The harmonic mean of `recall(r)` and `precision(r)`.
The package provides a function `roc` to compute an instance of `ROCNums`, or a sequence of such instances, from predictions.
- `roc(gt, pred)`

  Compute an ROC instance based on the ground-truths given in `gt` and the predictions given in `pred`.

- `roc(gt, scores, thres[, ord])`

  Compute an ROC instance based on scores and a threshold `thres`.

  Prediction is made as follows:

  - when `ord = Forward`: predict 1 when `scores[i] >= thres`, otherwise 0.
  - when `ord = Reverse`: predict 1 when `scores[i] <= thres`, otherwise 0.

  When `ord` is omitted, it defaults to `Forward`.

  When `thres` is a single number, it produces a single `ROCNums` instance; when `thres` is a vector, it produces a vector of `ROCNums` instances. Jointly evaluating the ROC for multiple thresholds is generally much faster than evaluating them individually.

- `roc(gt, (preds, scores), thres[, ord])`

  Compute an ROC instance based on (unthresholded) predictions, scores, and a threshold `thres`.

  Prediction is made as follows:

  - when `ord = Forward`: predict `preds[i]` when `scores[i] >= thres`, otherwise 0.
  - when `ord = Reverse`: predict `preds[i]` when `scores[i] <= thres`, otherwise 0.

  When `ord` is omitted, it defaults to `Forward`.

  When `thres` is a single number, it produces a single `ROCNums` instance; when `thres` is a vector, it produces a vector of `ROCNums` instances. Jointly evaluating the ROC for multiple thresholds is generally much faster than evaluating them individually.

- `roc(gt, scores, n[, ord])`
- `roc(gt, (preds, scores), n[, ord])`

  Compute a sequence of ROC instances for `n` evenly spaced thresholds from `minimum(scores)` to `maximum(scores)`.

- `roc(gt, scores, ord)`
- `roc(gt, (preds, scores), ord)`

  Respectively equivalent to `roc(gt, scores, 100, ord)` and `roc(gt, (preds, scores), 100, ord)`.

- `roc(gt, scores)`
- `roc(gt, (preds, scores))`

  Respectively equivalent to `roc(gt, scores, 100, Forward)` and `roc(gt, (preds, scores), 100, Forward)`. (A usage sketch follows this list.)
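A minimal sketch of thresholded ROC evaluation; `gt` and `scores` are hypothetical, and the annotated values follow from the definitions above.

```julia
using MLBase

gt     = [1, 1, 1, 1, 0, 0, 0, 0]   # ground-truth labels (1 = positive, 0 = negative)
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

r = roc(gt, scores, 0.5)            # a single threshold --> a single ROCNums instance

true_positive_rate(r)    # --> 0.75  (3 of the 4 positives have scores >= 0.5)
false_positive_rate(r)   # --> 0.25  (1 of the 4 negatives has a score >= 0.5)
precision(r)             # --> 0.75
f1score(r)               # --> 0.75

rs = roc(gt, scores, [0.2, 0.4, 0.6, 0.8])   # a vector of thresholds --> a vector of ROCNums
```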
For many machine learning algorithms, the primary goal is to minimize the deviation between the ground-truth signals and the reconstructed ones. This package provides functions to compute various deviations:
- `sqL2dist(a, b)`

  Squared L2 distance between `a` and `b`.

- `L2dist(a, b)`

  L2 distance between `a` and `b`.

- `L1dist(a, b)`

  L1 distance between `a` and `b`.

- `Linfdist(a, b)`

  L-infinity distance between `a` and `b`.

- `gkldiv(a, b)`

  Generalized Kullback-Leibler divergence between two arrays `a` and `b`, defined as `sum(a .* log(a ./ b) - a + b)`. Note: when `sum(a) == 1` and `sum(b) == 1`, it reduces to the KL-divergence in the standard sense.

- `meanad(a, b)`

  Mean absolute deviation between `a` and `b`, i.e. `mean(abs(a - b))`.

- `maxad(a, b)`

  Maximum absolute deviation between `a` and `b`, i.e. `maximum(abs(a - b))`.

- `msd(a, b)`

  Mean squared deviation between `a` and `b`, i.e. `mean(abs2(a - b))`.

- `rmsd(a, b)`

  Root mean squared deviation, i.e. `sqrt(msd(a, b))`.

- `nrmsd(a, b)`

  Normalized root mean squared deviation, i.e. `rmsd(a, b) / (maximum(a) - minimum(a))`.

- `psnr(a, b, maxv)`

  Peak signal-to-noise ratio, i.e. `10 * log10(maxv^2 / msd(a, b))`.
Note: all these functions are implemented in a reasonably efficient way, without creating intermediate temporary arrays.
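A minimal sketch with hypothetical vectors; the annotated values follow directly from the definitions above.

```julia
using MLBase

a = [1.0, 2.0, 3.0]
b = [1.0, 2.0, 5.0]

sqL2dist(a, b)   # --> 4.0
L2dist(a, b)     # --> 2.0
L1dist(a, b)     # --> 2.0
Linfdist(a, b)   # --> 2.0
meanad(a, b)     # --> 2/3
msd(a, b)        # --> 4/3
rmsd(a, b)       # --> sqrt(4/3) ≈ 1.1547
```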