Ensembles and hyperparameter optimization for clustering pipelines.
A clustering pipeline consists of learning and applying a feature transformation, computing distances between observations, and fitting a clustering model. Any parameters of the feature transformation, distance computation and clustering algorithm can be included in the ensemble hyperparameters.
Commonly used feature transformations, distance metrics and clustering algorithms are supported. Any step of the pipeline can be extended by passing custom functions to the ensemble.
Cluster assignments from multiple runs are combined with a consensus clustering model. The consensus is computed by constructing a similarity matrix of co-occurrences from cluster assignments, and fitting a hierarchical clustering model on the similarity matrix. The silhouette score is used to estimate the number of clusters for the final clustering.
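The first step of the consensus, building the co-occurrence similarity matrix, can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name cooccurrence_similarity is hypothetical, and it assumes every run labels every observation (no subsampling). The hierarchical clustering and silhouette-based selection steps are not shown.

```python
import numpy as np

def cooccurrence_similarity(label_runs):
    """Fraction of runs in which each pair of observations shares a cluster.

    Hypothetical helper for illustration; assumes no observation subsampling.
    """
    label_runs = np.asarray(label_runs)  # shape (n_runs, n_observations)
    n_runs, n_obs = label_runs.shape
    similarity = np.zeros((n_obs, n_obs))
    for labels in label_runs:
        # True where observations i and j received the same label in this run.
        similarity += labels[:, None] == labels[None, :]
    return similarity / n_runs

runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same partition as the first run, with labels swapped
    [0, 0, 0, 1],
]
S = cooccurrence_similarity(runs)
```

Because only label equality within a run matters, the matrix is unaffected by how each run names its clusters. One plausible next step is to pass 1 - S as a distance matrix to a hierarchical clustering routine.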
An alternative to aggregating multiple clusterings is to search for a single configuration that optimizes a clustering evaluation metric. Commonly used intrinsic and extrinsic clustering evaluation metrics can be used in a randomized or pre-specified configuration search.
A HyperEnsemble can be fully specified with three dictionaries: options, parameters and definitions.
The options dict maps parameter keys to parameter values for the clustering algorithms, feature transformations, distance computations, and the rate of observation subsampling.
A categorical distribution over values can be specified by setting a parameter value to a dict containing either the 'choices' or the 'irange' key. 'choices' lists the possible parameter values explicitly, while 'irange' specifies a range of ints in the same way as the arguments of the range() function.
The 'weights' key can be used to define the relative probability of each possible value; if it is omitted, the distribution is uniform.
Examples:
options = {'n_clusters': {'irange': [2, 5]}}
The 'n_clusters' value will be chosen among 2, 3 or 4 with equal probability.
options = {'n_clusters': {'choices': [2, 3, 4], 'weights': [1, 2, 1]}}
The 'n_clusters' value will be chosen among 2, 3 or 4 with probability proportional to the values in the 'weights' list.
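The sampling rules described above can be sketched with the standard library alone. The helper names sample_value and sample_configuration are hypothetical and are not part of the package; the sketch only assumes the 'choices', 'irange' and 'weights' semantics stated in this section.

```python
import random

def sample_value(spec, rng=random):
    """Draw one value from an options entry; fixed values are returned as-is."""
    if not isinstance(spec, dict):
        return spec
    if 'choices' in spec:
        values = list(spec['choices'])
    elif 'irange' in spec:
        # Same semantics as range(): the upper bound is excluded.
        values = list(range(*spec['irange']))
    else:
        raise ValueError("expected a 'choices' or 'irange' key")
    weights = spec.get('weights')  # None yields a uniform distribution
    return rng.choices(values, weights=weights, k=1)[0]

def sample_configuration(options, rng=random):
    """Sample every entry of an options dict independently."""
    return {key: sample_value(spec, rng) for key, spec in options.items()}

options = {'n_clusters': {'irange': [2, 5]},
           'model': {'choices': ['hac', 'kmeans'], 'weights': [1, 2]},
           'linkage': 'single'}
config = sample_configuration(options, random.Random(0))
```

Passing a seeded random.Random instance makes the sampled configuration reproducible.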
The parameters dict specifies the dependencies between parameters in the options dict.
For example, parameters = {'hac': ['n_clusters', 'linkage', 'metric']} means that the option 'hac' requires 'n_clusters', 'linkage' and 'metric' to be specified.
Providing a parameters dict extends and/or overwrites items in the default parameters dict.
The default parameters dict is:
parameters = {'hac': ['linkage', 'n_clusters', 'metric'],
              'hacsingle': ['n_clusters', 'metric'],
              'hacaverage': ['n_clusters', 'metric'],
              'kmeans': ['n_clusters'],
              'mbkmeans': ['n_clusters'],
              'dbscan': ['eps', 'min_samples', 'metric'],
              'rbf': ['gamma'],
              'svd': ['n_components'],
              'autoencoder': ['layer_dims'],
              'convautoencoder': ['kernel_shapes', 'h_dim']}
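The extend-and-overwrite behaviour and the dependency check can be illustrated with plain dicts. This is a sketch under assumptions: merge_parameters and missing_parameters are hypothetical helpers, the 'spectral' key and the 'n_init' parameter are made up for the example, and the sketch assumes that dependencies are triggered by any string option value that appears as a key in the parameters dict.

```python
# Subset of the default parameters dict listed above.
DEFAULT_PARAMETERS = {
    'hac': ['linkage', 'n_clusters', 'metric'],
    'kmeans': ['n_clusters'],
    'dbscan': ['eps', 'min_samples', 'metric'],
}

def merge_parameters(custom=None):
    """Extend and/or overwrite the defaults with user-supplied entries."""
    merged = dict(DEFAULT_PARAMETERS)
    merged.update(custom or {})
    return merged

def missing_parameters(config, parameters):
    """List the required keys that a configuration does not provide."""
    missing = []
    for value in config.values():
        # Only string values can name an option with dependencies.
        if isinstance(value, str) and value in parameters:
            missing += [key for key in parameters[value] if key not in config]
    return missing

parameters = merge_parameters({
    'kmeans': ['n_clusters', 'n_init'],    # overwrites the default entry
    'spectral': ['n_clusters', 'gamma'],   # adds a new (hypothetical) entry
})
config = {'model': 'kmeans', 'n_clusters': 3}
print(missing_parameters(config, parameters))  # prints ['n_init']
```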
The definitions dict specifies custom transformation and clustering functions. Providing a definitions dict extends and/or overwrites items in the default definitions dict. Default function definitions are provided for all the keys in the default parameters dict.
Any custom function passed to the definitions dict should implement the following signature:
customfunc(params, X)
where params is a dict mapping parameter keys to values and X is the data matrix.
Feature transformations should return an ndarray of shape (n_observations, n_features), and distance computations should return a distance matrix: an ndarray of shape (n_observations, n_observations) with zeros on the main diagonal.
Custom functions implementing clustering algorithms should return an object with a labels_ attribute, an ndarray of shape (n_observations,) that contains the cluster index of each observation.
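Two custom functions following the customfunc(params, X) signature might look like the sketch below. The keys 'manhattan' and 'split', the 'threshold' parameter, and both function names are hypothetical; whether a clustering function receives the raw data or a precomputed distance matrix as X is not stated here, so the toy clustering simply assumes a feature matrix.

```python
from types import SimpleNamespace

import numpy as np

def manhattan_distances(params, X):
    """Custom distance computation: pairwise L1 distances, zero diagonal."""
    X = np.asarray(X, dtype=float)
    # Broadcasting yields an (n_observations, n_observations, n_features) array.
    return np.abs(X[:, None, :] - X[None, :, :]).sum(axis=-1)

def split_clustering(params, X):
    """Toy clustering: split observations on the first feature."""
    X = np.asarray(X, dtype=float)
    labels = (X[:, 0] > params['threshold']).astype(int)
    # Any object exposing a labels_ attribute satisfies the contract.
    return SimpleNamespace(labels_=labels)

definitions = {'manhattan': manhattan_distances, 'split': split_clustering}
parameters = {'split': ['threshold']}  # 'split' requires 'threshold'

X = np.array([[0.0, 1.0], [0.5, 1.5], [4.0, 0.0]])
D = definitions['manhattan']({}, X)
model = definitions['split']({'threshold': 2.0}, X)
```

Registering the required parameter under the same key in the parameters dict mirrors how the default definitions are paired with the default parameters.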
Define the distribution over clustering configurations in the options dict:
options = {'model': {'choices': ['hac', 'kmeans']},
           'linkage': 'single',
           'metric': 'l2',
           'n_clusters': {'irange': [2, 10]},
           'transformation': {'choices': [None, 'svd'], 'weights': [1, 2]},
           'n_components': {'choices': [5, 10]}}
Fit a consensus clustering model
from consensus import Consensus
consensus_model = Consensus(X, options, precompute='distances').fit()
Sample and evaluate configurations
from evaluation import Evaluator
results = Evaluator(X, options, precompute='distances').random_search()
See examples.py for more applications.
Implementations of an autoencoder with fully-connected layers and a convolutional autoencoder are provided in autoencoders.py.
Both models estimate low-dimensional codes by minimizing the reconstruction loss of encoding and decoding through a lower-dimensional bottleneck.