balanced-clustering

Assessing clustering performance in imbalanced data contexts

Class imbalance is prevalent across real-world datasets, including images, natural language, and biological data. In unsupervised learning, clustering performance is often assessed with respect to a ground-truth set of labels using metrics such as the Adjusted Rand Index (ARI). Akin to the issue in classification when using overall accuracy, clustering metrics fail to capture information about class imbalance. imbalanced-clustering presents balanced clustering metrics, that take into account class imbalance and reweigh the results accordingly. Combined with vanilla clustering metrics (https://scikit-learn.org/stable/modules/clustering.html), imbalanced-clustering offers a more complete perspective on clustering and related tasks.

Installation via pip

pip install balanced-clustering

Usage

Currently, there are five balanced metrics that can be used - Balanced Adjusted Rand Index, Balanced Adjusted Mutual Information, Balanced Homogeneity, Balanced Completeness, Balanced V-measure.

import numpy as np
from imbalanced_clustering import balanced_adjusted_rand_index

labels_sim = np.random.choice(["A","B","C"], 1000, replace=True)
cluster_sim = np.random.choice([1,2,3,4,5], 1000, replace=True)
balanced_adjusted_rand_index(labels_sim, cluster_sim)

Notebooks

For more details on the implementation of the balanced clustering metrics, mathematical formalism, and specific use-cases, please see Walkthrough notebook #1. Extended experiments on interpolation behavior between the balanced and vanilla imbalanced metrics can be found in Walkthrough notebook #2.

Please note that Extended Documentation is currently a work in progress

Issues/bugs

If any issues occur in either installation or usage, please open them and include a reproducible example.

adamgayoso / balanced-clustering