urschrei / ckmeans_py

A Python package for fast 1D clustering using a Rust binary

Home Page:https://pypi.org/project/ckmeans/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

CKmeans: Optimal Univariate Clustering

Ckmeans clustering is an improvement on 1-dimensional (univariate) heuristic-based clustering approaches such as Jenks. The algorithm was developed by Haizhou Wang and Mingzhou Song (2011) as a dynamic programming approach to the problem of clustering numeric data into groups with the least within-group sum-of-squared-deviations.

Minimizing the difference within groups – what Wang & Song refer to as withinss, or within sum-of-squares – means that groups are optimally homogenous within and the data is split into representative groups. This is very useful for visualization, where one may wish to represent a continuous variable in discrete colour or style groups. This function can provide groups that emphasize differences between data.

Being a dynamic approach, this algorithm is based on two matrices that store incrementally-computed values for squared deviations and backtracking indexes.

Unlike the original implementation, this implementation does not include any code to automatically determine the optimal number of clusters: this information needs to be explicitly provided. It does provide the roundbreaks method to aid labelling, however.

Implementation

This library uses the ckmeans Rust crate, by the same author.

Benchmarks

Install optional dependencies, then run benchmark.py.

ckmeans-1d-dp is about 20 % faster, but note that it only returns indices identifying each cluster to which the input belongs; if you actually want to cluster your data, you need to do that yourself which I strongly suspect might be slower overall. On the other hand, if all you want is indices it may be a better choice.

Example

from ckmeans import ckmeans
import numpy as np


data = np.array([1.0, 2.0, 3.0, 4.0, 100.0, 101.0, 102.0, 103.0])
clusters = 2
result = ckmeans(data, clusters)
assert result == [
    np.array([1.0, 2.0, 3.0, 4.0]),
    np.array([100.0, 101.0, 102.0, 103.0])
]

About

A Python package for fast 1D clustering using a Rust binary

https://pypi.org/project/ckmeans/

License:MIT License


Languages

Language:Rust 63.7%Language:Python 36.3%