LTLA / CppKmeans

C++ port of R's Hartigan-Wong implementation

Home Page:https://ltla.github.io/CppKmeans/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

C++ library for k-means

Unit tests Documentation stats comparison Codecov

Overview

This repository contains a header-only C++ library for k-means clustering. Initialization can be performed with user-supplied centers, random selection of points, weighted sampling with kmeans++ (Arthur and Vassilvitskii, 2007) or PCA partitioning (Su and Dy, 2007). Refinement can be performed using the Hartigan-Wong approach or Lloyd's algorithm. The Hartigan-Wong implementation is derived from the Fortran code in the R stats package, heavily refactored for more idiomatic C++.

Quick start

kmeans is a header-only library, so it can be easily used by just #includeing the relevant source files:

#include "kmeans/Kmeans.hpp"

// assorted boilerplate here...

// Wrap your matrix in a SimpleMatrix.
kmeans::SimpleMatrix kmat(ndim, nobs, matrix.data());

auto res = kmeans::compute(
    kmat,
    kmeans::InitializeKmeanspp(), // initialize with kmeans++
    kmeans::RefineLloyd(), // refine with Lloyd's algorithm
    ncenters 
);

res.centers; // Matrix of centroid coordinates, stored in column-major format
res.clusters; // Vector of cluster assignments
res.details; // Details from the clustering algorithm

// Compute the WCSS if we want it:
std::vector<double> wcss(ncenters);
kmeans::compute_wcss(
    kmat, 
    ncenters, 
    res.centers.data(), 
    res.clusters.data(), 
    wcss.data()
);

If we already allocated arrays for the centroids and clusters, we can fill the arrays directly:

std::vector<double> centers(ndim * ncenters);
std::vector<int> clusters(nobs);

auto deets = kmeans::compute(
    kmat,
    kmeans::InitializeRandom(), // random initialization
    kmeans::RefineHartiganWong(), // refine with Hartigan-Wong 
    ncenters 
    centers.data(),
    clusters.data()
);

We can tune the clustering by passing options into the constructors of the relevant classes:

kmeans::InitializePcaPartitionOptions pp_opt;
pp_opt.power_iteration_options.iterations = 200;
pp_opt.seed = 42;
kmeans::InitializePcaPartitions pp(pp_opt);

kmeans::RefineLloydOptions ll_opt;
ll_opt.max_iterations = 10;
ll_opt.num_threads = 3;
kmeans::RefineLloyd ll(ll_opt);

auto res2 = kmeans::compute(kmat, pp, ll, ncenters);

See the reference documentation for more details.

Building projects

CMake with FetchContent

If you're using CMake, you just need to add something like this to your CMakeLists.txt:

include(FetchContent)

FetchContent_Declare(
  kmeans 
  GIT_REPOSITORY https://github.com/LTLA/CppKmeans
  GIT_TAG master # or any version of interest
)

FetchContent_MakeAvailable(kmeans)

Then you can link to kmeans to make the headers available during compilation:

# For executables:
target_link_libraries(myexe ltla::kmeans)

# For libaries
target_link_libraries(mylib INTERFACE ltla::kmeans)

CMake with find_package()

To install the library, clone an appropriate version of this repository and run:

mkdir build && cd build
cmake .. -DKMEANS_TESTS=OFF
cmake --build . --target install

Then we can use find_package() as usual:

find_package(ltla_kmeans CONFIG REQUIRED)
target_link_libraries(mylib INTERFACE ltla::kmeans)

By default, this will use FetchContent to fetch all external dependencies (see extern/CMakeLists.txt for a list). If you want to install them manually, use -DKMEANS_FETCH_EXTERN=OFF.

Manual

If you're not using CMake, the simple approach is to just copy the files in include/ - either directly or with Git submodules - and include their path during compilation with, e.g., GCC's -I. This requires the external dependencies listed in extern/CMakeLists.txt, which also need to be made available during compilation.

References

Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics 28, 100-108.

Arthur, D. and Vassilvitskii, S. (2007). k-means++: the advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027-1035.

Su, T. and Dy, J. G. (2007). In Search of Deterministic Methods for Initializing K-Means and Gaussian Mixture Clustering, Intelligent Data Analysis 11, 319-338.

About

C++ port of R's Hartigan-Wong implementation

https://ltla.github.io/CppKmeans/

License:MIT License


Languages

Language:C++ 95.8%Language:CMake 2.6%Language:R 1.6%