derrickburns / generalized-kmeans-clustering

Spark library for generalized K-Means clustering. Supports general Bregman divergences. Suitable for clustering probabilistic data, time series data, high dimensional data, and very large data.

Home Page: https://generalized-kmeans-clustering.massivedatascience.com/

Generalized K-Means Clustering

This project generalizes the Spark MLlib batch K-Means clusterer (v1.1.0) and the Spark MLlib streaming K-Means clusterer (v1.2.0). Most practical variants of K-means clustering are either already implemented or can be implemented with this package.

If you find a novel variant of k-means clustering that is provably superior in some manner, implement it using the package and send a pull request along with the paper analyzing the variant!

This code has been tested on data sets of tens of millions of points in a 700+ dimensional space using a variety of distance functions. Thanks to the excellent core Spark implementation, it rocks!
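For readers unfamiliar with the term, the distance functions mentioned above belong to the family of Bregman divergences. As a reminder, here is the standard textbook definition for a strictly convex, differentiable function F, along with two common special cases. This is general mathematics, not notation or an API taken from this package.

```latex
% Bregman divergence generated by a strictly convex, differentiable F
D_F(p, q) = F(p) - F(q) - \langle \nabla F(q),\, p - q \rangle

% Squared Euclidean distance: F(x) = \|x\|^2
D_F(p, q) = \|p - q\|^2

% Generalized Kullback-Leibler divergence: F(x) = \sum_i x_i \log x_i
D_F(p, q) = \sum_i p_i \log \frac{p_i}{q_i} - \sum_i p_i + \sum_i q_i
```

For any Bregman divergence, the centroid c that minimizes the total divergence \sum_i D_F(x_i, c) over the points of a cluster is their arithmetic mean, which is why the usual K-means update step carries over unchanged.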

About

License: Apache License 2.0


Languages

HTML 93.4%, Scala 4.6%, JavaScript 1.0%, CSS 1.0%