K-MeansClusteringPythonGoPharo
Serial and parallel implementations of the k-means clustering vector quantization method in Python and Go, with visualization in Pharo.
Problem description
k-Means clustering is an unsupervised machine learning algorithm that finds a fixed number (k) of clusters in a set of data. A cluster is a group of data points grouped together because of similarities in their input features. In k-means, each cluster is defined by a centroid, a point at the center of the cluster, and every data point belongs to the cluster whose centroid is nearest. So, k-means finds k centroids and assigns every data point to its closest centroid, with the aim of keeping clusters compact: the distance between points within a cluster is minimized, so that they form a compact ensemble, while the distance between different clusters is maximized.
Given a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, ..., Sk} so as to minimize the within-cluster sum of squares. Formally, the objective is to find:

arg min over S of Σᵢ₌₁ᵏ Σ_{x ∈ Sᵢ} ‖x − μᵢ‖², where μᵢ is the mean (centroid) of the points in Sᵢ.
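As a small illustration of the within-cluster sum of squares objective (not code from this project), the quantity can be computed directly with NumPy; the data points, labels, and centroids below are made up for the example:

```python
import numpy as np

def wcss(points, labels, centroids):
    """Within-cluster sum of squares: total squared distance from
    each point to the centroid of the cluster it is assigned to."""
    return sum(
        np.sum((points[labels == i] - c) ** 2)
        for i, c in enumerate(centroids)
    )

# Toy example: two clusters of two points each.
points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 0.5], [4.0, 4.5]])
print(wcss(points, labels, centroids))  # 1.0
```

k-Means never guarantees the global minimum of this objective; it only converges to a local one, which is why the choice of initial centers matters.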
Sequential approach
1. Cluster the data into k groups, where k is predefined
2. Select k points at random as cluster centers
3. Assign each object to its closest cluster center according to some distance function (for example, Euclidean distance)
4. Calculate the centroid, or mean, of all objects in each cluster
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds
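The steps above can be sketched in a few lines of NumPy (a minimal illustration of the procedure, not the implementation from this repository):

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain sequential k-means following the steps above."""
    rng = np.random.default_rng(seed)
    # Step 2: select k distinct points at random as initial centers.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop once assignments no longer change between rounds.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of the points assigned to it.
        for i in range(k):
            members = points[labels == i]
            if len(members) > 0:
                centroids[i] = members.mean(axis=0)
    return labels, centroids
```

Note the broadcast in step 3: the (n, 1, d) and (1, k, d) views produce an (n, k) distance matrix in one vectorized operation instead of a double loop.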
Finding the optimal solution to the k-means clustering problem for observations in d dimensions is:
- NP-hard in general Euclidean space (of d dimensions), even for two clusters
- NP-hard for a general number of clusters k, even in the plane
- exactly solvable in time O(n^(dk + 1)) if k and d (the dimension) are fixed, where n is the number of entities to be clustered
Parallel approach
The main motivation for the parallel approach is that k-means clustering performance degrades as the number of training examples grows and as a larger k is chosen.
The main objective of this project is to improve the performance of the k-means clustering algorithm by splitting the training examples into multiple partitions and then calculating distances and assigning clusters in parallel. Afterwards, the cluster assignments from each partition are combined to check whether any cluster changed. In iteration I, if clusters changed in iteration I - 1, the centroids need to be recalculated; otherwise, the algorithm is done.
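The partition-and-assign step described above could be sketched with Python's multiprocessing module as follows; this is a simplified illustration of the idea, not the project's actual code, and the function names are made up for the example:

```python
import numpy as np
from multiprocessing import Pool

def assign_chunk(args):
    """Assign each point in one partition to its nearest centroid."""
    chunk, centroids = args
    dists = np.linalg.norm(chunk[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def parallel_assign(points, centroids, n_tasks):
    """Split the data into n_tasks partitions, compute assignments
    in parallel, then combine them back in the original order."""
    chunks = np.array_split(points, n_tasks)
    with Pool(n_tasks) as pool:
        parts = pool.map(assign_chunk, [(c, centroids) for c in chunks])
    return np.concatenate(parts)
```

Only the assignment step is embarrassingly parallel; the convergence check and the centroid update still need the combined labels, which is why each iteration ends with a synchronization point.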
The whole process is shown in the following diagram (generated with Flowchart Maker):
Programs & libraries needed to run this project
Python:
- NumPy : Fundamental package for scientific computing with Python
- Pandas : Software library written for data manipulation and analysis
Go(lang):
- GoNum : Set of packages designed to make writing numerical and scientific algorithms productive, performant, and scalable
- Gota : DataFrames, Series and data wrangling methods for the Go programming language
How to run?
Python:
The program kMeansClustering.py can be run as follows: python kMeansClustering.py DATA_CSV_PATH SEQUENTIAL_RESULTS_PATH PARALLEL_RESULTS_PATH SEPARATOR NUMBER_OF_TASKS
, where:
- DATA_CSV_PATH is the path of the .csv file to read as a Pandas DataFrame object
- SEQUENTIAL_RESULTS_PATH is the path where sequential results for visualization with Pharo will be saved
- PARALLEL_RESULTS_PATH is the path where parallel results for visualization with Pharo will be saved
- SEPARATOR is the character that separates columns in the .csv file
- NUMBER_OF_TASKS defines the number of tasks / processes for parallel clustering
In experiments.py
the user can uncomment sequential_clustering_experiment(), weak_scaling() or strong_scaling() to try the experiments out and parametrize them.
Go(lang):
The program kmeans.go can be run as follows: go run kmeans.go DATA_CSV_PATH SEQUENTIAL_RESULTS_PATH PARALLEL_RESULTS_PATH NUMBER_OF_TASKS
, where there is no SEPARATOR
argument because Gota detects the separator automatically.
By uncommenting sequentialClusteringExperiment(), weakScaling() or strongScaling(), the user can try the experiments out and parametrize them.
Results
A detailed report can be seen here: Report.pdf
For k = 3:
For k = 4: