NikolaZubic / K-MeansClusteringPythonGoPharo

Serial and parallel implementation of k-means clustering vector quantization method in Python and Go, and visualization in Pharo.

Problem description

k-means clustering is an unsupervised machine learning algorithm that finds a fixed number (k) of clusters in a set of data. A cluster is a group of data points that are grouped together because of similarities in their input features. In k-means, a cluster is defined by a centroid, the point at the center of the cluster, and every data point belongs to the cluster whose centroid is closest to it. So, k-means finds k centroids and assigns each data point to the closest one, aiming to keep every cluster compact (minimizing the distances between points within the same cluster) while keeping different clusters well separated (maximizing the distances between clusters).

Given a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, ..., Sk} so as to minimize the within-cluster sum of squares. Formally, the objective is to find:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

where μi is the mean of points in Si. This is equivalent to minimizing the pairwise squared deviations of points in the same cluster:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \frac{1}{2\,\lvert S_i \rvert} \sum_{x,\, y \in S_i} \lVert x - y \rVert^2$$

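As a concrete illustration (not code from this repository), the following NumPy snippet computes the within-cluster sum of squares for a toy data set and a made-up cluster assignment:

```python
import numpy as np

# Toy data: six 2-dimensional points with a hypothetical assignment to k = 2 clusters.
points = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
                   [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]])
labels = np.array([0, 0, 0, 1, 1, 1])

def within_cluster_sum_of_squares(points, labels):
    """Sum of squared distances of every point to the centroid of its cluster."""
    wcss = 0.0
    for cluster_id in np.unique(labels):
        members = points[labels == cluster_id]
        centroid = members.mean(axis=0)            # mu_i, the mean of points in S_i
        wcss += ((members - centroid) ** 2).sum()  # sum of ||x - mu_i||^2 over S_i
    return wcss

print(within_cluster_sum_of_squares(points, labels))
```
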
Sequential approach

  1. Cluster the data into k groups, where k is predefined
  2. Select k points at random as the initial cluster centers
  3. Assign each object to its closest cluster center according to some distance function (for example, Euclidean distance)
  4. Recalculate the centroid, i.e. the mean of all objects in each cluster
  5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds

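A minimal NumPy sketch of these steps could look like the following; it is an illustrative re-implementation using Euclidean distance, not the actual code in kMeansClustering.py:

```python
import numpy as np

def k_means_sequential(points, k, max_iterations=100, seed=0):
    """Plain sequential k-means: random initialization, then alternating
    assignment and centroid recalculation until assignments stop changing."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: select k points at random as the initial cluster centers.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)

    for _ in range(max_iterations):
        # Step 3: assign every point to its closest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)

        # Step 5: stop once the assignments no longer change between rounds.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Step 4: recalculate each centroid as the mean of its assigned points.
        for i in range(k):
            members = points[labels == i]
            if len(members) > 0:
                centroids[i] = members.mean(axis=0)

    return labels, centroids
```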

Finding the optimal solution to the k-means clustering problem for observations in d dimensions is:

  • NP-hard in general Euclidean space (of d dimensions) even for two clusters
  • NP-hard for a general number of clusters k even in the plane
  • If k and d (the dimension) are fixed, the problem can be solved exactly in time O(n^(dk + 1)), where n is the number of entities to be clustered

Parallel approach

The main motivation for the parallel approach is that the performance of k-means clustering degrades as the number of training examples grows and as we choose a larger k.
The main objective of this project is to improve the performance of the k-means clustering algorithm by splitting the training examples into multiple partitions and computing distances and cluster assignments for each partition in parallel. The cluster assignments from all partitions are then combined to check whether any assignment changed. In iteration I, if the assignments changed in iteration I - 1, the centroids are recalculated; otherwise the algorithm terminates.
The whole process is shown in the following diagram (generated with Flowchart Maker):

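A simplified sketch of the parallel assignment step, using Python's multiprocessing.Pool to process the partitions concurrently (function and parameter names here are illustrative, not the ones used in this repository):

```python
import numpy as np
from multiprocessing import Pool

def assign_partition(args):
    """Assign every point in one partition to its nearest centroid."""
    partition, centroids = args
    distances = np.linalg.norm(partition[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def parallel_assignment(points, centroids, number_of_tasks):
    """Split the points into partitions, assign clusters in each partition
    in parallel, and combine the per-partition labels into one array."""
    partitions = np.array_split(points, number_of_tasks)
    with Pool(processes=number_of_tasks) as pool:
        partial_labels = pool.map(assign_partition,
                                  [(part, centroids) for part in partitions])
    return np.concatenate(partial_labels)
```

The surrounding loop would then compare the combined labels with those of the previous iteration and recalculate the centroids only if any assignment changed, exactly as described above.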

Useful reference: https://cse.buffalo.edu/faculty/miller/Courses/CSE633/Chandramohan-Fall-2012-CSE633.pdf

Programs & libraries needed in order to run this project

Python:

  • NumPy : Fundamental package for scientific computing with Python
  • Pandas : Software library written for data manipulation and analysis

Go(lang):

  • GoNum : Set of packages designed to make writing numerical and scientific algorithms productive, performant, and scalable
  • Gota : DataFrames, Series and data wrangling methods for the Go programming language

How to run?

Python:
The kMeansClustering.py program can be run as follows: python kMeansClustering.py DATA_CSV_PATH SEQUENTIAL_RESULTS_PATH PARALLEL_RESULTS_PATH SEPARATOR NUMBER_OF_TASKS, where:

  • DATA_CSV_PATH is the path of the .csv file to read as a Pandas DataFrame object
  • SEQUENTIAL_RESULTS_PATH is the path where the sequential results for visualization with Pharo will be saved
  • PARALLEL_RESULTS_PATH is the path where the parallel results for visualization with Pharo will be saved
  • SEPARATOR is the character that separates the columns in the .csv file
  • NUMBER_OF_TASKS defines the number of tasks / processes for parallel clustering

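An example invocation might look like this (the paths, separator and task count are hypothetical):

```
python kMeansClustering.py data/points.csv results/sequential.csv results/parallel.csv , 4
```
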
In experiments.py, the user can uncomment sequential_clustering_experiment(), weak_scaling() or strong_scaling() to try the experiments out and parametrize them.

Go(lang):
The kmeans.go program can be run as follows: go run kmeans.go DATA_CSV_PATH SEQUENTIAL_RESULTS_PATH PARALLEL_RESULTS_PATH NUMBER_OF_TASKS. There is no SEPARATOR argument because Gota recognizes column separators automatically.
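An example invocation might look like this (again with hypothetical paths and task count):

```
go run kmeans.go data/points.csv results/sequential.csv results/parallel.csv 4
```
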
By uncommenting sequentialClusteringExperiment(), weakScaling() or strongScaling(), the user can try the experiments out and parametrize them.

Results

A detailed report can be seen here: Report.pdf

For k = 3: (preview image of the clustering visualization)
For k = 4: (preview image of the clustering visualization)
