K-means clustering implementation. Input files must be in .arff format so that results can be compared directly with Weka. The majority of testing was done on the iris.arff dataset, which had a few advantages. Each instance of data (transaction) has only 4 attributes, so the algorithm was only ever dealing with 4 dimensions, which made it much easier to follow what was going on. Further, the dataset is easily separable on any 3-axis combination of the 4 explanatory variables, so we could plot the clusters and check visually that they were converging where we expected.
Documents Folder
: contains a PDF of the report.
Data Folder
: contains all .arff data sets used in the testing of this implementation.
src Folder
: contains all code for this assignment, split into generateKClusters.py and kmeans.py. The kmeans.py file contains all logic related to the actual k-means algorithm, while generateKClusters.py contains the functions needed to parse command line arguments, run the algorithm, and plot the results.
Note that a few Python packages used in this implementation prevent the program from being executable on the CSE server. However, we have included a file called requirements.txt, which will allow you to batch install all of the required dependencies for this implementation via pip. All you have to do is run the following command:
pip3 install -r requirements.txt
There are 4 main inputs for the program, which are entered via CLI arguments:
-f
: represents the input data file to be parsed and read in
-k
: represents the number of clusters that the clustering algorithm should try to find
-e
: represents a threshold such that if the change in the sum of the distances from the cluster centers falls below this value, the program will terminate
-i
: represents the number of iterations to run before terminating if the other terminating conditions are not met
-s
: [Optional] represents the seed value used with random() to force pseudo-random functions to behave deterministically. This helps to ensure repeatability. If not specified, the k-means clustering algorithm will use a default value of 10.
-n
: [Optional] represents a boolean flag that tells the program whether it should normalize the data set before running the k-means algorithm.
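The description above does not say which normalization -n applies. As one plausible reading, the sketch below assumes min-max scaling of each attribute to the range [0, 1]; the function name min_max_normalize is hypothetical and is not taken from kmeans.py.

```python
def min_max_normalize(rows):
    """Scale every column of `rows` to [0, 1] (assumed meaning of -n)."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    normalized = []
    for row in rows:
        normalized.append([
            # A constant column has no spread, so map it to 0.0
            # instead of dividing by zero.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return normalized


sample = [[1.0, 10.0], [3.0, 10.0], [2.0, 10.0]]
scaled = min_max_normalize(sample)
```

Scaling like this keeps an attribute with a large numeric range from dominating the distance calculation, which is the usual reason to normalize before k-means.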
The program outputs information about the clustering results from k-means. This information includes initial centroid coordinates, runtime, resulting cluster attributes, and total cluster membership.
The program can be started by running the following command, whose arguments correspond to the input parameters listed above:
python generateKClusters.py -f <input_file> -k <num_clusters> -e <epsilon> -i <max_iterations> -s 1
Or, if you don't specify the -s CLI argument, which sets the seed value, you would use the following command:
python generateKClusters.py -f <input_file> -k <num_clusters> -e <epsilon> -i <max_iterations>
If your default version of Python is Python 2.x, you will need to specify python3 on the command line. Otherwise, running python will default to Python 3.x.
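As a rough illustration of how the flags described above could be declared, here is a minimal argparse sketch. Only the flag letters and the default seed of 10 come from this README; the dest names and the choice to treat -n as a store_true switch are assumptions for illustration and may differ from what generateKClusters.py actually does.

```python
import argparse


def build_parser():
    """Build a parser for the flags described in this README (sketch only)."""
    parser = argparse.ArgumentParser(description="k-means clustering on an .arff file")
    parser.add_argument("-f", dest="input_file", required=True,
                        help="input .arff data file")
    parser.add_argument("-k", dest="num_clusters", type=int, required=True,
                        help="number of clusters to find")
    parser.add_argument("-e", dest="epsilon", type=float, required=True,
                        help="convergence threshold")
    parser.add_argument("-i", dest="max_iterations", type=int, required=True,
                        help="maximum number of iterations")
    # Optional: the seed falls back to 10 when -s is omitted.
    parser.add_argument("-s", dest="seed", type=int, default=10,
                        help="seed for the pseudo-random generator")
    # Optional: assumed to be a simple on/off switch.
    parser.add_argument("-n", dest="normalize", action="store_true",
                        help="normalize the data set before clustering")
    return parser


# Parse an example command line without -s or -n.
args = build_parser().parse_args(
    ["-f", "iris.arff", "-k", "3", "-e", "0.001", "-i", "100"]
)
```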
A number of additional functions were written to assist in plotting and printing this algorithm's results. However, none are directly called by the current main() function in generateKClusters.py.
• Assume that all the attributes are continuous variables.
• Your program must allow the number of clusters (k) to be specified as input.
• Your program must allow the epsilon (change in the sum of the distances from the cluster centers) to be specified as input.
• Your program must allow the number of iterations to be specified as input.
The program will stop if either of the following conditions holds:
- The maximum number of iterations is reached
- The change in the total sum of the squares of the distances (SSD) falls below epsilon
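The two stopping conditions above can be sketched as a plain-Python loop. This is a minimal illustration under stated assumptions, not the code in kmeans.py: the function name kmeans_sketch, the choice of initial centroids by random sampling of the data points, and the handling of empty clusters (keeping the old centroid) are all hypothetical.

```python
import random


def kmeans_sketch(points, k, epsilon, max_iterations, seed=10):
    """Run k-means until SSD changes by less than epsilon or iterations run out."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    centroids = [list(p) for p in rng.sample(points, k)]
    previous_ssd = None
    iterations = 0
    clusters = []
    ssd = 0.0
    for iterations in range(1, max_iterations + 1):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        ssd = 0.0
        for point in points:
            squared = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
            nearest = squared.index(min(squared))
            clusters[nearest].append(point)
            ssd += squared[nearest]
        # Update step: move each centroid to the mean of its members.
        for index, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[index] = [sum(col) / len(members) for col in zip(*members)]
        # Stopping condition 2: change in SSD fell below epsilon.
        if previous_ssd is not None and abs(previous_ssd - ssd) < epsilon:
            break
        previous_ssd = ssd
    # Stopping condition 1 is the loop bound itself.
    return centroids, clusters, ssd, iterations


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters, ssd, iterations = kmeans_sketch(data, k=2, epsilon=1e-6,
                                                     max_iterations=100)
```

Note that the SSD here is measured against the centroids from the start of each iteration, so the loop needs at least two passes before the epsilon check can trigger.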