K-Means Clustering with Hadoop MapReduce

This project implements the K-Means clustering algorithm using Hadoop MapReduce. K-Means is a popular unsupervised machine learning algorithm used for clustering data points into K clusters.

Overview

The K-Means algorithm works by iteratively assigning each data point to its nearest centroid and then recomputing each centroid from the points assigned to it. The algorithm converges when the centroids no longer change significantly. This project uses Hadoop MapReduce to distribute the computation so that it can handle large-scale datasets.
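To make the assign-and-update loop concrete, here is a minimal single-machine sketch in plain Java, with no Hadoop dependency. It is not the project's code: the class name, the method names, the sample data, and the choice of "largest centroid movement" as the convergence test are all assumptions made for illustration. The assignment step corresponds roughly to what a mapper would do, and the averaging step to what a reducer would do.

```java
import java.util.Arrays;

// Illustrative sketch only; names and stopping rule are assumptions, not the project's API.
public class KMeansIterationSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Assignment step (map-like): index of the centroid closest to the point.
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = squaredDistance(point, centroids[c]);
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // Update step (reduce-like): each new centroid is the mean of its assigned points.
    // A centroid with no assigned points is kept unchanged (an assumed policy).
    static double[][] updateCentroids(double[][] points, double[][] centroids) {
        int k = centroids.length;
        int d = centroids[0].length;
        double[][] sums = new double[k][d];
        int[] counts = new int[k];
        for (double[] p : points) {
            int c = nearestCentroid(p, centroids);
            counts[c]++;
            for (int i = 0; i < d; i++) {
                sums[c][i] += p[i];
            }
        }
        double[][] updated = new double[k][];
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) {
                updated[c] = centroids[c].clone();
            } else {
                updated[c] = new double[d];
                for (int i = 0; i < d; i++) {
                    updated[c][i] = sums[c][i] / counts[c];
                }
            }
        }
        return updated;
    }

    // Largest Euclidean distance moved by any centroid between two iterations.
    static double maxShift(double[][] before, double[][] after) {
        double max = 0.0;
        for (int c = 0; c < before.length; c++) {
            max = Math.max(max, Math.sqrt(squaredDistance(before[c], after[c])));
        }
        return max;
    }

    public static void main(String[] args) {
        double[][] points = {{1.0, 1.0}, {1.0, 3.0}, {9.0, 9.0}, {11.0, 9.0}};
        double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}};
        double threshold = 1e-4; // assumed value for the stopping criterion
        int iterations = 0;
        double shift;
        do {
            double[][] updated = updateCentroids(points, centroids);
            shift = maxShift(centroids, updated);
            centroids = updated;
            iterations++;
        } while (shift > threshold);
        System.out.println(iterations + " iterations: " + Arrays.deepToString(centroids));
    }
}
```

In a MapReduce setting, each pass of the loop above would typically be one job: the current centroids are shared with every mapper, and the driver decides whether to launch another job based on how far the centroids moved.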

Usage

To run the K-Means clustering with Hadoop MapReduce, follow the steps below:

1. Ensure that Hadoop is installed and properly configured on your system.
2. Compile the project using the provided Makefile or build script.
3. Prepare your input data. The input should be a text file in which each line represents one data point, with its coordinates separated by commas.
4. Run the following command to execute the K-Means algorithm:


hadoop jar kmeans.jar it.unipi.hadoop.Main <input> <output> <k> <d> [threshold]

Replace <input> with the path to your input data file, <output> with the desired output directory, <k> with the number of clusters (centroids) to generate, and <d> with the dimensionality of the points. The final threshold argument is optional and sets the threshold used by the stopping criterion.
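As an illustration of how these positional arguments might be read and validated, the following sketch parses the four required values and the optional threshold. It is not taken from it.unipi.hadoop.Main: the class name, the validation rules, and in particular the fallback threshold value are hypothetical, since the project's actual default is not stated here.

```java
// Illustrative sketch only; not the project's Main class.
public class KMeansArgsSketch {

    // Hypothetical fallback used when no threshold is given on the command line.
    static final double ASSUMED_DEFAULT_THRESHOLD = 1e-4;

    final String input;     // path to the input data file
    final String output;    // output directory
    final int k;            // number of clusters
    final int d;            // dimensionality of each point
    final double threshold; // stopping-criterion threshold

    KMeansArgsSketch(String[] args) {
        if (args.length < 4 || args.length > 5) {
            throw new IllegalArgumentException(
                "Usage: <input> <output> <k> <d> [threshold]");
        }
        input = args[0];
        output = args[1];
        k = Integer.parseInt(args[2]);
        d = Integer.parseInt(args[3]);
        threshold = (args.length == 5)
            ? Double.parseDouble(args[4])
            : ASSUMED_DEFAULT_THRESHOLD;
        if (k <= 0 || d <= 0 || threshold < 0.0) {
            throw new IllegalArgumentException(
                "k and d must be positive and threshold must be non-negative");
        }
    }

    public static void main(String[] args) {
        // Example invocation without the optional threshold.
        KMeansArgsSketch parsed =
            new KMeansArgsSketch(new String[] {"points.txt", "out", "3", "2"});
        System.out.println("k=" + parsed.k + ", d=" + parsed.d
            + ", threshold=" + parsed.threshold);
    }
}
```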

Wait for the execution to complete. The output file (log_distances.txt) will be stored in the local output directory and will record the evolution of the algorithm, including the elapsed time. Additionally, if you provided fewer than 1000 points in 2 dimensions, the plots folder will be populated with graphs representing each iteration.

Customization

This project allows customization of the K-Means algorithm. You can modify the MapReduce implementation or adjust the parameters to fit your specific needs. Additionally, you can extend the Point class provided in the project to add more functionality or handle data with different dimensions.
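For orientation, here is a small stand-alone sketch of what a point type for comma-separated input could look like. It is deliberately named SketchPoint to make clear that it is hypothetical: it does not reproduce the fields or methods of the project's Point class, and the parsing and validation choices are assumptions.

```java
// Hypothetical stand-in for illustration; not the project's Point class.
public class SketchPoint {

    private final double[] coords;

    SketchPoint(double[] coords) {
        this.coords = coords.clone(); // defensive copy
    }

    // Parses one input line such as "1.5,2.0,3" into a point,
    // rejecting lines whose coordinate count differs from the expected dimension.
    static SketchPoint parse(String line, int expectedDim) {
        String[] tokens = line.trim().split(",");
        if (tokens.length != expectedDim) {
            throw new IllegalArgumentException(
                "Expected " + expectedDim + " coordinates but found " + tokens.length);
        }
        double[] values = new double[expectedDim];
        for (int i = 0; i < expectedDim; i++) {
            values[i] = Double.parseDouble(tokens[i].trim());
        }
        return new SketchPoint(values);
    }

    int dimension() {
        return coords.length;
    }

    double get(int i) {
        return coords[i];
    }

    // Euclidean distance to another point of the same dimension.
    double distanceTo(SketchPoint other) {
        if (other.coords.length != coords.length) {
            throw new IllegalArgumentException("Dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < coords.length; i++) {
            double diff = coords[i] - other.coords[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        SketchPoint a = SketchPoint.parse("0,0", 2);
        SketchPoint b = SketchPoint.parse("3, 4", 2);
        System.out.println(a.distanceTo(b));
    }
}
```

A different distance metric or additional per-point metadata could be added in a subclass or variant of such a type; in an actual Hadoop job the point type would also need to be serializable between the map and reduce phases.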

giuliocapecchi / Hadoop-KMeans
