Canopy Clustering using MapReduce in Hadoop
===========================================

Files Included:

Gen.py

DataPoint.py

Stage 1: Canopy Center

mapperStg1.py

reducerStg1.py

Stage 2: Canopy Assign

mapperStg2.py

reducerStg2.py

Stage 3: Cluster Center

mapperStg3.py

reducerStg3.py

Stage 4: Cluster Assign

mapperStg4.py

reducerStg4.py

*The function of each file will be documented at a later date.

Description of the files:
=========================

Gen.py

-> Generates the data set on which canopy clustering is run.

-> Generates a set of k centroids.
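
As a rough illustration of what a generator of this kind might do, the sketch below draws random 2-D points and samples k of them as initial centroids. The function names, the 2-D point format, the value range, and the seed are all assumptions made for the example; they are not taken from Gen.py.

```python
import random


def generate_points(n, low=0.0, high=100.0, seed=0):
    """Return n random 2-D points as (x, y) tuples (hypothetical format)."""
    rng = random.Random(seed)
    return [(rng.uniform(low, high), rng.uniform(low, high)) for _ in range(n)]


def pick_centroids(points, k, seed=0):
    """Sample k points from the data set, without replacement, as initial centroids."""
    rng = random.Random(seed)
    return rng.sample(points, k)


points = generate_points(1000)
centroids = pick_centroids(points, k=5)
```

Fixing the seed keeps the data set and the centroids reproducible between runs, which helps when comparing the output of the later stages.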

DataPoint.py

-> DataPoint class.

Stage 1: Canopy Center

Mapper:

Input: Data points.

Output: List of Canopy Centers.

Function:

Reducer:

Input: Canopy Centers

Output: Canopy Centers

Function:
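
The functions for this stage are not documented yet. As an illustration only, a common way to pick canopy centers in MapReduce is a greedy pass with a distance threshold: each mapper keeps a point as a center when it is far enough from the centers chosen so far, and the reducer repeats the same pass over the candidates from all mappers. The threshold name `t2`, the tuple point format, and the function names below are assumptions, not the contents of mapperStg1.py or reducerStg1.py.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_canopy_centers(points, t2):
    """Greedy pass: a point becomes a new center when it lies farther
    than t2 from every center chosen so far."""
    centers = []
    for p in points:
        if all(distance(p, c) > t2 for c in centers):
            centers.append(p)
    return centers


# Hypothetical two-mapper run: each mapper sees one split of the data.
split_a = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
split_b = [(0.5, 0.0), (20.0, 0.0)]
mapper_out = (select_canopy_centers(split_a, t2=3.0)
              + select_canopy_centers(split_b, t2=3.0))

# The reducer repeats the same pass over the mappers' candidate centers.
canopy_centers = select_canopy_centers(mapper_out, t2=3.0)
# canopy_centers == [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
```

In this toy run the candidate (0.5, 0.0) from the second split is dropped by the reducer because it lies within `t2` of (0.0, 0.0), which is why the reducer's input and output are both lists of canopy centers.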

Stage 2: Canopy Assign

Mapper:

Input: Canopy Centers

Output: Canopy Centers and the Data Points that belong to each.

Function:

Reducer:

Input: Canopy Centers, Data Points (stdin)

Output: Identity

Function: Echoes the Mapper output unchanged.
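
A minimal sketch of how this stage could work, assuming points arrive as comma-separated lines and that a point belongs to every canopy whose center lies within a loose threshold `t1`. The record layout (`center<TAB>point`), the threshold name, and the helper names are assumptions for illustration, not the actual format used by mapperStg2.py.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def parse_point(text):
    """Parse a comma-separated line such as '1.0,2.5' into a tuple of floats."""
    return tuple(float(v) for v in text.strip().split(","))


def format_point(p):
    """Render a point back into the comma-separated form."""
    return ",".join(str(v) for v in p)


def assign_to_canopies(lines, centers, t1):
    """Yield 'center<TAB>point' records for every center within t1 of a point.
    A point may fall into several overlapping canopies."""
    for line in lines:
        if not line.strip():
            continue
        p = parse_point(line)
        for c in centers:
            if distance(p, c) <= t1:
                yield f"{format_point(c)}\t{format_point(p)}"


def identity_reducer(lines):
    """Pass mapper records through unchanged."""
    yield from lines


centers = [(0.0, 0.0), (10.0, 0.0)]
data = ["1.0,0.0\n", "5.0,0.0\n", "30.0,0.0\n"]
records = list(assign_to_canopies(data, centers, t1=6.0))
# records == ["0.0,0.0\t1.0,0.0", "0.0,0.0\t5.0,0.0", "10.0,0.0\t5.0,0.0"]
```

The point (5.0, 0.0) appears under both centers because canopies are allowed to overlap; the point (30.0, 0.0) is outside both canopies in this toy input and produces no record.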

Stage 3: Cluster Center

Mapper:

Input:

-> List of 'k' Centroids

-> List of Canopy Centers

-> Canopy Centers, Data Points (stdin)

Output: K Centroids and the Data Points that belong to each.

Function:
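
One plausible reading of the inputs listed above is that the canopies are used to cut down distance computations: a data point is compared only against centroids that share its canopy. The sketch below illustrates that idea under stated assumptions (the threshold `t1`, the tuple point format, and the fallback to all centroids are hypothetical); it is not a description of mapperStg3.py.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centroid(point, canopy_center, centroids, t1):
    """Return the centroid closest to point, checking only centroids that
    lie within t1 of the point's canopy center. Falls back to all
    centroids when none share the canopy."""
    candidates = [c for c in centroids if distance(c, canopy_center) <= t1]
    if not candidates:
        candidates = centroids
    return min(candidates, key=lambda c: distance(point, c))


centroids = [(1.0, 1.0), (9.0, 1.0), (50.0, 50.0)]
chosen = nearest_centroid(
    point=(2.0, 0.0), canopy_center=(0.0, 0.0), centroids=centroids, t1=6.0
)
# chosen == (1.0, 1.0); only one of the three centroids was compared.
```

A mapper built on this would emit the chosen centroid as the key and the data point as the value, matching the output described above.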

Reducer:

Input:

Output:

Function:

Stage 4: Cluster Assign

Mapper:

Input:

Output:

Function:

Reducer:

Input:

Output:

Function:

To replicate running:

Edit the run.sh shell script to suit your setup, then run it.

Note:

If you are running on Windows cmd, you have to write your own sort function to sort the mapper output before it reaches the reducer. Personally, I'd recommend just using a Linux OS, which makes the whole process smoother.
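
A minimal sketch of such a sort step, assuming the mappers emit tab-separated `key<TAB>value` lines (the Hadoop Streaming default). The function names and the script name mentioned afterwards are hypothetical.

```python
import sys


def sort_records(lines):
    """Sort 'key<TAB>value' lines by key so that records with the same key
    are adjacent when they reach the reducer. Ties keep their input order."""
    records = [line.rstrip("\r\n") for line in lines if line.strip()]
    return sorted(records, key=lambda rec: rec.split("\t", 1)[0])


def main(stream=sys.stdin, out=sys.stdout):
    """Read mapper output from stream and write it back sorted by key."""
    for rec in sort_records(stream):
        out.write(rec + "\n")


sample = ["b\t2\n", "a\t1\n", "b\t1\n", "a\t3\n"]
ordered = sort_records(sample)
# ordered == ["a\t1", "a\t3", "b\t2", "b\t1"]
```

Saved as, say, sortStage.py with a final call to `main()`, it could be placed between a mapper and its reducer in a pipeline, for example `python mapperStg1.py < data.txt | python sortStage.py | python reducerStg1.py`, where data.txt stands in for whatever input file you use.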

Project Members (Alphabetically):

Archit Shukla

Raj Kiran

Sheraaz Jason

Website:

Canopy Clustering in Python using Hadoop (Map Reduce)
