bread-tan / canopyClusteringPython

Canopy Clustering using MapReduce [Hadoop]

Canopy Clustering using MapReduce in Hadoop
===========================================

Files Included:

  • Gen.py
  • Stage 1: Canopy Center
    • mapperStg1.py
    • reducerStg1.py
  • Stage 2: Canopy Assign
    • mapperStg2.py
    • reducerStg2.py
  • Stage 3: Cluster Center
    • mapperStg3.py
    • reducerStg3.py
  • Stage 4: Cluster Assign
    • mapperStg4.py
    • reducerStg4.py

*The function of each file will be documented at a later date.

Description of the files:
=========================

Gen.py

-> Generates the data set on which Canopy Clustering is run.

-> Generates a set of k centroids.
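
As a rough illustration of what a generator like this might do, here is a minimal sketch. The function names, the uniform distribution, the 2-D points, the value range and the comma-separated line format are all assumptions made for the example; the actual Gen.py may differ on every one of them.

```python
import random


def generate_points(n, dim=2, low=0.0, high=100.0, seed=None):
    """Return n random points, each a tuple of `dim` floats (hypothetical layout)."""
    rng = random.Random(seed)
    return [tuple(rng.uniform(low, high) for _ in range(dim)) for _ in range(n)]


def pick_centroids(points, k, seed=None):
    """Pick k distinct points from the data set as initial centroids."""
    rng = random.Random(seed)
    return rng.sample(points, k)


def to_line(point):
    """Serialise a point as one comma-separated text line (assumed format)."""
    return ",".join(f"{x:.4f}" for x in point)


# Example: 1000 points and 5 starting centroids, reproducible via the seed.
points = generate_points(1000, seed=42)
centroids = pick_centroids(points, k=5, seed=42)
```

Writing one point per line keeps the output easy to feed to a streaming mapper.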

DataPoint.py

-> DataPoint class.
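
A data point class for this kind of pipeline usually needs little more than parsing, distance and serialisation. The sketch below is an assumption about what such a class could look like, not the contents of DataPoint.py; the method names `from_line`, `distance` and `to_line`, the comma-separated format and the Euclidean metric are all hypothetical.

```python
import math


class DataPoint:
    """A point in n-dimensional space (hypothetical sketch, not the repo's class)."""

    def __init__(self, coords):
        self.coords = tuple(float(x) for x in coords)

    @classmethod
    def from_line(cls, line):
        """Parse a comma-separated line such as '1.5,2.0' (assumed format)."""
        return cls(line.strip().split(","))

    def distance(self, other):
        """Euclidean distance to another DataPoint."""
        return math.dist(self.coords, other.coords)

    def to_line(self):
        """Serialise back to the comma-separated text form."""
        return ",".join(str(x) for x in self.coords)
```

Note that `math.dist` requires Python 3.8 or later.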

Stage 1: Canopy Center

  • Mapper:
    • Input: Data points.
    • Output: List of Canopy Centers.
    • Function:
  • Reducer:
    • Input: Canopy Centers
    • Output: Canopy Centers
    • Function:
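
The Function entries above are still to be filled in. One common way to pick canopy centers in MapReduce is sketched below, purely as an assumption about how this stage could work: each mapper keeps a point as a center if it is farther than a tight threshold from every center chosen so far, and the reducer repeats the same pass over the centers from all mappers. The threshold name `T2`, its value and the comma-separated format are hypothetical.

```python
import math

T2 = 10.0  # hypothetical tight threshold; not taken from the repository


def parse_point(line):
    """Parse a comma-separated line into a tuple of floats (assumed format)."""
    return tuple(float(x) for x in line.strip().split(","))


def select_canopy_centers(points, t2=T2):
    """Greedy pass: a point becomes a center if no existing center is within t2."""
    centers = []
    for p in points:
        if all(math.dist(p, c) > t2 for c in centers):
            centers.append(p)
    return centers


def mapper(lines, t2=T2):
    """Pick local canopy centers from this mapper's share of the data."""
    points = (parse_point(line) for line in lines if line.strip())
    return select_canopy_centers(points, t2)


def reducer(center_lines, t2=T2):
    """Merge centers from all mappers by running the same greedy pass again."""
    return mapper(center_lines, t2)
```

For example, `mapper(["0,0", "1,1", "50,50", "51,50"])` keeps only `(0.0, 0.0)` and `(50.0, 50.0)`, since the other two points lie within `T2` of an existing center.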

Stage 2: Canopy Assign

  • Mapper:
    • Input: Canopy Centers
    • Output: Canopy Centers and the Data Points that belong to each.
    • Function:
  • Reducer:
    • Input: Canopy Centers, Data Points (stdin)
    • Output: Identity
    • Function: Echoes the result from the Mapper.
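
A minimal sketch of this stage, under stated assumptions: the mapper emits one `center<TAB>point` line for every canopy center within a loose threshold of the point, and the reducer passes lines through unchanged. The threshold `T1`, its value, the helper `fmt` and the tab-separated output are hypothetical choices for the example, not details confirmed by the repository.

```python
import math

T1 = 20.0  # hypothetical loose threshold; not taken from the repository


def fmt(point):
    """Serialise a point as comma-separated text (assumed format)."""
    return ",".join(str(x) for x in point)


def mapper(points, centers, t1=T1):
    """Emit 'center<TAB>point' for every canopy the point falls into."""
    for p in points:
        for c in centers:
            if math.dist(p, c) <= t1:
                yield f"{fmt(c)}\t{fmt(p)}"


def reducer(lines):
    """Identity reducer: echo each mapper line unchanged."""
    for line in lines:
        yield line.rstrip("\n")
```

A point can land in more than one canopy, in which case it is emitted once per canopy. A point farther than `T1` from every center is not emitted at all in this sketch.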

Stage 3: Cluster Center

  • Mapper:
    • Input:
      • List of 'k' Centroids
      • List of Canopy Centers
      • Canopy Centers, Data Points (stdin)
    • Output: K Centroids and the Data Points that belong to each.
    • Function:
  • Reducer:
    • Input:
    • Output:
    • Function:
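
The Reducer entries above are still blank, so the following is only a guess at how a canopy-assisted k-means step could be put together from the listed inputs. The mapper compares a point only against centroids that share a canopy with it, and the reducer averages the points assigned to each centroid. The threshold `T1`, the fallback to all centroids and the in-memory grouping are assumptions made for the sketch.

```python
import math
from collections import defaultdict

T1 = 20.0  # hypothetical loose canopy threshold


def canopies_of(p, canopy_centers, t1=T1):
    """Set of canopy centers whose canopy contains p."""
    return {c for c in canopy_centers if math.dist(p, c) <= t1}


def mapper(pairs, centroids, canopy_centers, t1=T1):
    """pairs: (canopy_center, point) tuples. Yields (nearest_centroid, point)."""
    point_canopies = defaultdict(set)
    for canopy, point in pairs:
        # Identical points collapse into one entry in this sketch.
        point_canopies[point].add(canopy)
    for point, canopies in point_canopies.items():
        # Only centroids sharing a canopy with the point are candidates.
        candidates = [k for k in centroids
                      if canopies & canopies_of(k, canopy_centers, t1)]
        if not candidates:  # assumed fallback: compare against every centroid
            candidates = centroids
        yield min(candidates, key=lambda k: math.dist(point, k)), point


def reducer(assignments):
    """Return {old_centroid: mean of its assigned points}."""
    groups = defaultdict(list)
    for centroid, point in assignments:
        groups[centroid].append(point)
    return {c: tuple(sum(xs) / len(xs) for xs in zip(*pts))
            for c, pts in groups.items()}
```

Restricting the candidates this way is the usual reason for building canopies first: most distance computations between far-apart points and centroids are skipped.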

Stage 4: Cluster Assign

  • Mapper:
    • Input:
    • Output:
    • Function:
  • Reducer:
    • Input:
    • Output:
    • Function:

To replicate running:

Edit the run.sh shell script as needed, then run it.

Note:

If running on Windows cmd, you have to write your own sort function to sort the mapper's output before it reaches the reducer. Personally, I'd recommend just using a Linux OS to make it all run smoothly.
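
If you do need such a sort function, a minimal sketch could look like the one below. It assumes the mapper writes `key<TAB>value` lines, which is the default key/value convention in Hadoop Streaming. The function name and the file names in the usage comment are hypothetical.

```python
def sort_mapper_output(lines):
    """Sort mapper lines by key (the text before the first tab).

    sorted() is stable, so values that share a key keep their original order.
    """
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    return sorted(cleaned, key=lambda line: line.split("\t", 1)[0])


# Possible use in a stand-alone script (hypothetical sort.py), placed between
# a mapper and a reducer in a local pipeline:
#
#   import sys
#   for line in sort_mapper_output(sys.stdin):
#       print(line)
#
#   python mapperStg1.py < data.txt | python sort.py | python reducerStg1.py
```

This only mimics the grouping that Hadoop's shuffle provides for a single local run; it does not partition data across reducers.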

Project Members (Alphabetically):

Website:

Canopy Clustering in Python using Hadoop (Map Reduce)
