# rahulpshah / DocumentClusteringHadoop


## CSC522 - Document Clustering using KMeans on Hadoop

## Dataset

https://archive.ics.uci.edu/ml/datasets/Bag+of+Words

## Important Files

preprocess.py - Preprocesses the raw dataset into the format the job expects, one document per line:

        docid:{(word_id;count),(word_id;count),...,(word_id;count)}
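
For illustration, here is a minimal Java sketch of how a line in this format could be parsed back into a sparse word_id → count map (the example line, class name, and parsing details are hypothetical; the repo's Vector.java may differ):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical sketch: parse one preprocessed line into a sparse
    // word_id -> count map. Not the repo's actual parser.
    public class LineParser {
        public static void main(String[] args) {
            String line = "42:{(3;5),(17;2),(250;1)}";   // made-up example line
            String[] parts = line.split(":", 2);
            int docId = Integer.parseInt(parts[0].trim());

            String body = parts[1].trim();
            body = body.substring(1, body.length() - 1); // strip '{' and '}'

            Map<Integer, Integer> counts = new LinkedHashMap<>();
            for (String pair : body.split("\\),\\(")) {
                String[] kv = pair.replace("(", "").replace(")", "").split(";");
                counts.put(Integer.parseInt(kv[0]), Integer.parseInt(kv[1]));
            }
            System.out.println("doc " + docId + " -> " + counts); // doc 42 -> {3=5, 17=2, 250=1}
        }
    }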

KMeans.java - Driver program that sets up the job configuration

KMeansMapper.java - Mapper that assigns each document vector to its nearest cluster (see the sketch after this list)

KMeansReducer.java - Reducer that recalculates each cluster's mean

Vector.java - Custom vector class used to represent a document

Distance.java - Auxiliary methods implementing various distance measures for evaluation
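
To make the mapper/reducer split concrete, here is a minimal framework-free sketch of one KMeans iteration in plain Java (standalone, hypothetical code; the actual classes implement Hadoop's Mapper/Reducer APIs over the repo's Vector type, with Distance.java supplying the distance function):

    import java.util.Arrays;

    // One KMeans round: the "map" step assigns each document vector to its
    // nearest centroid; the "reduce" step recomputes a centroid as the mean
    // of its assigned vectors. Dense double[] vectors stand in for Vector.java.
    public class KMeansStep {
        // Euclidean distance, one of the measures Distance.java could provide.
        static double euclidean(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return Math.sqrt(sum);
        }

        // "Map" side: index of the closest centroid for one document.
        static int nearestCentroid(double[] doc, double[][] centroids) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double d = euclidean(doc, centroids[c]);
                if (d < bestDist) { bestDist = d; best = c; }
            }
            return best;
        }

        // "Reduce" side: mean of all vectors assigned to one cluster.
        static double[] recomputeMean(double[][] members) {
            double[] mean = new double[members[0].length];
            for (double[] v : members)
                for (int i = 0; i < v.length; i++) mean[i] += v[i];
            for (int i = 0; i < mean.length; i++) mean[i] /= members.length;
            return mean;
        }

        public static void main(String[] args) {
            double[][] centroids = { {0, 0}, {10, 10} };             // toy K = 2
            double[][] docs = { {1, 1}, {9, 11}, {0, 2}, {12, 9} };
            System.out.println(nearestCentroid(docs[0], centroids)); // 0
            System.out.println(Arrays.toString(
                recomputeMean(new double[][] { docs[1], docs[3] }))); // [10.5, 10.0]
        }
    }

Each Hadoop iteration performs exactly one such assign/recompute round over the full corpus.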

## Dependencies

  1. Hadoop Framework (Apache Hadoop 2.7.2)
  2. SLF4J logging library (slf4j-1.7.21)

## Installation and Running

The program can be run using the provided runnable JAR file (kmeans.jar).

Preprocess the documents with preprocess.py, upload the result to HDFS, and run the job:


    $> python preprocess.py text_file > input_file
    $> hdfs dfs -put input_file /
    $> hadoop jar kmeans.jar path/to/data /output/directory K max_iterations
    $> hdfs dfs -cat /outputFINAL_N/part-r-00000
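
For orientation, a driver like KMeans.java typically chains one MapReduce job per iteration. Here is a hypothetical sketch of that loop using the standard Hadoop 2.x Job API (the output key/value classes, the per-iteration directory naming, and how centroids are handed to the next round are assumptions, not the repo's exact code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical driver loop: one job per KMeans iteration, up to
    // max_iterations. The real KMeans.java may also stop early once the
    // centroids converge and names the final directory with a FINAL suffix.
    public class KMeansDriverSketch {
        public static void main(String[] args) throws Exception {
            String input = args[0], outputBase = args[1];
            int maxIterations = Integer.parseInt(args[3]);   // args[2] is K

            for (int i = 0; i < maxIterations; i++) {
                Job job = Job.getInstance(new Configuration(), "kmeans iter " + i);
                job.setJarByClass(KMeansDriverSketch.class);
                job.setMapperClass(KMeansMapper.class);      // assign step
                job.setReducerClass(KMeansReducer.class);    // mean step
                job.setOutputKeyClass(IntWritable.class);    // assumed cluster id
                job.setOutputValueClass(Text.class);         // assumed centroid text
                FileInputFormat.addInputPath(job, new Path(input));
                FileOutputFormat.setOutputPath(job, new Path(outputBase + "_" + i));
                if (!job.waitForCompletion(true)) System.exit(1);
                // The next iteration reads the centroids this round wrote,
                // e.g. via the job Configuration or the distributed cache.
            }
        }
    }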

## Example


    $> python preprocess.py text.txt > input.txt
    $> hdfs dfs -put input.txt /
    $> hadoop jar kmeans.jar /input.txt /output 4 1000
    $> hdfs dfs -cat /outputFINAL_4/part-r-00000

## Contributors

- Rahul Prashant Shah (rshah5@ncsu.edu)
- Aniket Hiren Patel (apatel10@ncsu.edu)
- Ethan Swartzentruber (epswartz@ncsu.edu)
- Ming Dai (mdai3@ncsu.edu)
- Rongjin Wang (rwang10@ncsu.edu)
- Aditya Mandhare (amandha3@ncsu.edu)
    
