Hierarchical clustering - Vaivaswatha N This is an implementation of hierarchical clustering that clusters based on just distances between points. Quantification of points is not needed (as long as distance/similarity between points is defined). I wrote this because I needed to cluster datasets where the points were just associated with a set (and not the usual multi-dimensional euclidian coordinates), and I could only define the distances between points (based on how similar the sets of two points were). Hence centroid based algorithms such as kmeans could not be used. The implementation allows specification of custom distance functions. Currently single-linkage (LINK_MINIMUM) and complete-linkage (LINK_MAXIMUM) are supported. When I shifted form single-linkage to complete-linkage: "This results in a preference for compact clusters with small diameters over long, straggly clusters, but also causes sensitivity to outliers", as described in [3]. Requirements: - GCC with support for the new c++11 standard (-std=c++0x) (Only for the in-built testing code). Usage: - Include the file "hclustering.h" in your program. The public members of the class "HClustering" in this file is the interface to the clustering code. - Look at the main() function in "hclustering.cpp" for an example usage of the clustering interface. - Make sure that "hclustering.cpp" is compiled along with your code. - To just test the code based on randomly generated data, you can compile hclustering.cpp (which defines main() conditionally) this way: $g++ -std=c++11 -o hclustering -DTEST_CODE hclustering.cpp You can run "./hclustering > clustered_data.txt" to get a set of randomly data along with their cluster ids (not necessarily starting from 0). This file can be visualised in gnuplot as below: gnuplot> plot 'clustered_data.txt' using 1:2:3 with labels OR gnuplot> plot 'clustered_data.txt' using 1:2:3 with points palette - For improved gnuplot support, uncomment the #define GNUPLOT in "hclustering.cpp". This will create files in /tmp/hclustering/, along with a gnuplot script to render them. Just execute gnuplot as: $cd /tmp/hclustering/ ; gnuplot plot.gp This will display the clustered data. NOTE: Make sure /tmp/hclustering/ exists, the program will not create it. - There are some non-random sample inputs in sample_data/ directory. Compile "main.cpp" in this directory as $cd sample_data ; g++ main.cpp -o visualise You can then run the program visualise as $./visualise data/s1-15.txt 15 Where data/s1-15.txt is a sample data file, and it has 15 clusters. (every file name specifies the number of clusters). TODO: - Still need to do a lot of optimizations, with the most important one being smartly caching the distance data (instead of calling user distance function everytime). Contact: vaivaswatha@hpc.serc.iisc.in (http://puttu.net/) References: [1] http://elki.dbs.ifi.lmu.de/wiki/Tutorial/HierarchicalClustering [2] http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html [3] http://nlp.stanford.edu/IR-book/html/htmledition/single-link-and-complete-link-clustering-1.html