vaivaswatha/hclustering

Hierarchical clustering - Vaivaswatha N

This is an implementation of hierarchical clustering that clusters
based on just distances between points.

Points need not have coordinates, as long as a distance (or similarity)
between any two points is defined. I wrote this because I needed to cluster
datasets in which each point was just associated with a set (rather than
the usual multi-dimensional Euclidean coordinates), and I could only
define the distances between points (based on how similar the sets of
two points were). Hence centroid-based algorithms such as k-means could
not be used. The implementation allows custom distance functions to be
specified.
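
As an illustration of a distance defined only through set similarity, the
sketch below computes the Jaccard distance between two sets. The names
SetPoint and jaccardDistance are hypothetical and are not part of
"hclustering.h"; the actual signature expected for a custom distance function
should be taken from that header.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical point type: each point is just a set of integer ids.
using SetPoint = std::set<int>;

// Jaccard distance: 1 - |intersection| / |union|.
// Returns 0 for identical sets and 1 for disjoint ones.
double jaccardDistance(const SetPoint &a, const SetPoint &b)
{
    if (a.empty() && b.empty())
        return 0.0; // two empty sets are treated as identical

    // std::set iterates in sorted order, as set_intersection requires.
    std::vector<int> common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(common));

    const std::size_t unionSize = a.size() + b.size() - common.size();
    return 1.0 - static_cast<double>(common.size()) / unionSize;
}
```

For example, the sets {1, 2, 3} and {2, 3, 4} share two of four distinct
elements, so their distance is 0.5.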

Currently single-linkage (LINK_MINIMUM) and complete-linkage
(LINK_MAXIMUM) are supported. Regarding the shift from single-linkage to
complete-linkage, [3] notes: "This results in a preference for compact
clusters with small diameters over long, straggly clusters, but also causes
sensitivity to outliers".
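
The difference between the two linkage criteria can be sketched as below.
This is a minimal illustration, not the code in "hclustering.cpp": the enum
Linkage and the function clusterDistance are hypothetical names, and clusters
are assumed to be vectors of point indices.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical names; Single plays the role of LINK_MINIMUM and
// Complete the role of LINK_MAXIMUM.
enum class Linkage { Single, Complete };

// Distance between two non-empty clusters of point indices, given any
// pairwise distance callable dist(i, j).
template <typename Dist>
double clusterDistance(const std::vector<int> &a, const std::vector<int> &b,
                       Linkage linkage, Dist dist)
{
    assert(!a.empty() && !b.empty());
    double best = dist(a.front(), b.front());
    for (int i : a) {
        for (int j : b) {
            const double d = dist(i, j);
            // Single-linkage keeps the closest pair,
            // complete-linkage keeps the farthest pair.
            best = (linkage == Linkage::Single) ? std::min(best, d)
                                                : std::max(best, d);
        }
    }
    return best;
}
```

With points at positions 0, 1, 5 and 9 on a line, the clusters {0, 1} and
{5, 9} are 4 apart under single-linkage and 9 apart under complete-linkage.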

Requirements:
- GCC with support for the C++11 standard (-std=c++11, or -std=c++0x on
older GCC versions). This is needed only for the built-in testing code.

Usage:
- Include the file "hclustering.h" in your program. The public
members of the class "HClustering" in this file are the interface
to the clustering code.
- Look at the main() function in "hclustering.cpp" for an example
usage of the clustering interface.
- Make sure that "hclustering.cpp" is compiled along with your code.
- To just test the code on randomly generated data, you can compile
hclustering.cpp (which defines main() conditionally) this way:
$g++ -std=c++11 -o hclustering -DTEST_CODE hclustering.cpp
You can run "./hclustering > clustered_data.txt" to get a set of
randomly generated data points along with their cluster ids (not
necessarily starting from 0). This file can be visualised in gnuplot as below:
gnuplot> plot 'clustered_data.txt' using 1:2:3 with labels
OR
gnuplot> plot 'clustered_data.txt' using 1:2:3 with points palette
- For improved gnuplot support, uncomment the #define GNUPLOT in
"hclustering.cpp". This will create files in /tmp/hclustering/, along
with a gnuplot script to render them. Just execute gnuplot as:
$cd /tmp/hclustering/ ; gnuplot plot.gp
This will display the clustered data.
NOTE: Make sure /tmp/hclustering/ exists; the program will not create it.
- There are some non-random sample inputs in the sample_data/ directory.
Compile "main.cpp" in this directory as
$cd sample_data ; g++ main.cpp -o visualise
You can then run the program visualise as
$./visualise data/s1-15.txt 15
where data/s1-15.txt is a sample data file containing 15 clusters (every
file name specifies the number of clusters).

TODO:
- Still need to do a lot of optimizations, with the most important one
being smartly caching the distance data (instead of calling the user's
distance function every time).
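
One possible shape for such a cache is sketched below. It is only an
illustration under stated assumptions, not the planned implementation: the
class name CachedDistance is hypothetical, points are assumed to be identified
by integer indices, and the distance is assumed to be symmetric.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <map>
#include <utility>

// Hypothetical wrapper: memoizes a symmetric distance over point indices.
class CachedDistance {
public:
    explicit CachedDistance(std::function<double(int, int)> dist)
        : dist_(std::move(dist)) {}

    double operator()(int i, int j)
    {
        // Symmetry: key on (min, max) so (i, j) and (j, i) share an entry.
        const std::pair<int, int> key(std::min(i, j), std::max(i, j));
        const auto it = cache_.find(key);
        if (it != cache_.end())
            return it->second; // cache hit: user function is not called

        const double d = dist_(i, j);
        cache_.emplace(key, d);
        return d;
    }

    // Number of distinct pairs computed so far.
    std::size_t size() const { return cache_.size(); }

private:
    std::function<double(int, int)> dist_;
    std::map<std::pair<int, int>, double> cache_;
};
```

A std::map keeps the sketch short; for n points known up front, a flat
triangular array of n*(n-1)/2 entries would likely be faster, at the cost of
allocating all of that memory in advance.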

Contact:
vaivaswatha@hpc.serc.iisc.in (http://puttu.net/)

References:
[1] http://elki.dbs.ifi.lmu.de/wiki/Tutorial/HierarchicalClustering
[2] http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html
[3] http://nlp.stanford.edu/IR-book/html/htmledition/single-link-and-complete-link-clustering-1.html
