vaivaswatha / hclustering

A simple implementation of hierarchical clustering

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

        Hierarchical clustering - Vaivaswatha N

This is an implementation of hierarchical clustering that clusters
based on just distances between points. 

Quantification of points is not needed (as long as distance/similarity 
between points is defined). I wrote this because I needed to cluster 
datasets where the points were just associated with a set (and not 
the usual multi-dimensional euclidian coordinates), and I could only 
define the distances between points (based on how similar the sets of 
two points were). Hence centroid based algorithms such as kmeans could
not be used. The implementation allows specification of custom distance
functions.

Currently single-linkage (LINK_MINIMUM) and complete-linkage 
(LINK_MAXIMUM) are supported. When I shifted form single-linkage to 
complete-linkage: "This results in a preference for compact clusters with 
small diameters over long, straggly clusters, but also causes sensitivity 
to outliers", as described in [3].

Requirements:
  - GCC with support for the new c++11 standard (-std=c++0x)
    (Only for the in-built testing code).

Usage:
  - Include the file "hclustering.h" in your program. The public
    members of the class "HClustering" in this file is the interface
    to the clustering code.
  - Look at the main() function in "hclustering.cpp" for an example
    usage of the clustering interface.
  - Make sure that "hclustering.cpp" is compiled along with your code.
  - To just test the code based on randomly generated data, you can compile
    hclustering.cpp (which defines main() conditionally) this way:
       $g++ -std=c++11 -o hclustering -DTEST_CODE hclustering.cpp
    You can run "./hclustering > clustered_data.txt" to get a set of
    randomly data along with their cluster ids (not necessarily starting
    from 0). This file can be visualised in gnuplot as below:
       gnuplot> plot 'clustered_data.txt' using 1:2:3 with labels
                             OR
       gnuplot> plot 'clustered_data.txt' using 1:2:3 with points palette
  - For improved gnuplot support, uncomment the #define GNUPLOT in 
    "hclustering.cpp". This will create files in /tmp/hclustering/, along
    with a gnuplot script to render them. Just execute gnuplot as:
        $cd /tmp/hclustering/ ; gnuplot plot.gp
    This will display the clustered data.
    NOTE: Make sure /tmp/hclustering/ exists, the program will not create it.
  - There are some non-random sample inputs in sample_data/ directory. Compile
    "main.cpp" in this directory as
        $cd sample_data ; g++ main.cpp -o visualise
    You can then run the program visualise as
        $./visualise data/s1-15.txt 15
    Where data/s1-15.txt is a sample data file, and it has 15 clusters. (every
    file name specifies the number of clusters).

TODO:
  - Still need to do a lot of optimizations, with the most important one
    being smartly caching the distance data (instead of calling user
    distance function everytime).

Contact:
    vaivaswatha@hpc.serc.iisc.in (http://puttu.net/)

References:
[1] http://elki.dbs.ifi.lmu.de/wiki/Tutorial/HierarchicalClustering
[2] http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html
[3] http://nlp.stanford.edu/IR-book/html/htmledition/single-link-and-complete-link-clustering-1.html

About

A simple implementation of hierarchical clustering


Languages

Language:C++ 100.0%