canonizer/gpumafia

INTRO

gpumafia implements MAFIA clustering algorithm, mostly as described in the paper
"Parallel Algorithms for Clustering High-Dimensional Large-Scale Datasets" by
H. Nagesh, S. Goil and A. Choudhary. Refer to this paper for the detailed
description of the algorithm as well as the parameters involved. This
implementation can run both on CPUs (single-core only) as well as GPUs (only a
single GPU). The implementation mostly follows the paper, while introducing some
additional algorithmic optimizations.

This code has been developed at Juelich Supercomputing Center, Forschungszentrum
Juelich, and is a property of Forschungszentrum Juelich. Users can modify and/or
distribute code free of charge, provided that this notice is retained.

The implementation expects a single file with data points, and writes output
files, one or two per cluster, in the same directory where the input file is
located. Indices of points belonging to the cluster, and optionally, the points
themselves, are the output of the program.

COMPILING

The main project directory is cppmafia; all further commands are given with
respect to that directory.

./makefile is the makefile for the project, and make definitions are in
makefile.def. There are currently two definitions: the C++ compiler used (CXX),
and where the program will be installed (PREFIX). 

By default, NVidia Compiler (nvcc) is used, though this can be changed to GCC
(g++) or Intel Compiler (icc). GPU support is only enabled when nvcc is used.

To compile the project, execute the following command in the project directory:

	 make

To install the cppmafia program:

	 make install

To uninstall the program:
	 
	 make uninstall


USING

To run the program, just execute the following command:

	 cppmafia <point-file>

The point file usually has .dat extension. Files containing cluster data will be
created in the directory of <point-file>, one file with point indices (.idx) per
cluster. 

The entire options of the program can be obtained with the following command:

		cppmafia -h
	
or

		cppmafia --help

Some of the more useful options are (GNU option syntax supported):

		 -p,--output-points - write out point data in addition to point
      indices. However, this may considerably increase I/O time

		 -d,--device - use GPUs; the program will fail if no CUDA GPUs are available

		 -V,--verbose - output results of algorithm's intermediate steps

Options for specifying algorithm parameters (GNU option syntax supported):

		 -a,--alpha <alpha> - dense window threshold (alpha), default 1.5
		 -b,--beta <beta> - window merge threshold (beta), default 0.25
		 -n,--bins <nbins> - minimum number of bins per dimension, default 1000
		 -u,--min-wins <nwindows> - number of windows for uniform dimensions,
		 							 default 5
		 -M,--max-wins <nwindows> - maximum number of windows per dimension,
		 							 default 20

And though the case when number of bins is not divisible by the number of
windows (either -u or -M) should work, it has not been tested. So it better be
divisible. 


INPUT DATA 

If you want to use MAFIA for something serious, you probably have them
already. However, if you want just to play with the utility, profile it or test
it for correctness and/or performance, there is a data generator provided. See
utils/clugen for the data generator. You can compile it by doing make (in
that directory), and then generate different datasets with it. E.g.,

bin/clugen -n100000 -d10 -k8 -m2 ~/try/mafia/cluster-8.dat

will generate a 100,000-point 10-dimensional dataset, with 2 8-dimensional
clusters, roughly 45,000 points each. The details of each cluster would be
printed to stdout of clugen. When that file is input into MAFIA, it should then
find the same or similar clusters. Generally, MAFIA should find one cluster with
default parameters robustly; with more clusters, robustness decreases. To
generate clusters which do not intersect in any dimensions (which is somewhat
unrealistic), use clugen's -N switch.

CODE

options.{h,cpp} contain code for defining and parsing options. 
timing.{h,cpp} contain helper functions for collecting timing information. 
ref.h is the implementation of a smart pointer to a reference-counted object. 
mafia-io.{h,cpp} contains functions for reading point files and writing out the
resulting clusters to the file system.

utils.{h,cpp} contain some utility code; this is currently limited to some
macros, and bulk_{alloc,free} functions, which are used to allocate big data
structures. bulk* functions work with CUDA pinned host memory when the device is
used; otherwise, they just call malloc(). malloc() is still used directly or
through STL for smaller memory allocations.

main.cpp is some glue code which parses the command line, reads point data, runs
the MAFIA algorithm, and writes out resulting clusters. 

The code of the MAFIA algorithm is mostly concentrated in mafia-solver.{h,cpp},
which contain host code, and mafia-solver-device.cu, which contains device
code. (C)DUs and adaptive windows are moved to separate classes; they are
contained in cdu.{h,cpp} and window.{h,cpp}, repsectively. 

MafiaSolver<T> is the class implementing MAFIA algorithm. It is parameterized
with point coordinate type (float or double), double is used by default. The
method that implements MAFIA algorithm is solve().

The algorithm starts with building the histograms, build_histos(), which can be
done on device. build_windows() then builds windows, always on host. Building
windows starts from uniform windows, generated by
build_uniform_windows(). Depending on their point counts, uniform windows can be
merged into adaptive windows. Finally, build_bitmaps() builds the bitmaps, for
which the device can be used. Bitmaps are used for counting points, and this is
the modification of the original algorithm. Each bitmap stores information on
points belonging to a particular dense adaptive window, with only 1 bit used per
point. This saves both memory bandwidth and arithmetic operations, compared to
direct counting.

After building histograms, windows and bitmaps, the algorithm enters the main
loop, which continues until candidate dense units (CDUs) can no longer be
produced. The algorithm starts with CDUs of dimension one, and increases CDU
dimensionality at each iteration by 1.

Each iteration consists of generating CDUs (find_cdus()), checking which of them
are dense (find_dus()), and finally checking for any unassimilated DUs left
(find_unjoined_dus()). Unassimilated DUs are added to the list of terminal DUs,
which is used to construct the final clusters.

To find CDUs (find_cdus()), the algorithm takes all pairs of DUs of
dimensionality a-1 and checks if they can be merged into a CDU of dimensionality
a. It is so if they contain a common subsequence of dimensionality a-2; the
check is performed in Cdu::can_merge_with() function in cdu.cpp. Any
successfully generated CDUs are added to a set, which also prevents generating
duplicate CDUs several times. For dimension 1, all dense windows are considered CDUs.

To check if newly found CDUs are dense (find_dus()), the number of points inside
each CDU needs to be counted, and then compared with the thresholds of
windows. To do this, and operation is performed on bitmaps of CDUs windows, and
the number of bits is counted in the resulting bitmap, which gives the number of
points. If the device (GPU) is used, this operation is performed on the device;
this gives considerable (up to 50x) acceleration compared to doing this on
host. 

Finally, unassimilated DUs from previous iteration need to be moved to the list
of terimal DUs (find_unjoined_dus()). For this, each DU in previous dimension is
checked against the newly found DUs to see whether the old DU has been
assimilated. The assimilation check is performed in Cdu::is_assimilated_into()
method. 

After there are no more CDUs, the algorithm proceeds to building the final
clusters. To achieve this, all terminal DUs are added to a graph
(build_du_graph()), where the DUs are the nodes, and a (bidirectional) edge
exists between two DUs of the same dimensionality only if they share a common
face. The connected components of the graph are then found, by performing
breadth-first search repeatedly until no unvisited nodes are left
(build_du_clusters()). Each connected component is then said to constitute a
single cluster. Finally, for each cluster the array of indices of points
belonging to it is built (build_clusters()).
canonizer / gpumafia

About

Languages