krenova / MixedDistributionMixtureModels

Mixed Distribution Mixture Models (MDMM)

MDMM is a mixture-model clustering method that combines the Gaussian distribution and the multinomial distribution to segment mixed-type data, that is, data points whose features are a mix of numerical, categorical, and multinomial variables.
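
To make the idea concrete, here is a minimal sketch of how one cluster could score a mixed-type data point: a Gaussian term per numerical feature plus a categorical probability term per categorical feature. It assumes the features are independent given the cluster, which the description above does not state. The `Component` struct and `componentLogDensity` are hypothetical names for illustration and are not taken from mdmmCore.cpp.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical parameters of one cluster; these names are illustrative
// and are not taken from mdmmCore.cpp.
struct Component {
    std::vector<double> mean;      // Gaussian mean per numerical feature
    std::vector<double> variance;  // Gaussian variance per numerical feature
    // catProb[j][c] = P(categorical feature j takes level c)
    std::vector<std::vector<double>> catProb;
};

// Log-density of one mixed-type observation under one cluster, assuming
// (for this sketch) that features are independent given the cluster.
double componentLogDensity(const Component& k,
                           const std::vector<double>& numeric,
                           const std::vector<std::size_t>& categorical) {
    assert(numeric.size() == k.mean.size());
    assert(categorical.size() == k.catProb.size());
    const double kPi = 3.14159265358979323846;
    double logp = 0.0;
    // Gaussian contribution of each numerical feature.
    for (std::size_t j = 0; j < numeric.size(); ++j) {
        const double d = numeric[j] - k.mean[j];
        logp += -0.5 * std::log(2.0 * kPi * k.variance[j])
                - 0.5 * d * d / k.variance[j];
    }
    // Categorical contribution: log-probability of the observed level.
    for (std::size_t j = 0; j < categorical.size(); ++j) {
        logp += std::log(k.catProb[j][categorical[j]]);
    }
    return logp;
}
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once a data point has more than a handful of features.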

The advantage of using mixed distributions to cluster mixed-type data is better segmentation, because each data type is modelled by a distribution suited to its characteristics. On a more fundamental level, such a model also circumvents the problems that arise when distance-based algorithms, for example K-Means, are applied to categorical and multinomial data. While there are excellent algorithms designed to handle categorical and multinomial data (see ROCK clustering or other variants of mixture models), these algorithms do not work well with numerical data. MDMM is designed to address these common problems with mixed data types. Because MDMM is written with Rcpp, R's C++ API, and uses the Expectation-Maximization (EM) algorithm to infer model parameters, the code runs relatively fast and should have no problems handling a million data points.
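
As an illustration of the E-step in EM, the sketch below converts per-cluster scores of the form log(weight) + log-density into cluster membership probabilities for a single data point. The function name `responsibilities` is hypothetical, and this is a generic textbook formulation, not necessarily how mdmmCore.cpp computes it.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// E-step for one observation: logWeighted[k] holds
// log(weight_k) + log p(x | cluster k). The result is the posterior
// probability of each cluster. Subtracting the maximum before calling
// exp() (the log-sum-exp trick) avoids underflow to zero.
std::vector<double> responsibilities(const std::vector<double>& logWeighted) {
    assert(!logWeighted.empty());
    const double m = *std::max_element(logWeighted.begin(), logWeighted.end());
    std::vector<double> r(logWeighted.size());
    double total = 0.0;
    for (std::size_t k = 0; k < r.size(); ++k) {
        r[k] = std::exp(logWeighted[k] - m);
        total += r[k];
    }
    for (double& v : r) v /= total;  // normalize so the entries sum to 1
    return r;
}
```

With scores near -1000, a naive `exp()` would return zero for every cluster and the normalization would divide by zero; shifting by the maximum keeps the largest term at exactly 1.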

Code

The main source files are:

  1. mdmmCpp.R
  2. mdmmCore.cpp

where mdmmCpp.R is the file that your R script sources, and mdmmCore.cpp is the C++ code that mdmmCpp.R is built on.

For a demonstration of the clustering function, and a speed comparison against equivalent code written in pure R, refer to the Jupyter notebook in this repository.

Development

As expected of an EM algorithm, the log-likelihood should never decrease from one iteration to the next. However, there are instances where the log-likelihood dips, which contradicts EM theory. Work is needed to determine whether the dips are caused by numerical overflow or by bugs.
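
One possible way to investigate such dips is to record the log-likelihood at every iteration and flag only decreases that exceed a small relative tolerance, so that floating-point rounding noise is separated from genuine drops. The helper below is a hypothetical diagnostic sketch; `firstLogLikDip` and its default tolerance are assumptions, not part of MDMM.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the index of the first iteration whose log-likelihood is lower
// than the previous one by more than a relative tolerance, or -1 if the
// sequence is non-decreasing up to that tolerance.
long firstLogLikDip(const std::vector<double>& logLik, double relTol = 1e-8) {
    assert(relTol >= 0.0);
    for (std::size_t i = 1; i < logLik.size(); ++i) {
        // Allowed slack scales with the magnitude of the previous value.
        const double slack = relTol * (std::fabs(logLik[i - 1]) + 1.0);
        if (logLik[i] < logLik[i - 1] - slack) return static_cast<long>(i);
    }
    return -1;
}
```

A dip that disappears when the tolerance is loosened slightly points toward rounding error, whereas a large, reproducible dip points toward a bug in the E-step or M-step.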

Note that despite this, in simulations the estimated parameters have always converged to their theoretical values, which suggests that the issue should not affect inference (however, use at your own risk!). Convergence is declared when the estimated parameters stabilize.
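
A stopping rule based on parameter stabilization can be sketched as follows: flatten all parameters into one vector per iteration and stop once the largest absolute change falls below a threshold. The function name `parametersStabilized` and the default tolerance of 1e-6 are assumptions for illustration; the actual criterion used by MDMM may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// True when no parameter moved by more than tol between two successive
// EM iterations (maximum absolute change criterion).
bool parametersStabilized(const std::vector<double>& previous,
                          const std::vector<double>& current,
                          double tol = 1e-6) {
    assert(previous.size() == current.size());
    double maxChange = 0.0;
    for (std::size_t i = 0; i < current.size(); ++i) {
        maxChange = std::max(maxChange, std::fabs(current[i] - previous[i]));
    }
    return maxChange < tol;
}
```

Because this rule looks only at the parameters, it can still terminate cleanly in runs where the log-likelihood shows a small dip.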

Reading References

[1] http://datascience.stackexchange.com/questions/22/k-means-clustering-for-mixed-numeric-and-categorical-data

About

License: GNU General Public License v3.0


Languages

Jupyter Notebook 83.9%, R 11.5%, C++ 4.7%