alvarocollet / painless_gmm

A real-world implementation of a Gaussian Mixture Model in C++, without the pain.

Tutorial on Painless Gaussian Mixture Models

Introduction to Painless GMM

A Gaussian Mixture Model (GMM) is a probability distribution defined as a linear combination of weighted Gaussian distributions. It is commonly used in computer vision and image processing tasks, such as estimating a color distribution for foreground/background segmentation, or in clustering problems. This project is intended as an educational tool on how to properly implement a Gaussian Mixture Model.

GMMs are annoying to implement. The math behind GMMs is very easy to understand, but it is not possible to take the formulas and implement them directly. A straight implementation of the GMM formulas leads to underflow errors, singular matrices, divisions by zero, and NaNs. The likelihoods involved in GMMs are very frequently too small to be directly represented as floating-point numbers (and, even more so, their products). In the following paragraphs and code, I show the changes needed to take GMM from theory to a robust real-world implementation. Therefore, this is an implementation of GMM without the pain: a Painless GMM.

GMM: The theory

A GMM is a probability distribution defined as a linear combination of $K$ weighted Gaussian distributions,

$$p(x) = \sum_{k=1}^{K} \lambda_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),$$

with weights $\lambda_k$, means $\mu_k$ and covariance matrices $\Sigma_k$. We simplify this notation in the following sections as $\mathcal{N}_k(x) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$, and write $\lambda_k$ for the mixture weights. The GMM likelihood then becomes $p(x) = \sum_{k=1}^{K} \lambda_k \, \mathcal{N}_k(x)$.

For more information about GMMs, visit Reynolds' GMM tutorial or the Wikipedia page.

Training a GMM with Expectation-Maximization (EM)

We start with a data set $X = \{x_1, \ldots, x_N\}$ of $N$ $D$-dimensional feature vectors $x_i$ (e.g., $D = 3$ for RGB color pixels), an initial set of $K$ Gaussian distributions $\mathcal{N}_k$ (initialized as described below), and $K$ weights $\lambda_k$. We use the Expectation-Maximization (EM) algorithm to optimize the Gaussian distributions and weights that maximize the global GMM likelihood $p(X) = \prod_{i=1}^{N} p(x_i)$, that is, the mixture of Gaussian distributions and weights that best fits the data set $X$.

The EM algorithm is an optimization algorithm which maximizes $p(X)$ by coordinate ascent, alternating between expectation steps (E-steps) and maximization steps (M-steps). The algorithm starts with an initial E-step.

In the E-step, we determine the responsibility $\gamma_{ik}$ of each Gaussian distribution $k$ for each training data point $x_i$, as

$$\gamma_{ik} = \frac{\lambda_k \, \mathcal{N}_k(x_i)}{\sum_{j=1}^{K} \lambda_j \, \mathcal{N}_j(x_i)},$$

that is, we estimate how likely each Gaussian distribution is to have generated the data point $x_i$.

In the M-step, we re-estimate the Gaussian distributions and weights given the responsibilities $\gamma_{ik}$. In particular, we update $\lambda_k$, $\mu_k$ and $\Sigma_k$ as

$$\lambda_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik} \qquad \mu_k = \frac{\sum_{i=1}^{N} \gamma_{ik} \, x_i}{\sum_{i=1}^{N} \gamma_{ik}} \qquad \Sigma_k = \frac{\sum_{i=1}^{N} \gamma_{ik} \, (x_i - \mu_k)(x_i - \mu_k)^\top}{\sum_{i=1}^{N} \gamma_{ik}}$$

Note that $x_i$ and $\mu_k$ are column vectors, so that the outer product $(x_i - \mu_k)(x_i - \mu_k)^\top$ results in a $D \times D$ matrix.

We then alternate between the E-step and M-step until $p(X)$ does not increase significantly anymore. For example, a common stopping criterion is to stop when $p(X)$ increases by less than 0.01%, or after 100 iterations.
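As a sketch, the EM alternation with this stopping criterion might look as follows in C++. The `emIteration` callback and all names here are illustrative assumptions, not the repository's actual API; the callback is assumed to run one E-step plus one M-step and return the new global log likelihood.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <limits>

// Run EM until the relative improvement of the global log likelihood drops
// below 0.01%, or after 100 iterations. Returns the number of iterations run.
// (Sketch with an assumed callback signature, not the repo's real interface.)
int runEM(const std::function<double()>& emIteration) {
    const double kRelTol = 1e-4;   // 0.01% relative increase
    const int kMaxIters = 100;
    double prevLogLik = -std::numeric_limits<double>::infinity();
    int iter = 0;
    while (iter < kMaxIters) {
        double logLik = emIteration();  // one E-step + M-step
        ++iter;
        if (std::isfinite(prevLogLik) &&
            logLik - prevLogLik < kRelTol * std::fabs(prevLogLik)) {
            break;  // converged: improvement below 0.01%
        }
        prevLogLik = logLik;
    }
    return iter;
}
```

Comparing relative (rather than absolute) improvement makes the criterion independent of the data set size $N$.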

Avoiding underflow errors

The procedure described above for GMM training is correct, but it is not possible to implement directly. A straight implementation of the previous formulas leads to underflow errors and singular matrices, which we must avoid in a robust implementation.

The likelihoods and responsibilities involved in GMM are very frequently too small to be directly represented as floating-point numbers (and, even more so, their multiplication). An effective solution of the underflow problem is to use log likelihoods and the logsumexp trick.

First, we must use the Gaussian log likelihood $\log \mathcal{N}_k(x)$ instead of the linear likelihood, as

$$\log \mathcal{N}_k(x) = -\frac{1}{2} \left( D \log(2\pi) + \log \lvert \Sigma_k \rvert + (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right)$$
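As an illustration, the Gaussian log likelihood for the special case of a diagonal covariance $\Sigma_k = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_D^2)$ can be written as below. This is a simplified sketch with illustrative names; the full-covariance case additionally needs $\log\lvert\Sigma_k\rvert$ and $\Sigma_k^{-1}$, e.g., from a Cholesky factorization.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// log N(x | mu, Sigma) for a diagonal covariance Sigma = diag(var).
// Works entirely in the log domain, so no term can underflow.
double gaussianLogLikelihood(const std::vector<double>& x,
                             const std::vector<double>& mu,
                             const std::vector<double>& var) {
    const double kLog2Pi = std::log(2.0 * 3.14159265358979323846);
    double logDet = 0.0;       // log|Sigma| = sum_d log var_d
    double mahalanobis = 0.0;  // (x - mu)^T Sigma^{-1} (x - mu)
    for (std::size_t d = 0; d < x.size(); ++d) {
        double diff = x[d] - mu[d];
        logDet += std::log(var[d]);
        mahalanobis += diff * diff / var[d];
    }
    return -0.5 * (x.size() * kLog2Pi + logDet + mahalanobis);
}
```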

We must also use the log likelihood of the whole GMM instead of the linear likelihood, but this calculation is slightly more convoluted. Given that the formula $p(x) = \sum_k \lambda_k \, \mathcal{N}_k(x)$ performs likelihood additions, using $\log \mathcal{N}_k(x)$ does not pose any immediate advantage (because we cannot directly add log likelihoods). We use instead the logsumexp trick LSE(), which states that

$$\log \sum_{k} e^{y_k} = \max_k(y_k) + \log \sum_{k} e^{y_k - \max_k(y_k)}$$

In LSE(), we scale the terms $e^{y_k}$ in the summation by the largest term $e^{\max_k(y_k)}$ and convert the scaled terms to the linear domain instead.

Let us give an example of the logsumexp trick at work. Let $y_1 = -1000$ and $y_2 = -1001$. We wish to compute $\log(e^{y_1} + e^{y_2})$. The direct evaluation requires calculating $e^{-1000}$ and $e^{-1001}$, which causes underflow (regardless of the representation, float or double), and therefore $\log(e^{-1000} + e^{-1001}) = \log(0 + 0) = -\infty$. Using logsumexp, this is $\log(e^{-1000} + e^{-1001}) = -1000 + \log(e^{0} + e^{-1}) \approx -999.69$.
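The trick above is a few lines of C++ (names are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// logsumexp: computes log(sum_k exp(y_k)) without underflow by factoring
// out the largest term, exactly as in LSE() above.
double logSumExp(const std::vector<double>& y) {
    double yMax = *std::max_element(y.begin(), y.end());
    double sum = 0.0;
    for (double yk : y) sum += std::exp(yk - yMax);  // largest term -> exp(0) = 1
    return yMax + std::log(sum);
}
```

On the worked example, `logSumExp({-1000, -1001})` returns approximately `-999.69`, while the naive `std::log(std::exp(-1000) + std::exp(-1001))` underflows to `-inf`.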

The GMM log likelihood can be expressed as $\log p(x) = \log \sum_k \lambda_k \, \mathcal{N}_k(x)$. Given that we already calculate $\log \mathcal{N}_k(x)$ instead of $\mathcal{N}_k(x)$, the GMM log likelihood becomes

$$\log p(x) = \log \sum_{k} e^{\log \lambda_k + \log \mathcal{N}_k(x)}$$

$$= \max_k(y_k) + \log \sum_{k} e^{y_k - \max_k(y_k)}, \quad \text{with } y_k = \log \lambda_k + \log \mathcal{N}_k(x)$$

Analogously, the E-step becomes

$$\log \gamma_{ik} = \log \lambda_k + \log \mathcal{N}_k(x_i) - \log p(x_i)$$

and the responsibilities $\gamma_{ik}$ are computed from $\log \gamma_{ik}$, i.e., $\gamma_{ik} = e^{\log \gamma_{ik}}$.

The M-step does not require any changes to prevent underflows. Finally, the global GMM log likelihood becomes

$$\log p(X) = \sum_{i=1}^{N} \log p(x_i)$$
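The log-domain E-step for one data point can be sketched as follows, assuming the per-component terms $y_k = \log \lambda_k + \log \mathcal{N}_k(x_i)$ have already been computed (function and parameter names are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Given y_k = log(lambda_k) + log N_k(x_i) for one data point x_i, compute
// log p(x_i) with logsumexp and the responsibilities
// gamma_ik = exp(y_k - log p(x_i)).
std::vector<double> responsibilities(const std::vector<double>& y,
                                     double* logPx = nullptr) {
    double yMax = *std::max_element(y.begin(), y.end());
    double sum = 0.0;
    for (double yk : y) sum += std::exp(yk - yMax);
    double lse = yMax + std::log(sum);  // log p(x_i)
    if (logPx) *logPx = lse;
    std::vector<double> gamma(y.size());
    for (std::size_t k = 0; k < y.size(); ++k)
        gamma[k] = std::exp(y[k] - lse);  // exp(log gamma_ik)
    return gamma;
}
```

Even though the individual likelihoods underflow, the responsibilities come out well-scaled (they sum to 1 by construction).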

Avoiding singular matrix inversions

The second main problem in a robust GMM implementation is the appearance of singular matrix inversions. This issue commonly arises with low-variance patches. For example, an image with a section of saturated pixels (e.g., the camera is pointing at a light) contains an area with constant color and zero variance. If we attempt to train a GMM on such an image, all pixels in the zero-variance patch will be clustered together, but the evaluation of the Gaussian log likelihood $\log \mathcal{N}_k(x)$ will fail because it requires the inverse covariance matrix $\Sigma_k^{-1}$. A patch with constant color has a singular covariance matrix (all zeros), which is not invertible.

The simplest solution to this problem is to add bounds to the computation of the estimated covariance matrices. In particular, after evaluating each covariance matrix, we evaluate its reciprocal condition number

$$\mathrm{RCOND}(\Sigma_k) = \frac{1}{\lVert \Sigma_k \rVert \, \lVert \Sigma_k^{-1} \rVert}$$

In well-conditioned matrix inversions, RCOND is close to 1, whereas it approaches zero for ill-conditioned (close to singular) matrix inversions. In our implementation, we monitor each matrix so that $\mathrm{RCOND}(\Sigma_k) \geq \epsilon$, for a small threshold $\epsilon$. If this condition is not met, we force $\Sigma_k$ to be a diagonal matrix with small (but well-conditioned) variance.
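For the diagonal-covariance case, $\mathrm{RCOND}$ reduces to the ratio of the smallest to the largest variance, and the check can be sketched as below. The threshold and floor values are illustrative assumptions, not the ones used in the repository; a full-covariance implementation would estimate the norms from a factorization instead.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// RCOND check for Sigma = diag(var): RCOND = min(var) / max(var) (2-norm).
// If Sigma is too close to singular, fall back to a well-conditioned
// diagonal by enforcing a small floor variance.
void regularizeDiagonalCovariance(std::vector<double>& var,
                                  double rcondMin = 1e-6,
                                  double varFloor = 1e-3) {
    double vMax = *std::max_element(var.begin(), var.end());
    double vMin = *std::min_element(var.begin(), var.end());
    double rcond = (vMax > 0.0) ? vMin / vMax : 0.0;
    if (rcond < rcondMin) {
        for (double& v : var) v = std::max(v, varFloor);  // enforce floor
    }
}
```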

Initialization

The training of a GMM requires some initialization for the means and covariances. A common approach is to use K-Means as a starting point. In our case, we implemented a basic K-Means algorithm with Forgy initialization. We use the output cluster centroids and cluster variances to initialize our GMM distributions, with the cluster centroids becoming the GMM means $\mu_k$ and the cluster variances becoming diagonal covariance matrices $\Sigma_k$.
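Forgy initialization simply picks $K$ distinct data points at random as the initial centroids; a minimal sketch (names are illustrative, not the repo's API):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Forgy initialization: choose K distinct data points uniformly at random
// as the initial K-Means centroids.
std::vector<std::vector<double>>
forgyInit(const std::vector<std::vector<double>>& data, int k, unsigned seed) {
    std::vector<std::size_t> idx(data.size());
    std::iota(idx.begin(), idx.end(), 0);   // 0, 1, ..., N-1
    std::mt19937 rng(seed);
    std::shuffle(idx.begin(), idx.end(), rng);  // random distinct indices
    std::vector<std::vector<double>> centroids;
    for (int j = 0; j < k; ++j) centroids.push_back(data[idx[j]]);
    return centroids;
}
```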

Data whitening

Clustering algorithms like K-Means and GMM show slower convergence when the data is badly scaled, or if there is a great disparity in the variance of different features. A common solution to this problem is to perform a data whitening step prior to clustering. To whiten a data set, we rescale each feature (e.g., the R, G, and B channels in an RGB pixel) in the feature vector $x_i$ so that it has unit variance. Consider the scaling matrix

$$W = \begin{bmatrix} 1/\sigma_1 & & \\ & \ddots & \\ & & 1/\sigma_D \end{bmatrix},$$

where $\sigma_d$ is the standard deviation of feature $d$ over the data set. The data whitening of the feature vector $x_i$ is then

$$\hat{x}_i = W x_i$$

We use the whitened data set $\hat{X} = \{\hat{x}_1, \ldots, \hat{x}_N\}$ as an input to K-Means and then to GMM. After the GMM has converged on the whitened data, we rescale the whitened means $\hat{\mu}_k$ and covariances $\hat{\Sigma}_k$ to their original values. In particular, $\mu_k = W^{-1} \hat{\mu}_k$ and $\Sigma_k = W^{-1} \hat{\Sigma}_k W^{-\top}$.
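A whitening step over a data set can be sketched as follows (an in-place illustration; the function returns the per-feature standard deviations $\sigma_d$, i.e., the diagonal of $W^{-1}$, which are needed later to map the means and covariances back to the original units):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Rescale each feature of the data set to unit variance (x_hat = W x, with
// W = diag(1/sigma_1, ..., 1/sigma_D)). Returns the sigma_d values.
std::vector<double> whiten(std::vector<std::vector<double>>& data) {
    std::size_t dims = data[0].size();
    std::size_t n = data.size();
    std::vector<double> mean(dims, 0.0), sigma(dims, 0.0);
    for (const auto& x : data)
        for (std::size_t d = 0; d < dims; ++d) mean[d] += x[d] / n;
    for (const auto& x : data)
        for (std::size_t d = 0; d < dims; ++d)
            sigma[d] += (x[d] - mean[d]) * (x[d] - mean[d]) / n;
    for (std::size_t d = 0; d < dims; ++d) sigma[d] = std::sqrt(sigma[d]);
    for (auto& x : data)
        for (std::size_t d = 0; d < dims; ++d)
            if (sigma[d] > 0.0) x[d] /= sigma[d];  // x_hat = W x
    return sigma;
}
```

Guarding against `sigma[d] == 0` keeps zero-variance features untouched rather than dividing by zero; such features are then handled by the RCOND bound described above.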

License

MIT License