A real-world implementation of a Gaussian Mixture Model in C++, without the pain.
A Gaussian Mixture Model (GMM) is a probability distribution defined as a linear combination of weighted Gaussian distributions. It is commonly used in computer vision and image processing tasks, such as estimating a color distribution for foreground/background segmentation, or in clustering problems. This project is intended as an educational tool on how to properly implement a Gaussian Mixture Model.
GMMs are annoying to implement. The math behind GMMs is very easy to understand, but it is not possible to take the formulas and implement them directly. A straight implementation of the GMM formulas leads to underflow errors, singular matrices, divisions by zero, and NaNs. The likelihoods involved in a GMM are very frequently too small to be directly represented as floating-point numbers (and, even more so, their products). In the following paragraphs and code, I show the changes needed to take GMM from theory to a robust real-world implementation. Therefore, this is an implementation of GMM without the pain: a Painless GMM.
A GMM is a probability distribution defined as a linear combination of weighted Gaussian distributions,

$$p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),$$

with weights $w_k$, means $\mu_k$, and covariance matrices $\Sigma_k$. We simplify this notation in the following sections as $\mathcal{N}_k(x) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$, and the GMM likelihood then becomes $p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}_k(x)$.
For more information about GMMs, see Reynolds' GMM tutorial or the Wikipedia page.
We start with a data set $X = \{x_1, \dots, x_N\}$ of $D$-dimensional feature vectors (e.g., $D = 3$ for RGB color pixels), an initial set of $K$ Gaussian distributions $\mathcal{N}_k$ (initialized as described below), and weights $w_k$. We use the Expectation-Maximization (EM) algorithm to optimize the Gaussian distributions and weights that maximize the global GMM likelihood $p(X)$, that is, the mixture of Gaussian distributions and weights that best fits the data set $X$.
The EM algorithm is an optimization algorithm which maximizes $p(X)$ by coordinate ascent, alternating between expectation steps (E-steps) and maximization steps (M-steps). The algorithm starts with an initial E-step.
In the E-step, we determine the responsibility $r_{nk}$ of each Gaussian distribution $\mathcal{N}_k$ for each training data point $x_n$, as

$$r_{nk} = \frac{w_k \, \mathcal{N}_k(x_n)}{\sum_{j=1}^{K} w_j \, \mathcal{N}_j(x_n)},$$

that is, we estimate how likely each Gaussian distribution $\mathcal{N}_k$ is to have generated the data point $x_n$.
In the M-step, we re-estimate the Gaussian distributions and weights given the responsibilities $r_{nk}$. In particular, we update $w_k$, $\mu_k$, and $\Sigma_k$ as

$$w_k = \frac{1}{N} \sum_{n=1}^{N} r_{nk}, \qquad
\mu_k = \frac{\sum_{n=1}^{N} r_{nk} \, x_n}{\sum_{n=1}^{N} r_{nk}}, \qquad
\Sigma_k = \frac{\sum_{n=1}^{N} r_{nk} \, (x_n - \mu_k)(x_n - \mu_k)^\top}{\sum_{n=1}^{N} r_{nk}}.$$

Note that $x_n$ and $\mu_k$ are considered column vectors, so that the outer product $(x_n - \mu_k)(x_n - \mu_k)^\top$ results in a $D \times D$ matrix.
We then alternate between the E-step and M-step until $p(X)$ does not increase significantly anymore. For example, a common stopping criterion is to stop when $p(X)$ increases by less than 0.01%, or after 100 iterations.
The procedure described above for GMM training is correct, but it is not possible to implement directly. A straight implementation of the previous formulas leads to underflow errors and singular matrices, which we must avoid in a robust implementation.
The likelihoods and responsibilities involved in GMM are very frequently too small to be directly represented as floating-point numbers (and, even more so, their products). An effective solution to the underflow problem is to use log likelihoods and the logsumexp trick.
First, we must use the Gaussian log likelihood $\ln \mathcal{N}_k(x)$ instead of the linear likelihood, as

$$\ln \mathcal{N}_k(x) = -\frac{1}{2} \left( D \ln(2\pi) + \ln \lvert \Sigma_k \rvert + (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right).$$
We must also use the log likelihood of the whole GMM instead of the linear likelihood, but this calculation is slightly more convoluted. Given that the formula $p(x) = \sum_k w_k \, \mathcal{N}_k(x)$ performs likelihood additions, the use of $\ln \mathcal{N}_k(x)$ does not pose any immediate advantage (because we cannot directly add log likelihoods). We use instead the logsumexp trick $\mathrm{LSE}(\cdot)$, which states that

$$\ln \sum_{k} e^{y_k} = \max_k y_k + \ln \sum_{k} e^{y_k - \max_k y_k}.$$

In $\mathrm{LSE}(\cdot)$, we scale the terms in the summation by the largest term and convert the scaled terms to the linear domain instead.
Let us give an example of the logsumexp trick at work. Let $y_1 = -1000$ and $y_2 = -1001$. We wish to compute $\ln(e^{y_1} + e^{y_2})$. The direct evaluation requires calculating $e^{-1000}$ and $e^{-1001}$, which causes underflow (regardless of the representation, float or double), and therefore $\ln(e^{y_1} + e^{y_2}) = \ln(0) = -\infty$. Using logsumexp, this is $\mathrm{LSE}(y_1, y_2) = -1000 + \ln(e^{0} + e^{-1}) \approx -999.69$.
The GMM log likelihood can be expressed as $\ln p(x) = \ln \sum_k e^{\ln w_k + \ln \mathcal{N}_k(x)}$. Given that we already calculate $\ln \mathcal{N}_k(x)$ instead of $\mathcal{N}_k(x)$, the GMM log likelihood becomes

$$\ln p(x_n) = \mathrm{LSE}_k\!\left( \ln w_k + \ln \mathcal{N}_k(x_n) \right).$$
Analogously, the E-step becomes

$$\ln r_{nk} = \ln w_k + \ln \mathcal{N}_k(x_n) - \ln p(x_n),$$

and the responsibilities are computed from $\ln r_{nk}$, i.e., $r_{nk} = e^{\ln r_{nk}}$.
The M-step does not require any changes to prevent underflows. Finally, the global GMM log likelihood becomes

$$\ln p(X) = \sum_{n=1}^{N} \ln p(x_n).$$
The second main problem in a robust GMM implementation is the inversion of singular covariance matrices. This issue commonly arises with low-variance patches. For example, an image with a section of saturated pixels (e.g., the camera is pointing at a light) contains an area with constant color and zero variance. If we attempt to train a GMM on such an image, all pixels in the zero-variance patch will be clustered together, but the evaluation of the Gaussian log likelihood will fail because it requires an inverse covariance matrix. A patch with constant color has a singular covariance matrix (all zeros), which is not invertible.
The simplest solution to this problem is to add bounds to the computation of the estimated covariance matrices. In particular, after evaluating each covariance matrix $\Sigma_k$, we evaluate its reciprocal condition number

$$\mathrm{RCOND}(\Sigma_k) = \frac{\sigma_{\min}(\Sigma_k)}{\sigma_{\max}(\Sigma_k)},$$

where $\sigma_{\min}$ and $\sigma_{\max}$ are the smallest and largest singular values of $\Sigma_k$. In well-conditioned matrix inversions, $\mathrm{RCOND}$ is close to 1, whereas it approaches zero for ill-conditioned (close to singular) matrix inversions. In our implementation, we monitor each matrix so that $\mathrm{RCOND}(\Sigma_k) > \epsilon$, with $\epsilon$ a small positive threshold. If this condition is not met, we force $\Sigma_k$ to be a diagonal matrix with small (but well-conditioned) variance.
The training of a GMM requires some initialization for the means and covariances. A common approach is to use K-Means as a starting point. In our case, we implemented a basic K-Means algorithm with Forgy initialization. We use the output cluster centroids and cluster variances to initialize our GMM distribution, with the cluster centroids becoming the GMM means $\mu_k$ and the cluster variances becoming diagonal covariance matrices $\Sigma_k$.
Clustering algorithms like K-Means and GMM show slower convergence properties when the data is badly scaled, or when there is a great disparity in the variance of the different features. A common solution to this problem is to perform a data whitening step prior to clustering. To whiten a data set, we rescale each feature (e.g., the R, G, and B channels in an RGB pixel) in the feature vector so that it has unit variance. Consider the scaling matrix

$$W = \mathrm{diag}\!\left( \frac{1}{\sigma_1}, \dots, \frac{1}{\sigma_D} \right),$$

where $\sigma_d$ is the standard deviation of the $d$-th feature over the data set. The data whitening of the feature vector $x_n$ is then

$$\hat{x}_n = W x_n.$$
We use the whitened data set $\hat{X}$ as an input to K-Means and then to GMM. After the GMM has converged on the whitened data, we rescale the whitened means $\hat{\mu}_k$ and covariances $\hat{\Sigma}_k$ back to the original data space. In particular, $\mu_k = W^{-1} \hat{\mu}_k$ and $\Sigma_k = W^{-1} \hat{\Sigma}_k W^{-1}$.