DandiMahendris / Gaussian-mixture-from-scratch

Explains how the Expectation-Maximization algorithm works for clustering with Gaussian Mixture Models.



Gaussian Mixture Model From Scratch - Including the Expectation-Maximization Algorithm


Table of Contents
  1. Introduction
  2. Components of Learning
  3. Expectation-Maximization Algorithm
  4. Pseudocode of GMM Algorithm
  5. Test make_blobs Dataset

Introduction

  Modelling a real dataset with a single Gaussian distribution can suffer from significant limitations when the kernel density of the dataset shows two or more Gaussian components. The algorithm works by grouping instances into groups, or clusters, each assumed to be generated by a Gaussian distribution. The Gaussian Mixture initializes the covariances of the clusters according to the covariance type, which represents the distribution of each cluster.

Figure 1. Gaussian Mixture Model applied to the data

   While the K-Means algorithm works by grouping instances into clusters based on the closest distance between the points and the centroids using the Euclidean distance, it does not estimate the covariances of the clusters (Figure 2).

Figure 2. K-Means clustering applied to the data

A GMM is a probabilistic model that assumes the instances were generated from a mixture of two or more Gaussian distributions whose parameters are unknown. All the instances generated from a single Gaussian distribution form a cluster that typically looks like an ellipsoid, and each cluster can have a different ellipsoidal shape, size, density, and orientation.

Figure 3. Gaussian clusters with different ellipsoidal shape, size, density, and orientation

Let’s see how a dataset with two or more Gaussian distributions looks when visualized in two dimensions (Figure 4). Sometimes the dataset contains a superposition of two Gaussian clusters, which a Gaussian Mixture will tease apart: each point is assigned to the cluster whose Gaussian it has the highest probability of having come from.

Figure 4. Two classes, each with its own multivariate Gaussian distribution (left). If we mix the datasets and remove the labels, the composite dataset is no longer Gaussian, but we know it is composed of two Gaussian distributions (right).

(back to top)

Components of Learning

  • Hypothesis:
      Instances are generated from two or more Gaussian distributions whose parameters are unknown.

  • Parameters:
      Weights, Mean, Covariance, Responsibilities.

  • Learning Algorithms:
      Objective: Maximize weighted log-likelihood

      by: Expectation-Maximization method

  • Input:
      X: Input data

  • Output:
      Centroids
      Labels

  • Hyperparameters:
      k               : number of centroids (components)
      n_init          : number of initializations
      max_iter        : maximum number of iterations
      tol             : change in the norm of the weighted log-likelihood below which the fit is considered converged
      covariance_type : {full, tied, spherical, diag}
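For orientation, here is a minimal sketch of how these hyperparameters might map onto a from-scratch estimator's constructor. The class name, argument names, and default values are illustrative, not necessarily those used in this repository.

class GaussianMixtureScratch:
    """Illustrative constructor mirroring the hyperparameters listed above."""

    def __init__(self, k=2, n_init=1, max_iter=100, tol=1e-3, covariance_type="full"):
        self.k = k                              # number of centroids (components)
        self.n_init = n_init                    # number of initializations
        self.max_iter = max_iter                # maximum EM iterations per initialization
        self.tol = tol                          # convergence threshold on the change in normalized log-likelihood
        self.covariance_type = covariance_type  # one of {"full", "tied", "diag", "spherical"}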

(back to top)

Expectation-Maximization Algorithm

1. Gaussian Mixture


Expectation-Maximization is a method used in the Gaussian mixture algorithm to find maximum likelihood solutions for models with latent variables (McLachlan and Krishnan, 1997). To give a better understanding of how the EM method works, consider a K-dimensional binary random variable $z$ having a 1-of-K representation in which a particular element $z_k$ is equal to 1 and all other elements are equal to 0. The values of $z_k$ therefore satisfy $z_k \in \{0,1\}$ and $\sum_{k} z_k = 1$. These indicator values provide the first initialization of our algorithm and are stored in an array named responsibilities, a 2-D array of shape (n_samples, n_clusters): for each point $i$, the entry of the closest cluster is set to 1 and all other entries are set to 0.

We shall define the joint distribution $P(x,z)$ in terms of marginal distribution $P(z)$ and a conditional distribution $P(x|z)$ (Figure 5). We noted that the mean of the conditional distribution $P(x_a|x_b)$ was a linear function of $x_b$.

Figure 5. (left) shows the contours of a Gaussian distribution P(xa, xb) over two variables, (right) shows the marginal distribution P(xa) (blue curve) and the conditional distribution P(xa|xb) for xb = 0.7

Here we shall suppose that we are given a Gaussian marginal distribution $p(x)$ and a Gaussian conditional distribution $p(x|z)$, which has a mean that is a linear function of the conditioning variable and a covariance which is independent of it. The marginal distribution over $z$ is specified in terms of the mixing coefficients $\pi_k$, with:

$$p(z_k=1) = \pi_k \tag{1}$$

where the parameters satisfy $0 \leq \pi_k \leq 1$ together with $\sum_{k=1}^{K} \pi_k = 1$.

Similarly, the conditional distribution of $x$ given a particular value of $z$ is a Gaussian $p(x|z_k=1)$, described by:


$$p(x|z_k=1) = \mathcal{N}(x | \mu_k, \Sigma_k) \tag{2}$$

The joint distribution is then given by $p(z)p(x|z)$, and the marginal distribution of $x$, $p(x)$, is obtained by summing the joint distribution over all possible states of $z$, giving equation 3 below. It follows that for every observed data point $x_n$ there is a corresponding latent variable $z_n$:

$$p(x) = \sum_{z} p(z)\,p(x|z) = \sum_{k=1}^{K} \pi_k \ \mathcal{N}(x|\mu_k, \Sigma_k) \tag{3}$$

Moreover, we can work with the joint distribution $p(x, z)$ instead of the marginal distribution $p(x)$, and this leads to the introduction of the Expectation-Maximization (EM) method. Another quantity that plays an important role in this algorithm is the conditional probability of $z$ given $x$, i.e. the posterior probability. This conditional probability is known as the responsibility $γ(z_k)$, which denotes the value of $p(z_k = 1|x)$ and is found using Bayes' theorem:

$$γ(z_k) = p(z_k=1|x) = \frac{\pi_k \ \mathcal{N}(x|\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \ \mathcal{N}(x|\mu_j, \Sigma_j)} \tag{4}$$

The value $γ(z_k)$ can be interpreted as the responsibility that component $k$ takes for explaining the observation $x$.
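As a small illustration of equation 4, the responsibilities of a single observation can be computed with numpy and scipy, assuming the mixture weights, means, and covariances are already known. The toy parameters below are purely illustrative.

import numpy as np
from scipy.stats import multivariate_normal

# Toy 2-component mixture in 2-D (illustrative parameters)
weights = np.array([0.4, 0.6])                      # pi_k
means = np.array([[0.0, 0.0], [3.0, 3.0]])          # mu_k
covs = np.array([np.eye(2), 2.0 * np.eye(2)])       # Sigma_k

x = np.array([1.0, 1.5])                            # a single observation

# Numerator of eq. 4: pi_k * N(x | mu_k, Sigma_k) for each component k
numer = np.array([w * multivariate_normal(m, c).pdf(x)
                  for w, m, c in zip(weights, means, covs)])

# gamma(z_k): normalize so the responsibilities sum to 1 over the components
resp = numer / numer.sum()
print(resp)                                         # two values that sum to 1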

Figure 6. Components of the Gaussian mixture shown by the marginal distribution of each Gaussian

(back to top)

2. Maximum Likelihood


Suppose we want to model a dataset of observations {$x_1, x_2, \dots, x_N$} with a Gaussian mixture. We can represent this data as an $N \times D$ matrix $X$ whose $n$th row is $x_n$, with the corresponding latent variables denoted by an $N \times K$ matrix $Z$. The log-likelihood function is given by:

$$\ln \ p(X|\pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \ \mathcal{N}(x_n|\mu_k, \Sigma_k) \right) \tag{5}$$

However, there is a significant problem associated with maximum likelihood due to the presence of singularities. Suppose that one of the components of the mixture model, say the $j$th, has its mean $\mu_j$ exactly equal to one of the data points, so that $\mu_j = x_n$ for some value of $n$.

This data point will then contribute a term to the likelihood function of the form:

$$\mathcal{N}(x_n|\mu_j, \sigma_j^2 I) = \frac{1}{(2 \pi)^{1/2}} \frac{1}{\sigma_j} \tag{6}$$

If we consider the limit $\sigma_j \to 0$ in the denominator, this term, and hence the log-likelihood, goes to infinity.
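The collapse can be seen numerically by evaluating a Gaussian density at its own mean while shrinking the standard deviation. The short sketch below only illustrates the singularity; it is not part of the repository code.

import numpy as np
from scipy.stats import multivariate_normal

x = np.array([1.0, 2.0])    # a data point, and a component whose mean sits exactly on it
for sigma in [1.0, 0.1, 0.01, 1e-4]:
    density = multivariate_normal(mean=x, cov=sigma**2 * np.eye(2)).pdf(x)
    print(f"sigma={sigma:>7}: N(x_n | mu_j=x_n, sigma^2 I) = {density:.3e}")
# The density, and hence its contribution to the likelihood, grows without bound as sigma -> 0.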

(back to top)

3. Expectation-Maximization Method (EM)


The GMM algorithm first initializes the means $(\mu_k)$, weights $(\pi_k)$, and covariances $(\Sigma_k)$; the update formulas below are obtained by setting the derivatives of $\ln \ p(X|\pi, \mu, \Sigma)$ with respect to the means and covariances to zero.

$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} γ(z_{nk}) \, x_n \tag{7}$$

$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} γ(z_{nk}) \, (x_n-\mu_k)^T (x_n-\mu_k) \tag{8}$$

where $N_k$ is defined as the effective number of points $(x_n)$ assigned to cluster $k$:

$$N_k = \sum_{n=1}^{N} γ(z_{nk}) \tag{9}$$

$$\pi_k = \frac{N_k}{N} \tag{10} $$

In the expectation step (E-step), we use these initial values of the parameters to evaluate the posterior probabilities, or responsibilities $γ(z)$, as shown in equation 4.

Then these responsibilities $γ(z)$ are used in the maximization step (M-step) to re-estimate the weights ($\pi$), means ($\mu$), and covariances ($\Sigma$).
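A minimal numpy sketch of the re-estimation formulas 7-10 is given below. It assumes a data matrix X of shape (N, D) and a responsibility matrix resp of shape (N, K); the function name is illustrative.

import numpy as np

def m_step(X, resp):
    """Re-estimate pi_k, mu_k, Sigma_k from data X (N x D) and responsibilities resp (N x K)."""
    N, D = X.shape
    Nk = resp.sum(axis=0)                              # eq. 9: effective number of points per cluster
    weights = Nk / N                                   # eq. 10: mixing coefficients pi_k
    means = (resp.T @ X) / Nk[:, None]                 # eq. 7: weighted means mu_k, shape (K, D)
    covs = np.empty((resp.shape[1], D, D))
    for k in range(resp.shape[1]):                     # eq. 8: weighted covariances Sigma_k
        diff = X - means[k]
        covs[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]
    return weights, means, covs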

(back to top)

Pseudocode of GMM Algorithm

Figure 7 EM algorithm pseudocode

In order to initialize the values of responsibilities, $γ(z_k)$, it is necessary to first define a parameter initialization to determine the initial centroid of the cluster and subsequently find the other centroids using the k-means++ algorithm. An object, represented as a 2-D array, is created to store the values of responsibilities with a configuration of (N x K). By examining the distance between each sample (n) and the clusters, we can determine the index or identify which cluster is the closest to each sample. For the responsibilities associated with sample n and cluster k, a value of 1 is assigned, while the responsibilities for the other clusters are set to 0.
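One possible way to implement this initialization is sketched below, using scikit-learn's kmeans_plusplus seeding; the helper name is illustrative, and the repository's own initializer may differ in detail.

import numpy as np
from sklearn.cluster import kmeans_plusplus

def init_responsibilities(X, n_clusters, random_state=0):
    """One-hot (N x K) responsibilities: 1 for the nearest k-means++ centroid, 0 elsewhere."""
    centroids, _ = kmeans_plusplus(X, n_clusters=n_clusters, random_state=random_state)
    # Squared Euclidean distance from every sample to every centroid, shape (N, K)
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = dist.argmin(axis=1)
    resp = np.zeros((X.shape[0], n_clusters))
    resp[np.arange(X.shape[0]), nearest] = 1.0
    return resp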

(back to top)

1. Initialization of weight, mean, and covariance


The initialization values of $\pi_0$, $\mu_0$, and $\Sigma_0$ are calculated from the previously obtained responsibilities. However, the values of the cluster covariances depend heavily on the covariance type defined earlier. The value of $N_k$ is determined first by summing the responsibilities matrix over the samples, as indicated by equation 9, and is stored as an array of shape (K,). The means $\mu_0$ are computed using equation 7, by taking the dot product of the transposed responsibilities matrix with the data matrix $X$ and dividing by $N_k$, and are stored as a (K x D) array. The covariances $\Sigma_0$ are stored as an ndarray with dimensions (K x D x D). Performing these calculations, we obtain the initial values of $\pi_0$, $\mu_0$, and $\Sigma_0$ from the responsibilities $γ(z_{nk})$ obtained earlier.

Figure 8 Initialization of weight, mean, and covariance

In GMM parameter estimation, one common approach is to estimate the mean and covariance matrix of each component using the Expectation-Maximization (EM) algorithm. However, estimating the covariance matrix directly can be challenging because it needs to be positive definite; in some cases it may become ill-conditioned or singular, causing numerical instability or convergence issues. To overcome these challenges, the Cholesky decomposition is often employed: instead of working with the covariance matrix directly, its Cholesky decomposition is used. The Cholesky decomposition expresses the covariance matrix as the product of a lower triangular matrix and its transpose.

The Cholesky decomposition of a Hermitian positive-definite matrix $A$ is a decomposition of the form $A = LL^T$, where $L$ is a lower triangular matrix with real and positive diagonal entries and $L^T$ denotes the conjugate transpose of $L$. Every Hermitian positive-definite matrix (and thus also every real-valued symmetric positive-definite matrix) has a unique Cholesky decomposition. Using the Cholesky decomposition avoids the singularity effect that is the main obstacle to reaching the maximum likelihood, since it ensures that the precision matrix (inverse covariance) remains positive definite, as required for a valid Gaussian distribution.

Figure 9 Algorithm of Precision Cholesky
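A sketch of computing this precision Cholesky factor for the full covariance type is shown below, following the same idea scikit-learn uses; the small regularization term and the function name are assumptions of this sketch.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def compute_precisions_cholesky(covs, reg=1e-6):
    """For each Sigma_k in a (K x D x D) array, return the Cholesky factor of its inverse (the precision)."""
    K, D, _ = covs.shape
    precisions_chol = np.empty_like(covs)
    for k in range(K):
        cov_chol = cholesky(covs[k] + reg * np.eye(D), lower=True)   # Sigma_k = L L^T
        # Solve L B = I, so B = L^{-1}; the precision is B^T B and its Cholesky factor is B^T
        precisions_chol[k] = solve_triangular(cov_chol, np.eye(D), lower=True).T
    return precisions_chol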

(back to top)

2. Expectation Step (E-Step)


The expectation step computes the normalized log-likelihood over the N x K components. The normalized log-likelihood is obtained from the weighted log-likelihood of equation 11 by applying scipy.special.logsumexp over the components. The expectation step also returns the log responsibilities $\ln \ γ(z_{nk})$, computed as the difference between the weighted log-likelihood and the normalized log-likelihood.

$$\ln \ p(x, z |\mu, \Sigma, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left( \ln \ \pi_k + \ln \ \mathcal{N}(x_n|\mu_k, \Sigma_k) \right) \tag{11}$$

  • Log Determinant

The first step in the expectation is to calculate the log-determinant of the Cholesky factor of the precision matrix (inverse covariance), which provides information about the volume or spread of the Gaussian distributions in the GMM.

  • Log Gaussian Probabilities

The log Gaussian probability $\ln \ p(x|z)$ is computed by transforming each data point with the precision Cholesky factor and the mean of each component, and then taking the squared Euclidean distance between the transformed data point and the transformed mean.

$$ \ln \ \mathcal{N}(x_n|\mu_k, \Sigma_k) = -\frac{1}{2} \left( D \cdot \ln(2\pi) + \text{log\_prob} \right) + \text{log\_det} $$

Figure 10 Algorithm of Log Determinant and Gaussian Probabilities
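The two quantities can be combined into a single estimate of the log Gaussian probabilities for the full covariance type, as in the compact sketch below; the function and variable names are illustrative.

import numpy as np

def estimate_log_gaussian_prob(X, means, precisions_chol):
    """Log-density of each sample under each component, shape (N, K), for 'full' covariances."""
    N, D = X.shape
    K = means.shape[0]
    # log|precision_chol_k| is the sum of the logs of its diagonal entries;
    # it equals -0.5 * log|Sigma_k|, the log-determinant term of the Gaussian
    log_det = np.array([np.sum(np.log(np.diag(precisions_chol[k]))) for k in range(K)])
    log_prob = np.empty((N, K))
    for k in range(K):
        # Whitened residuals: y = (x - mu_k) @ prec_chol_k, so ||y||^2 is the Mahalanobis distance
        y = (X - means[k]) @ precisions_chol[k]
        log_prob[:, k] = np.sum(y ** 2, axis=1)
    return -0.5 * (D * np.log(2 * np.pi) + log_prob) + log_det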

  • Weighted Log likelihood

The weighted log-likelihood is the sum of the log Gaussian probabilities, which refer to $\ln \ p(x|z)$, and the log weights $\ln \ \pi_k$:

$$ \ln \ p(x, z|\mu, \Sigma, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left( \ln \ \pi_k + \ln \ \mathcal{N}(x_n|\mu_k, \Sigma_k) \right) \tag{12} $$

This value is returned as the final value once convergence is reached, and it determines, for each point, the probabilities of the clusters it may be assigned to.

Figure 11 Algorithm of weighted log likelihood

  • Log Responsibilities

The log responsibilities are returned as the difference between the weighted log-likelihood and the normalized log-likelihood:

$$ \ln \ γ(z_{nk}) = \ln \ p(x, z|\mu, \Sigma, \pi) - \ln \sum_{k=1}^{K} e^{\ln \ p(x, z|\mu, \Sigma, \pi)} $$

Figure 12 Algorithm of log responsibilities
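Putting the pieces together, a sketch of an E-step that returns the mean normalized log-likelihood and the log responsibilities is shown below. It builds on the estimate_log_gaussian_prob helper sketched earlier and uses scipy.special.logsumexp for the normalizer; the names are illustrative.

import numpy as np
from scipy.special import logsumexp

def e_step(X, weights, means, precisions_chol):
    """Return the mean normalized log-likelihood and the (N x K) log responsibilities."""
    # Weighted log-likelihood: ln pi_k + ln N(x_n | mu_k, Sigma_k), shape (N, K)
    weighted_log_prob = estimate_log_gaussian_prob(X, means, precisions_chol) + np.log(weights)
    # Normalized log-likelihood of each sample: ln p(x_n) = logsumexp over the K components
    log_prob_norm = logsumexp(weighted_log_prob, axis=1)
    # Log responsibilities: ln gamma(z_nk) = weighted log-likelihood minus the normalizer
    log_resp = weighted_log_prob - log_prob_norm[:, None]
    return np.mean(log_prob_norm), log_resp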

(back to top)

3. Maximization Step (M-Step)

These log responsibilities $\ln \ γ(z_{nk})$ are used in the maximization step (M-step) to compute the new values of $\pi_k$, $\mu_k$, $\Sigma_k$, and the precision Cholesky factor (inverse covariance), which are used in the next expectation step (E-step) if convergence has not yet been reached.

Convergence is reached when the change between the normalized log-likelihood of the current iteration and that of the previous iteration is lower than the tolerance we set beforehand (default = 0.001). The parameters $\pi_k$, $\mu_k$, $\Sigma_k$, and the precision Cholesky factor from the best run are then stored as the best parameters over the n_init initializations.

Figure 13 Maximization step
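A bare-bones sketch of how the E- and M-steps might be iterated with this convergence test is shown below. It reuses the init_responsibilities, m_step, compute_precisions_cholesky, and e_step helpers sketched earlier; the default values are illustrative.

import numpy as np

def fit_em(X, n_clusters, max_iter=100, tol=1e-3, random_state=0):
    """Run EM until the change in the mean normalized log-likelihood drops below tol."""
    resp = init_responsibilities(X, n_clusters, random_state)   # one-hot init (sketched earlier)
    weights, means, covs = m_step(X, resp)                      # initial pi, mu, Sigma
    lower_bound = -np.inf
    for _ in range(max_iter):
        precisions_chol = compute_precisions_cholesky(covs)
        mean_log_prob, log_resp = e_step(X, weights, means, precisions_chol)
        weights, means, covs = m_step(X, np.exp(log_resp))      # M-step on the soft assignments
        change = mean_log_prob - lower_bound
        lower_bound = mean_log_prob
        if abs(change) < tol:                                   # converged
            break
    return weights, means, covs, log_resp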

(back to top)

4. Predicted clusters of n_samples


Figure 14. GMM Predict pseudocode

The predict function returns the argmax of the log responsibilities, which assigns each point $n$ to the cluster with the highest probability.

Figure 15. Log Responsibilities argmax
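In code, the hard assignment is simply the column index of the largest log responsibility in each row; a one-function sketch, assuming the (N x K) log_resp array produced by the E-step sketched earlier:

import numpy as np

def predict(log_resp):
    """Hard assignment: the component with the highest (log) responsibility for each sample."""
    return np.argmax(log_resp, axis=1)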

(back to top)

Test make_blobs Dataset


Let's create a dataset with sklearn.datasets.make_blobs to investigate how the algorithm works.

import matplotlib.pyplot as plt
from sklearn import datasets

n_samples = 1000

X, y = datasets.make_blobs(n_samples=n_samples,
                           centers=2,
                           n_features=2,
                           cluster_std=[4, 1],
                           random_state=3)

plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='viridis')
plt.title("make_blobs dataset")
plt.show()

Let's first see how this data looks as a single, unlabeled cluster.

Figure 17. Original Dataset without cluster

Let us now compare how the K-Means clustering algorithm and the Gaussian Mixture Model behave when applied to the original dataset.

  • K-Means clustering works by assigning each data point to the cluster whose centroid is closest to it, based on some distance metric (usually the Euclidean distance).
  • In contrast, a Gaussian Mixture represents each cluster as a probability distribution characterized by its mean and covariance matrix, expressing the likelihood of data points belonging to different clusters as probabilities.

K-Means makes a hard assignment of data points to clusters based on distance, whereas GMM takes a soft assignment approach.
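For reference, the same contrast can be reproduced quickly with scikit-learn's built-in estimators on the X generated above; this is a sketch using the library implementations, not the from-scratch code of this repository.

from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Hard assignment: each point goes to the cluster with the nearest centroid
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=3).fit_predict(X)

# Soft assignment: per-point membership probabilities, hardened with argmax only at the end
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=3).fit(X)
gmm_probs = gmm.predict_proba(X)       # shape (n_samples, 2), each row sums to 1
gmm_labels = gmm_probs.argmax(axis=1)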

The image below shows how the Gaussian mixture behaves without the Cholesky decomposition (inverse covariance), with the number of iterations set to [1, 5, 10, 15, 20, 25].

Figure 19. GMM ellipse without Cholesky decomposition

The image below shows how the Gaussian mixture behaves with the Cholesky decomposition (inverse covariance), with the number of iterations set to [1, 2, 3, 4, 5, 6].

Figure 20. GMM ellipse with Cholesky decomposition

(back to top)

Reference:

  • Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.
  • Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow. O'Reilly Media.
  • McLachlan, G.J., & Krishnan, T. (1997). The EM Algorithm and Extensions. Wiley.
  • Liu, M., Zhang, Q., Dai, Y., & Sun, C. (2018). Free-Resolution Probability Distributions Map-Based Precise Vehicle Localization in Urban Areas. IEEE Transactions on Intelligent Transportation Systems, 19(2), 546-559.
  • Albon, C. (2020). Gaussian Mixture Models and Cluster Validation.
  • Scikit-Learn. GitHub repository.
