[BUG] AutoGMM does poorly on well-separated clusters

ebridge2 opened this issue 2 years ago · comments

Expected Behavior

I can come up with an example use case for AutoGMM that I can demonstrate and justify for the book.

Actual Behavior

The predicted number of clusters and the resulting clusterings are neither accurate nor close to accurate. AutoGMM seems to always favor an extremely high number of clusters. I have played around with different settings for about 2 hours and cannot find one where GMM does appreciably better than K-means and the predicted number of clusters is even close to the true number (the prediction is usually 8 or 9).

Example Code

Three near-perfect Gaussians, separated with extraordinarily high probability (the density of overlap between the different Gaussians is ~0). I cannot seem to get AutoGMM to give me a good clustering where KMeans does appreciably worse and AutoGMM gives me something within the ballpark of the true number of clusters. Maybe that's fine?

Step 1 generates the latent positions...

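import numpy as np
from graspologic.simulations import rdpg

# seed before the first random draw so the cluster assignments are reproducible too
np.random.seed(1234)

# cluster membership probabilities and the sampled assignments
pi = np.array([0.33, 0.33, 0.34])
zs = np.random.choice([0, 1, 2], replace=True, p=pi, size=200)
# the means (one column per cluster)
mus = np.array([[-.7, .7, 0], [.3, .3, .8]])
# the covariances (covars[:, :, z] is the 2x2 covariance of cluster z)
covars = np.stack(([[.005, .05], [.05, .8]], [[.005, -.05], [-.05, .8]], [[0.002, 0], [0, 0.002]]), axis=2)
# latent positions, one row per node
Xtrue = np.array([np.random.multivariate_normal(mus[:,z], covars[:,:,z]) for z in zs])
# probability matrix and a sampled adjacency matrix
P_rdpg = Xtrue @ Xtrue.T
A = rdpg(Xtrue)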

and plot it...

from graspologic.plot import pairplot
_ = pairplot(Xtrue, labels=zs)
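
One way to put a number on the separation claim is the Bhattacharyya coefficient between each pair of components, which is 1 for identical densities and approaches 0 as the overlap vanishes. The following is a minimal sketch using only NumPy and the means and covariances from Step 1; the helper name `bhattacharyya_coefficient` is made up for illustration and is not part of graspologic.

```python
import itertools

import numpy as np

# Means (one column per cluster) and covariances, copied from Step 1.
mus = np.array([[-.7, .7, 0], [.3, .3, .8]])
covars = np.stack(([[.005, .05], [.05, .8]],
                   [[.005, -.05], [-.05, .8]],
                   [[0.002, 0], [0, 0.002]]), axis=2)


def bhattacharyya_coefficient(mu_a, cov_a, mu_b, cov_b):
    """Overlap between two Gaussians: 1 means identical, ~0 means disjoint."""
    cov_avg = (cov_a + cov_b) / 2
    diff = mu_a - mu_b
    # Squared Mahalanobis distance between the means under the averaged covariance.
    mahal = diff @ np.linalg.solve(cov_avg, diff)
    log_det_term = 0.5 * np.log(
        np.linalg.det(cov_avg)
        / np.sqrt(np.linalg.det(cov_a) * np.linalg.det(cov_b))
    )
    return np.exp(-(mahal / 8 + log_det_term))


# Sanity check: every covariance should be positive definite.
min_eigs = [np.linalg.eigvalsh(covars[:, :, z]).min() for z in range(3)]

overlaps = {}
for a, b in itertools.combinations(range(3), 2):
    overlaps[(a, b)] = bhattacharyya_coefficient(
        mus[:, a], covars[:, :, a], mus[:, b], covars[:, :, b]
    )
    print(f"clusters {a} and {b}: overlap coefficient = {overlaps[(a, b)]:.3e}")
```

This only describes the true latent positions; the estimated positions produced by the embedding are noisier, so the separation seen by the clustering step may be smaller.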

Step 2 spectrally embeds...

from graspologic.embed import AdjacencySpectralEmbed
Xhat = AdjacencySpectralEmbed(n_components=3).fit_transform(A)
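
For readers unfamiliar with this step, adjacency spectral embedding can be sketched as a truncated SVD of the adjacency matrix, with the leading singular vectors scaled by the square roots of their singular values. The snippet below is a NumPy-only illustration of that idea, not graspologic's implementation; `ase_sketch` is a hypothetical helper, and it is applied to a noise-free matrix P = X X^T with made-up latent positions (rather than a sampled graph) so that the result can be checked exactly.

```python
import numpy as np


def ase_sketch(A, n_components):
    """Embed each node as a row: top singular vectors scaled by sqrt(singular values)."""
    U, S, _ = np.linalg.svd(A)  # singular values come back in descending order
    return U[:, :n_components] * np.sqrt(S[:n_components])


# Illustrative 2-dimensional latent positions (assumed values, not the issue's data).
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 0.6, size=(50, 2))
P = X @ X.T  # noise-free probability matrix of rank 2

Xhat_sketch = ase_sketch(P, n_components=2)

# The embedding should reproduce P even though it need not equal X itself.
recon_error = np.abs(Xhat_sketch @ Xhat_sketch.T - P).max()
print(f"max reconstruction error: {recon_error:.2e}")
```

The recovered positions are only determined up to an orthogonal transformation, so they are compared through the products X X^T rather than coordinate by coordinate.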

Step 3 performs the clustering...

from graspologic.cluster.autogmm import AutoGMMCluster

autogmm_clust = AutoGMMCluster(max_components=10, random_state=1234)

labels_autogmm_erratic = autogmm_clust.fit_predict(Xhat)
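
To make the comparison with KMeans concrete, the two quantities discussed above (the number of predicted clusters and the agreement with the true labels) can be computed as follows. This is a minimal sketch, assuming scikit-learn is available. It clusters the true latent positions rather than the embedding `Xhat`, and it uses a plain BIC sweep over scikit-learn's `GaussianMixture` as a stand-in for AutoGMM, so its numbers are not a substitute for running `AutoGMMCluster` itself. `summarize` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Regenerate latent positions with the parameters from Step 1.
rng = np.random.default_rng(1234)
pi = np.array([0.33, 0.33, 0.34])
zs = rng.choice(3, size=200, p=pi)
mus = np.array([[-.7, .7, 0], [.3, .3, .8]])
covars = np.stack(([[.005, .05], [.05, .8]],
                   [[.005, -.05], [-.05, .8]],
                   [[0.002, 0], [0, 0.002]]), axis=2)
X = np.array([rng.multivariate_normal(mus[:, z], covars[:, :, z]) for z in zs])


def summarize(name, y_true, y_pred):
    """Report the number of predicted clusters and the adjusted Rand index."""
    n_clusters = len(np.unique(y_pred))
    ari = adjusted_rand_score(y_true, y_pred)
    print(f"{name}: {n_clusters} clusters, ARI = {ari:.3f}")
    return n_clusters, ari


# Baseline: KMeans, which is told the true number of clusters.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Stand-in for AutoGMM: fit full-covariance GMMs for k = 1..10, keep the lowest BIC.
candidates = [
    GaussianMixture(n_components=k, covariance_type="full",
                    n_init=3, random_state=0).fit(X)
    for k in range(1, 11)
]
best_gmm = min(candidates, key=lambda gm: gm.bic(X))
gmm_labels = best_gmm.predict(X)

kmeans_summary = summarize("KMeans (k=3 given)", zs, kmeans_labels)
gmm_summary = summarize("GMM + BIC sweep", zs, gmm_labels)
```

Calling `summarize("AutoGMM", zs, labels_autogmm_erratic)` on the labels from Step 3 would report the same two quantities for the run described in this issue.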

Your Environment

Python version:
graspologic version:
