A Metric For Assessing The Quality of Low-Rank Models

A generalized approach to computing the Coherence of a low-rank model provides an objective measurement of its quality.

NOTE: THIS REPO IS IN THE PROCESS OF BEING CLEANED, APOLOGIES

Background

A low-rank model (LRM, e.g. PCA) is frequently used as one stage of a larger pipeline for preparing a cluster model.

Because fitting an LRM is an unsupervised learning task, the model cannot be assessed against a label vector; it must instead be measured against some intrinsic quality of the data.

Coherence

Semantic Coherence is a metric specific to the domain of topic modeling. We propose a more general metric, Coherence, that can be applied to a broader category of LRMs.

The Metric: Coherence

Low-rank models typically generate a loadings matrix, L, as a by-product of the model fitting process.
The values in a column of the loadings matrix represent the expression of each original feature vector in the corresponding low-rank model column.
These values can be used to compute a scaled mutual information for each pair of vectors.
A sum taken over all pairs is an intrinsic measurement of the mutual information in the given column vector. We call this sum coherence.
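
The description above can be sketched in a few lines of Python. This is only one possible reading, not the implementation in this repo: the equal-width binning, the `|l_i * l_j|` scaling of each pair, and the names `discretize`, `mutual_information`, and `coherence` are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def discretize(values, bins=4):
    """Map real values to equal-width bin indices (assumed preprocessing)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y):
    """Mutual information, in nats, between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def coherence(features, loadings_column, bins=4):
    """Sum of loading-scaled pairwise mutual information for one component.

    features: the original feature vectors, one list per feature.
    loadings_column: one loading per feature, i.e. one column of L.
    The |l_i * l_j| scaling is an assumption, not taken from the repo.
    """
    binned = [discretize(f, bins) for f in features]
    total = 0.0
    for i, j in combinations(range(len(features)), 2):
        weight = abs(loadings_column[i] * loadings_column[j])
        total += weight * mutual_information(binned[i], binned[j])
    return total


# Toy data: the first two features are identical, the third is unrelated.
features = [
    [0, 1, 2, 3, 4, 5, 6, 7],
    [0, 1, 2, 3, 4, 5, 6, 7],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
redundant = coherence(features, [1, 1, 0], bins=2)  # loads on the identical pair
unrelated = coherence(features, [1, 0, 1], bins=2)  # loads on an independent pair
```

Under these assumptions, a component that loads on features sharing information scores higher than one that loads on unrelated features, which is the behavior the metric is meant to capture.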

Results

Preliminary results on a small data set suggest that cluster models built from LRM vectors selected by coherence perform better than models built by maximizing explained variance.
