lmcinnes / umap

Uniform Manifold Approximation and Projection

What is the best way to regularize supervised UMAP?

idekany opened this issue

I am working on a regression problem where I am attempting to use UMAP for supervised feature embedding, and then use the resulting low-dimensional embeddings as input variables for a subsequent regression model.

Using the L2 target metric, UMAP successfully transforms my very high-dimensional sparse feature matrix into a low-dimensional one for the training set. Visualizations of the resulting embeddings show a nice, smooth variation of the target variable with the UMAP features. However, when I embed my validation set using the trained UMAP model, the model generalizes poorly to the unseen data, which looks like a typical overfitting problem.

I tried to perform hyperparameter optimization of the composite model (UMAP + the regression model on top of it) by maximizing the R2 score of the downstream regression. By tweaking several hyperparameters of the UMAP model, I have so far been unable to achieve a good bias-variance tradeoff for the embedding. The documentation also does not seem to address how to regularize supervised UMAP.
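
For reference, here is a minimal sketch of the kind of composite workflow I mean, assuming scikit-learn conventions (a Pipeline forwards y to UMAP's fit_transform, so the embedding step is supervised; the data, parameter grid, and regressor choice are placeholders):

```python
# Hypothetical sketch: supervised UMAP + a regressor as one tunable pipeline.
import umap
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

# Stand-in data; substitute the real high-dimensional feature matrix.
X, y = make_regression(n_samples=500, n_features=200, noise=0.1, random_state=0)

pipe = Pipeline([
    ("embed", umap.UMAP(target_metric="l2", n_components=2, random_state=0)),
    ("reg", KNeighborsRegressor()),
])

# Pipeline.fit passes y to UMAP.fit_transform, so the embedding is supervised.
search = GridSearchCV(
    pipe,
    param_grid={
        "embed__n_neighbors": [15, 50],
        "embed__target_weight": [0.2, 0.5],  # lower = weaker label influence
    },
    scoring="r2",
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```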

I would be extremely grateful for any advice on which hyperparameters to focus on in order to achieve better generalization from supervised UMAP.

I don't have a good answer. Since nobody else has answered & I like the question, I thought I'd follow up with a bad answer and an offer to try some things out.

Clarifying the question: it isn't 100% clear to me what terms like "bias," "variance," and "overfitting" should mean in the context of supervised embedding without an underlying data model. Based on your description of what you're looking for in your visualization, I'm going to guess that you basically want the combined workflow of (UMAP) + (KNN regression) to not overfit - that is, if KNN regression fails on your test set, the training labels shouldn't misleadingly vary smoothly with the embedding!

I don't think there is anything in UMAP that explicitly tries to solve this regularization problem. Here are some elements of UMAP that provide some regularization for similar-looking problems:

  1. Quick-and-dirty: the most relevant hyperparameters are metric_scale in discrete_metric_simplicial_set_intersection (and also far_dist in the same function). These control how much the labels matter; e.g. setting metric_scale = 0 ignores the labels entirely (and thus presumably eliminates overfitting completely, if you accept my definition above). I've used this in the past with OK results, but sometimes you need to essentially ignore the labels to avoid overfitting. Is this what you tweaked? Could you at least get it to not overfit this way, even if the resulting model was "bad"? (A short sketch of sweeping the corresponding constructor-level knob appears after this list.)

  2. Quick-and-dirty with a model: the quick-and-dirty approach basically assigns a universal confidence score to every point (the single hyperparameter metric_scale). An alternative is to model the confidence at each point directly. The following tutorial does this for a different notion of confidence (embedding dimensions 3-5, corresponding to a local estimate of a covariance matrix) using "Gaussian energy": https://umap-learn.readthedocs.io/en/latest/embedding_space.html. You could extend this to labels by including a model that embeds each point as a vector of the form (embedded_point, confidence) and replacing the function gaussian_energy_gradient in the file "distances.py".

  3. Tweaking things that aren't explicit parameters: after thinking about this, I tried a split on the training data as follows - mask all but 30% of the labels when training semisupervised UMAP, then use the remaining 70% when training the downstream model (logistic regression in my quick test). Of course you lose some data efficiency by splitting, but this shouldn't overfit on the labels, and (based on very little experimentation) you don't seem to lose much. (A sketch of this split also appears after the list.)
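
(A sketch for point 1 above.) The constructor-level knob closest in spirit to metric_scale is target_weight; exactly how it maps onto metric_scale/far_dist depends on the target metric, so treat this as a hypothetical starting point rather than a precise equivalence:

```python
# Hypothetical sketch: sweeping how strongly labels shape the embedding.
# Assumption: target_weight is the public knob closest in spirit to
# metric_scale (0.0 reduces to plain unsupervised UMAP).
import umap
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=200, random_state=0)

for w in (0.0, 0.25, 0.5, 0.75):
    emb = umap.UMAP(target_metric="l2", target_weight=w,
                    random_state=0).fit_transform(X, y=y)
    # ...fit and score a downstream regressor on held-out data here...
```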
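
(A sketch for point 3 above.) A hypothetical version of the masking trick, using the documented convention that a label of -1 marks a point as unlabeled for categorical targets (a continuous target would need a different masking scheme):

```python
# Hypothetical sketch of the label split: 30% of labels go to semisupervised
# UMAP, the remaining 70% to the downstream model. Uses the documented
# convention that y == -1 means "unlabeled" for categorical targets.
import numpy as np
import umap
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

rng = np.random.default_rng(0)
umap_share = rng.random(len(y)) < 0.30      # ~30% of labels kept for UMAP
y_masked = np.where(umap_share, y, -1)      # hide the rest from UMAP

reducer = umap.UMAP(random_state=0).fit(X, y=y_masked)
model = LogisticRegression(max_iter=1000).fit(
    reducer.embedding_[~umap_share], y[~umap_share]
)
```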

If you wanted to dive into the code a bit more, there are a few places where people have worked on regularization strategies that would be easy to add:

  1. There is a large body of work on regularization for neural networks. None of these tricks is currently implemented for parametric UMAP, but many of them would be fairly easy to add. The simplest would be a penalty on the weights of the neural network - this exists as a standard option in keras/pytorch/etc., and would involve tweaks around lines 565-570 of https://github.com/lmcinnes/umap/blob/master/umap/parametric_umap.py. Other methods, such as dropout, could also be incorporated. (See the encoder sketch after this list.)

  2. Continuing with parametric UMAP, the original "denoising autoencoder trick" could also be inserted into the autoencoder.
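
(A sketch for point 1 of this list.) One way to get a weight penalty and dropout without editing parametric_umap.py is to pass a custom Keras encoder, which ParametricUMAP accepts; the layer sizes and penalty strengths below are arbitrary placeholders:

```python
# Hypothetical sketch: regularizing parametric UMAP by passing a custom
# Keras encoder (L2 weight penalty + dropout) instead of editing the library.
import tensorflow as tf
from umap.parametric_umap import ParametricUMAP

n_features, n_components = 200, 2

encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(
        128, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # weight penalty
    tf.keras.layers.Dropout(0.2),                            # dropout
    tf.keras.layers.Dense(n_components),
])

embedder = ParametricUMAP(encoder=encoder, dims=(n_features,))
# embedding = embedder.fit_transform(X)  # X: (n_samples, n_features)
```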

None of these approaches directly regularize the thing you're looking for. Some other approaches could easily be tweaked to do this (e.g. the fully model-based approach https://arxiv.org/abs/2304.07658), but the approaches I'm aware of in this direction are very slow.

I'm interested in this question but don't check GitHub very often. If you feel like following up (especially if there is a version of the problem that you can share), my very-public email address is username asmi28, domain uottawa CA. I'm happy to write up some notebooks and post them if that sounds interesting.