vin136 / uncertainty-aware


uncertainty-estimates

Objective

A project to find a practical approach for quantifying the uncertainty of model predictions in deep neural networks. Training an auxiliary generative model doesn't make the cut, as it introduces a completely new problem (accidental complexity). The objective is to find a simple approach that can be widely used in industry. This is crucial for deploying ML/NN-based models in high-stakes environments, where we need to know when we can't trust the model outputs.

Background

Naively using softmax probabilities to detect or quantify uncertainty does not work; this has already been shown. When fed an unseen/out-of-distribution data point, softmax probabilities are typically still high and statistically indistinguishable from those of in-sample points.

Fate of naively using softmax outputs

The plot below shows histograms of the entropies of a ResNet18 trained on MNIST, evaluated on three data groups: unseen MNIST samples, FashionMNIST samples, and ambiguous MNIST samples (generated via a VAE; essentially images that cannot be placed into a unique class, like a digit that looks both like a 3 and a 9).

[Figure: histograms of model entropies on unseen MNIST, FashionMNIST, and ambiguous MNIST samples]
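
For reference, a minimal sketch of the max-softmax confidence and the predictive entropy that histograms like the one above are built from. `model` (any classifier returning logits of shape (batch, classes)) and `x` (a batch of inputs) are placeholders:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def softmax_stats(model, x):
    """Max-softmax confidence and predictive entropy per sample."""
    probs = F.softmax(model(x), dim=-1)                    # (batch, classes)
    confidence = probs.max(dim=-1).values                  # naive certainty score
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return confidence, entropy

# Feeding OOD batches (e.g. FashionMNIST to an MNIST model) typically still
# yields high confidences, and the entropy histogram overlaps the
# in-distribution one -- which is exactly the failure shown above.
```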

I will spare further details as this is a work in progress, but there exists a simple approach that can not only identify model uncertainty but also disambiguate between aleatoric and epistemic uncertainty. The plot below shows that, on the same samples, a simple method can separate same-distribution but out-of-sample points from different-distribution points.

[Figure: the same three groups of samples separated by the simple method]

Potential Questions to Dig Into

  • How do we make this work for regression tasks? (Can we still disambiguate between uncertainty types there?)
  • Effect of architectural changes: using transformers as a building block and testing on various tasks in NLP, CV, etc.
  • Use this as a novelty detector to direct exploration in RL, rather than the typical entropy bonuses.

Research scratch pad (work in progress)

Below are notes to myself as I explore this problem.

Step-1: Literature Survey

ref 1: Safer Classification by Synthesis

Synthesis

  • The probability values of discriminative classifiers can be high even when the input comes from out of distribution. This can be quantified by plotting coverage vs. risk (by varying the classification threshold): ideally, once a sufficiently large threshold is chosen, the misclassification rate should be zero. (A sketch of this coverage/risk computation follows the figure below.)
  • IDEA: Generative classification. Discriminative classification can be seen as projecting the data onto a manifold such that separating hyperplanes neatly segregate the classes, which makes it clear how even an out-of-distribution sample can receive a high-confidence prediction. Thus, we can try to model the distance from the boundary directly to quantify the confidence.
  • Say we have a class-conditional generative model M. For a given test image x, find the image from each class that is closest to x (under some distance metric like L2) via the generative model, and label x with the class of highest similarity (least distance). A threshold on this distance determines whether we assign a class label at all. This might let us avoid the case shown below.
  • Conclusion: use the generative model for thresholding, and a CNN to predict labels on the accepted images. This gives us accuracy while shielding us from making predictions on out-of-sample data points.

[Figure: the failure case referenced above]
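
A minimal NumPy sketch of the coverage-vs-risk computation mentioned in the first bullet above; the array names `confidence` and `correct` are placeholders:

```python
import numpy as np

def risk_coverage_curve(confidence, correct):
    """Coverage vs. risk as the confidence threshold is swept.

    confidence: (n,) scores, e.g. max-softmax values.
    correct:    (n,) booleans, True where the prediction was right.
    Returns one (coverage, risk) pair per threshold.
    """
    order = np.argsort(-np.asarray(confidence))          # most confident first
    errors = 1.0 - np.asarray(correct, dtype=float)[order]
    kept = np.arange(1, len(errors) + 1)
    coverage = kept / len(errors)                        # fraction not rejected
    risk = np.cumsum(errors) / kept                      # error rate among kept
    return coverage, risk
```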

Implementation concerns

  • Monte Carlo sampling and L-BFGS are needed to find the optimal image for each class; see the sketch below.
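
A rough sketch of what such a search could look like, assuming a class-conditional decoder `G(z, y)` (hypothetical, e.g. a conditional VAE decoder taking a latent z and a class index y) and PyTorch's L-BFGS. This is only an illustration, not the paper's exact procedure:

```python
import torch

def synthesis_distance(G, x, y, latent_dim=64, restarts=5):
    """L2 distance from test image x to the closest image of class y
    reachable through the decoder G(z, y). latent_dim and restarts are
    assumptions of this sketch."""
    best = float("inf")
    for _ in range(restarts):                       # crude Monte Carlo restarts
        z = torch.randn(1, latent_dim, requires_grad=True)
        opt = torch.optim.LBFGS([z], max_iter=20)

        def closure():
            opt.zero_grad()
            loss = ((G(z, y) - x) ** 2).sum()       # reconstruction distance
            loss.backward()
            return loss

        opt.step(closure)
        with torch.no_grad():
            best = min(best, ((G(z, y) - x) ** 2).sum().item())
    return best

def classify_by_synthesis(G, x, num_classes, reject_threshold):
    dists = [synthesis_distance(G, x, y) for y in range(num_classes)]
    best_class = min(range(num_classes), key=lambda c: dists[c])
    return best_class if dists[best_class] < reject_threshold else None  # None = abstain
```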

Missing Pieces / Things to Think About

  • Is the additional complexity warranted? Can we use simpler approaches?
  • Generating images/samples is not needed for the task at hand; we could fit the generative model on lower-dimensional projection vectors instead.
  • Decoupling objectives is typically a bad idea in deep learning. Can we frame a single network/objective that performs both novelty detection and classification? Note that the paper is quite old.
  • Find some real-world data with distribution shift to ultimately benchmark the approach, and think about/quantify exactly what is meant by out-of-distribution and by uncertainty (epistemic vs. aleatoric). Bayesian NNs seem to be a direct candidate here. Idea: since discriminative classification doesn't explicitly use a distance metric, we can train networks with an explicit distance metric for classification and use thresholding to detect out-of-distribution samples (see the sketch after this list). Papers that do exactly this: Deterministic Neural Networks with Inductive Biases Capture Epistemic and Aleatoric Uncertainty, and Pitfalls of OOD detection.
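
One way to make the "explicit distance metric" idea concrete (a sketch under my own assumptions, not the method of the papers cited above): a prototype-style classification head whose logits are negative squared distances to learned class centroids, so the same distance can later be thresholded for OOD rejection.

```python
import torch
import torch.nn as nn

class DistanceHead(nn.Module):
    """Classification head whose logits are negative squared distances
    to learned per-class prototypes (placeholder design for this sketch)."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats):                             # feats: (batch, feat_dim)
        dist2 = torch.cdist(feats, self.prototypes) ** 2  # (batch, num_classes)
        return -dist2                                     # train with cross-entropy as usual

# At test time, -max(logit) is the distance to the nearest prototype;
# reject as OOD when it exceeds a threshold fit on held-out in-distribution data.
```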

Ideally we would want to separate samples into (a) unseen, out-of-distribution samples (epistemic uncertainty) and (b) samples too ambiguous to place into a single class (aleatoric uncertainty). This would play a big role in debugging and improving models. Moreover, generative models aren't great at identifying OOD samples; see this.

NOTES

On Simpler Approaches

  • Train an ensemble of models (which can be done within a single model with multiple heads) and use the standard deviation across members as a novelty threshold and the average prediction for the class label; see the sketch below.
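
A minimal sketch of that recipe, assuming a list `models` of independently trained classifiers (or of heads sharing a backbone); the names and the threshold are placeholders:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_predict(models, x, novelty_threshold):
    """Average the member probabilities for the label, use their spread
    as a novelty score."""
    probs = torch.stack([F.softmax(m(x), dim=-1) for m in models])  # (M, batch, classes)
    mean = probs.mean(dim=0)
    spread = probs.std(dim=0).max(dim=-1).values     # high = members disagree
    labels = mean.argmax(dim=-1)
    is_novel = spread > novelty_threshold
    return labels, is_novel
```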

ref 2: A survey on uncertainty estimation

ref 3: NeurIPS Shifts Challenge

Here are some cool empirical studies:

  • Tabular data: Paper and Code
  • Image data: Paper
  • Guidelines to tackle real-world shifts

ref 4: Collection of real-world datasets with distribution shift

Final Experimental Setup

Datasets: MNIST (training data), DirtyMNIST (in-distribution but ambiguous samples), FashionMNIST (for OOD samples). Perform temperature scaling as done here to compare log-likelihoods between models.

Baseline: Deep Ensembles (multiple models, possibly trained with adversarial training). This is SOTA in terms of uncertainty quantification, with two main bottlenecks:

1. It doesn't really differentiate between ambiguous inputs (aleatoric) and clear out-of-sample inputs.
2. Training cost.
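
Temperature scaling itself only needs validation logits and labels. A minimal sketch of the standard recipe of fitting a single scalar T by minimising NLL; `val_logits` and `val_labels` are assumed to be pre-collected from the frozen model:

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, max_iter=50):
    """Fit a scalar T minimising NLL of softmax(logits / T) on a validation set."""
    log_t = torch.zeros(1, requires_grad=True)       # optimise log T so that T > 0
    opt = torch.optim.LBFGS([log_t], max_iter=max_iter)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()

# Usage: T = fit_temperature(val_logits, val_labels)
#        calibrated = F.softmax(test_logits / T, dim=-1)
```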

Can we have a simpler single model?

Naively trained discriminative classifiers cannot do classification and out-of-sample detection both efficiently. The reasons, and a good first approach, are given here: Deterministic Deep Learning via Distance Awareness. It needs only a single forward pass but is a considerable departure from typical training. Here's a far simpler approach: code, paper. Implement it and compare against this model.
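
One concrete shape such a simpler single model could take (a sketch in the spirit of the DDU-style paper referenced earlier, not its exact recipe): fit class-conditional Gaussians on the penultimate-layer features of a normally trained network, use the feature-space density as an epistemic score, and let softmax entropy cover the aleatoric side.

```python
import torch

def fit_class_gaussians(features, labels, num_classes, jitter=1e-4):
    """Per-class means and a shared (pooled) covariance over penultimate
    features. `features` (n, d) and `labels` (n,) are assumed to come from
    the in-distribution training set."""
    d = features.shape[1]
    means, cov = [], torch.zeros(d, d)
    for c in range(num_classes):
        fc = features[labels == c]
        mu = fc.mean(dim=0)
        means.append(mu)
        cov += (fc - mu).T @ (fc - mu)
    cov = cov / len(features) + jitter * torch.eye(d)    # regularise for stability
    return torch.stack(means), cov

def epistemic_score(feat, means, cov):
    """Negative log feature density of a single feature vector (higher = more likely OOD)."""
    gmm = torch.distributions.MultivariateNormal(means, covariance_matrix=cov)
    log_probs = gmm.log_prob(feat.unsqueeze(0).expand(len(means), -1))
    return -torch.logsumexp(log_probs, dim=0).item()

# Softmax entropy of the same network then flags ambiguous but
# in-distribution inputs (the aleatoric side).
```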

Tools

We will be using the following tools for implementation.

Pytorch-lightning

Weights-Biases

Streamlit

For a good tutorial on deep learning and PyTorch, refer to Alex Smola's book.

About

License: MIT License


Languages

Jupyter Notebook 92.6%, Python 7.3%, Makefile 0.1%