Doubt Reason Based on Entropy
koaning opened this issue · comments
If a machine learning model is very "confident" then the proba scores will have low entropy. The most uncertain outcome is a uniform distribution which would contain high entropy. Therefore, it could be sensible to add entropy as a reason for doubt.
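To make the intuition concrete, here's a small numpy sketch (the function name is illustrative, not part of any library):

```python
import numpy as np

def shannon_entropy(probas):
    """Shannon entropy of a probability vector; 0 means fully confident."""
    p = np.clip(np.asarray(probas, dtype=float), 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

confident = [0.97, 0.02, 0.01]   # peaked -> low entropy
uniform = [1/3, 1/3, 1/3]        # uniform -> maximal entropy, log(3)
```

For three classes the uniform case gives log(3) ≈ 1.099, the maximum possible, while the peaked case stays close to zero.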
I wonder ... what's a reasonable threshold here?
I see two ways to use this:
- Return all predictions with a high uncertainty; E > T1
- Return all predictions with a high certainty, that don't match the dataset label; argmax(P) != Y, E < T2
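The two selection rules above could be sketched like this (function names and the entropy computation are my own, purely illustrative):

```python
import numpy as np

def _row_entropy(probas):
    p = np.clip(probas, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=1)

def high_uncertainty(probas, t1):
    """Rule 1: flag rows whose entropy exceeds t1 (E > T1)."""
    return _row_entropy(probas) > t1

def confident_mismatch(probas, y, t2):
    """Rule 2: flag confident predictions that disagree with the
    dataset label (argmax(P) != Y and E < T2)."""
    return (np.argmax(probas, axis=1) != y) & (_row_entropy(probas) < t2)
```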
I've been thinking about your question about the threshold, but I haven't been able to figure out a reasonable value. I've been combing through some literature related to this, but when such a threshold is used, it is typically just a hyperparameter that is tuned, without a theoretical argument behind it.
One thing that might help is to use the Normalized Shannon Entropy, since entropy values for distributions with different numbers of classes are difficult to compare. A method that I could see working would be to determine the threshold relative to the entropy distribution of the dataset. The first thing that comes to mind would be to consider the lowest/highest percentiles, although I think there are more clever tricks available.
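A sketch of the normalized variant, which divides by log(n_classes) so every value lands in [0, 1] (the function name is hypothetical, not the library's API):

```python
import numpy as np

def normalized_entropy(probas):
    """Shannon entropy scaled by log(n_classes), so each row lies in
    [0, 1] regardless of how many classes the model predicts."""
    probas = np.asarray(probas, dtype=float)
    p = np.clip(probas, 1e-12, 1.0)
    ent = -np.sum(p * np.log(p), axis=1)
    return ent / np.log(probas.shape[1])

# Threshold relative to the dataset's own entropy distribution,
# e.g. flag everything above the 95th percentile:
# cutoff = np.quantile(normalized_entropy(probas), 0.95)
```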
Normalized entropy, as described here, seems like a sound idea! Thanks for the mention 👍 I think I'm fine with keeping the threshold as a hyperparameter in this entropy-reason if that prevents adding an assumption to the stack. I think it'd be good to gather feedback anyway.
Return all predictions with a high certainty, that don't match the dataset label; argmax(P) != Y, E < T2
I'm wondering ... is this something best addressed via WrongPredictionReason? We may want to add a hyperparameter there for this use-case.
Hi!
I created a PR for version 1 of the entropy reason here. I went for a threshold of 0.5, just because it worked well for the iris dataset. 0.2 would have produced way too many non-zeros.
Best
Robert
Another way to tackle the "wtf should the threshold be" problem: maybe we can specify a quantile instead of an absolute threshold like 0.5. That is, we specify some quantile alpha and then flag only the share alpha of samples with the highest normalized Shannon entropies.
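A minimal sketch of what the quantile rule would look like (the function name and alpha default are made up for illustration):

```python
import numpy as np

def flag_top_quantile(norm_entropies, alpha=0.05):
    """Flag the alpha fraction of samples with the highest normalized
    entropy, instead of applying a fixed absolute cutoff."""
    norm_entropies = np.asarray(norm_entropies, dtype=float)
    cutoff = np.quantile(norm_entropies, 1 - alpha)
    return norm_entropies >= cutoff
```

The nice property is that the number of flagged samples is predictable up front: roughly alpha times the dataset size.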
I'm wondering ... is this something best addressed via WrongPredictionReason? We may want to add a hyperparameter there for this use-case.
We also have the ShortConfidence reason and the LongConfidence reason.
Maybe we can specify a quantile instead of an absolute threshold like 0.5.
Part of me likes the idea. But I'm worried that we may introduce a lot of hyperparams and that at the moment it's unclear how much more useful doubt based on entropy will be compared to the margin-based reason.
I think it's possible to use the Hoover index instead of entropy: it's easier to compute, it is always in the 0-1 range, and it has a clear interpretation (0 = equality/uniformity, 1 = inequality).
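For a probability row compared against the uniform distribution, the Hoover index is half the total absolute deviation from 1/n. A sketch (function name is my own; note that a one-hot row gives (n-1)/n, so the value approaches 1 as the class count grows):

```python
import numpy as np

def hoover_index(probas):
    """Hoover index of each row against the uniform distribution:
    0 = uniform (maximal doubt), near 1 = all mass on one class."""
    probas = np.asarray(probas, dtype=float)
    n = probas.shape[1]
    return 0.5 * np.sum(np.abs(probas - 1.0 / n), axis=1)
```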
There is also a bigger problem with this approach in a multiclass setting: assume you have 4 classes. If your probas are 0.25-0.25-0.25-0.25, then the entropy/uniformity measure will correctly flag them, but if you have something like 0-0.5-0.5-0 it will fail, even though this sample could still be mislabeled. The problem becomes even more severe with more classes. A straightforward solution would be to use a one-vs-rest scheme.
I'm wondering ... can we come up with a situation where entropy-based doubt can address issues that the other reasons cannot?
Fixed by #24