Lightning-AI / torchmetrics

Torchmetrics - Machine learning metrics for distributed, scalable PyTorch applications.

Home Page: https://lightning.ai/docs/torchmetrics/

Behaviors of AUROC and Average Precision are inconsistent when all labels are equal

weihua916 opened this issue · comments

πŸ› Bug

When all labels are equal (either all zeros or all ones), the current implementations of AUROC and AveragePrecision behave quite differently:
When labels are all ones, AUROC returns 0, while AveragePrecision returns 1.
When labels are all zeros, AUROC returns 0, while AveragePrecision returns NaN.

I think it would be better to add a flag so that both metrics return NaN when all labels are equal, to better inform users.

To Reproduce

>>> from torchmetrics import AUROC, AveragePrecision
>>> import torch
>>> auroc = AUROC(task="binary")
>>> ap = AveragePrecision(task="binary")
>>> preds = torch.randn(10)
>>> labels = torch.ones(10, dtype=torch.long)
>>> auroc(preds, labels)
/opt/homebrew/anaconda3/lib/python3.9/site-packages/torchmetrics/utilities/prints.py:43: UserWarning: No negative samples in targets, false positive value should be meaningless. Returning zero tensor in false positive score
  warnings.warn(*args, **kwargs)  # noqa: B028
tensor(0.)
>>> ap(preds, labels)
tensor(1.)
>>> labels = torch.zeros(10, dtype=torch.long)
>>> auroc(preds, labels)
/opt/homebrew/anaconda3/lib/python3.9/site-packages/torchmetrics/utilities/prints.py:43: UserWarning: No positive samples in targets, true positive value should be meaningless. Returning zero tensor in true positive score
  warnings.warn(*args, **kwargs)  # noqa: B028
tensor(0.)
>>> ap(preds, labels)
tensor(nan)

Expected behavior

When labels are all equal, both metrics should return NaN.
At the very least, there could be a flag such as equal_label_mode:

>>> ap = AveragePrecision(task="binary", equal_label_mode="nan")

that gives the expected behavior.
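
A minimal workaround sketch with current torchmetrics (the equal_label_mode flag above is only a proposal and does not exist): check whether both classes are present before computing the metric, and return NaN otherwise.

import torch
from torchmetrics import AveragePrecision

def binary_ap_or_nan(preds: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: return binary average precision, or NaN when only one class is present.
    if labels.unique().numel() < 2:  # all labels equal -> the metric is ill-defined
        return torch.tensor(float("nan"))
    return AveragePrecision(task="binary")(preds, labels)

preds = torch.randn(10)
labels = torch.ones(10, dtype=torch.long)
print(binary_ap_or_nan(preds, labels))  # tensor(nan), since all labels are equal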

Environment

  • TorchMetrics version: 1.3.0.post0, installed via pip
  • Python & PyTorch version: Python 3.9.12, torch 2.1.0
  • OS: Linux

Hi @weihua916, thanks for raising this issue.
I created PR #2507 that is intended to close this issue. The intention behind our implementations is to match sklearn fairly closely. By this I mean:

  • AveragePrecision: when all labels are 1, sklearn returns a score of 1, which we also do.
  • AveragePrecision: when all labels are 0, sklearn returns a score of -0.0, whereas our implementation returns nan. That is not the intention, and PR #2507 will fix it to raise a user warning and return -0.0, similar to sklearn.
  • AUROC: sklearn fails completely both when all labels are 1 and when all labels are 0. We have instead chosen to raise user warnings that the scores in both cases are essentially undefined and to return the arbitrary score of 0. The reason is that other users have requested that metrics do not crash their code during training, which can also happen if the scores return nan. We have therefore chosen to go with a real, but arbitrary, score. (A small sklearn comparison is sketched below.)
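
For reference, here is a small sketch of the sklearn behavior described above (assuming scikit-learn is installed; exact warnings and error messages may vary by version):

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

preds = np.random.rand(10)

# All labels equal to 1: sklearn's average precision returns 1.0.
print(average_precision_score(np.ones(10), preds))   # 1.0
# All labels equal to 0: sklearn warns and returns -0.0.
print(average_precision_score(np.zeros(10), preds))  # -0.0 (with a warning)
# AUROC with only one class present: sklearn raises instead of returning a score.
try:
    roc_auc_score(np.ones(10), preds)
except ValueError as err:
    print(err)  # ROC AUC score is not defined when only one class is present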

Thank you for addressing the issue! For AUROC, I personally still believe nan is better, since it is easy to convert nan to 0 outside of torchmetrics. Currently, the arbitrary AUROC score of 0 may be confused with an actual score of 0.
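
For illustration, converting a NaN score back to 0 outside the metric is a one-liner (a sketch of the conversion mentioned above):

import torch
score = torch.tensor(float("nan"))
print(torch.nan_to_num(score))  # tensor(0.) -- NaN is replaced by 0.0 by default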

@weihua916 I do not necessarily disagree with you that AUROC should return nan and not 0; however, we had overwhelming feedback when the metric was first introduced that this behavior was preferred.

Understood. Thanks for your consideration!