huggingface / evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

Home Page: https://huggingface.co/docs/evaluate

evaluate/metrics/mean_iou computes recall (sensitivity) instead of IoU

FlorinAndrei opened this issue

There are several issues with the mean_iou code here:

https://github.com/huggingface/evaluate/blob/c447fc8eda9c62af501bfdc6988919571050d950/metrics/mean_iou/mean_iou.py

The most important is that it actually computes recall (sensitivity) instead of IoU. The root cause appears to be that the mask is computed from label only, but is then applied to both pred_label and label (lines 144-149):

mask = label != ignore_index
mask = np.not_equal(label, ignore_index)  # the mask is derived from label only
pred_label = pred_label[mask]             # ...but it is applied to the prediction as well
label = np.array(label)[mask]

intersect = pred_label[pred_label == label]

Because both pred_label and label are masked with pixels from label only, the result of the computation in that function is the ratio of intersection and label (recall), instead of the ratio of intersection and the union of prediction and label (IoU).
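To see the effect in isolation, here is a minimal sketch with hypothetical toy values (the pred + label - intersect arithmetic for the union is my assumption about how the per-class areas are combined after the quoted lines):

import numpy as np

# Tiny 1-D toy example (hypothetical values) showing the effect of the label-only mask.
# Class 1 is the object; class 0 is the background and also the ignore_index.
label = np.array([0, 0, 1, 1, 1, 1])   # ground truth: 4 object pixels
pred  = np.array([0, 1, 1, 1, 0, 0])   # prediction: 3 object pixels, 2 of them correct

ignore_index = 0

# What the quoted lines do: derive the mask from label only, then apply it to both arrays.
keep = label != ignore_index
pred_m, label_m = pred[keep], label[keep]
intersect = np.sum((pred_m == 1) & (label_m == 1))                # 2
union_m = np.sum(pred_m == 1) + np.sum(label_m == 1) - intersect  # 2 + 4 - 2 = 4 (the FP outside the label is gone)
print(intersect / union_m)                                        # 0.5

# The same quantities without the label-only mask:
tp = np.sum((pred == 1) & (label == 1))   # 2
fp = np.sum((pred == 1) & (label == 0))   # 1
fn = np.sum((pred == 0) & (label == 1))   # 2
print(tp / (tp + fp + fn))                # IoU    = 2/5 = 0.4
print(tp / (tp + fn))                     # recall = 2/4 = 0.5 -> this is what the masked ratio matches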

It's a subtle error that is hard to discover because both IoU and recall have values between 0 and 1, and both behave similarly in training.

The problem is that recall is always at least as high as IoU, so the reported numbers overestimate model performance. The unfortunate side effect is that I wasted a lot of time training a SegFormer model based on wrong assumptions.

I only discovered this because I wrote my own metric functions, starting from TP / TN / FP / FN, and then from those four values I computed Sørensen-Dice (a.k.a. the F1 score), precision, recall, and (on a whim) IoU. This is my code (it's not optimized and the function docstrings are wrong, but it works):

https://gist.github.com/FlorinAndrei/da9ab770b16bfc671075d04a030f548b

I was initially very confused when my IoU differed from evaluate/metrics/mean_iou. But then I noticed that my recall was identical to the "IoU" reported by evaluate/metrics/mean_iou. I've checked my code in a few different ways and I believe it is correct.

Here's a visual sample:

[Image: chart of training metrics, comparing eval/iou_lesion and eval/loss with the metrics computed by my code]

eval/iou_lesion is the result from evaluate/metrics/mean_iou. eval/loss is just the evaluation loss. The rest are computed by my code. eval/niou_lesion is IoU computed by my code. Notice how the library code produces identical results to the recall value from my code.

My code has only been tested with SegFormer, and only for datasets with a single class, plus background, where the label pixels are 1 and the background is 0. I have not tested it for multiclass segmentation. I have not tested reduce_labels = True.

@lvwerra @lhoestq @mariosasko @lewtun @dleve123 @NielsRogge

Hi,

Thanks for creating this issue and investigating! I've ported the mIoU metric directly from OpenMMLab's implementation, which can be found here.

It's almost a line-by-line copy, actually; I replaced PyTorch operations with NumPy ones, since evaluate is a framework-agnostic library. So we should probably check whether I made a mistake when porting it.

The lines of code you are referring to were taken from here. The binary mask is applied to both the predicted map and the ground truth map in order to remove pixels whose ground truth label should not be included in the metric calculation. During porting, I computed mIoU with both my implementation and theirs and verified that they returned the same result.

cc'ing @alaradirik here as well

I know a thorough verification is tedious, but take the route I took: compute the basic counts first, TP (within both label and prediction), TN (outside both), FP (within the prediction but outside the label), FN (within the label but outside the prediction), and then from those derive the more complex metrics such as IoU, Dice, precision, and recall. You will then notice that your implementation actually matches recall exactly.

IoU = TP / (TP + FP + FN)

recall = TP / (TP + FN)
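A minimal sketch of that verification route, assuming binary masks where 1 marks the object class (this is an illustration, not the gist linked above):

import numpy as np

def confusion_counts(pred, label):
    # TP/TN/FP/FN for a binary segmentation (1 = object, 0 = background)
    pred = np.asarray(pred).astype(bool)
    label = np.asarray(label).astype(bool)
    tp = np.sum(pred & label)
    tn = np.sum(~pred & ~label)
    fp = np.sum(pred & ~label)
    fn = np.sum(~pred & label)
    return tp, tn, fp, fn

def metrics_from_counts(tp, tn, fp, fn, eps=1e-12):
    # derive the more complex metrics from the four basic counts
    return {
        "iou":       tp / (tp + fp + fn + eps),
        "dice":      2 * tp / (2 * tp + fp + fn + eps),
        "precision": tp / (tp + fp + eps),
        "recall":    tp / (tp + fn + eps),
    }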

Copy/paste only generates correct code if the source is also correct.

Let's compare the library code with my code.

Generate a prediction and a ground truth mask (label). Each takes up half of the image, and they overlap over only 1/4 of the total image. Yellow marks the prediction (left) and the mask (right); purple is the background.

import numpy as np
import evaluate
from matplotlib import pyplot as plt

pred = np.zeros((200, 200), dtype=np.int8)
mask = np.zeros((200, 200), dtype=np.int8)

pred[100:, ...] = 1   # prediction: bottom half of the image
mask[..., 100:] = 1   # ground truth: right half of the image

f, axs = plt.subplots(1, 2)
axs[0].imshow(pred)
axs[1].imshow(mask)

[Image: the prediction (left, bottom half yellow) and the ground truth mask (right, right half yellow)]

It's obvious that IoU is 0.333 (one 100x100 quadrant of overlap over three quadrants of union), while recall is 0.5. Let's check the library code, and my code containing the IoU fix (which lives in a separate module I called mean_dice but also produces IoU and other metrics):

metric_orig = evaluate.load(
    "evaluate/metrics/mean_iou",   # the library's metric (local checkout)
)
metric_dice = evaluate.load(
    "evaluate/metrics/mean_dice",  # my fixed metric, in a separate local module
)

metrics_orig = metric_orig._compute(
    predictions=[pred],
    references=[mask],
    num_labels=2,
    ignore_index=0,
    reduce_labels=False,
)
metrics_dice = metric_dice._compute(
    predictions=[pred],
    references=[mask],
    num_labels=2,
    ignore_index=0,
    reduce_labels=False,
)

print(f"IoU library:    {metrics_orig['per_category_iou'][1]}")
print(f"IoU fixed code: {metrics_dice['per_category_niou'][1]}")

Results:

IoU library:    0.5
IoU fixed code: 0.3333333333333333
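As an independent cross-check (not part of the original comparison, and assuming scikit-learn is installed), the same pred and mask arrays can be fed to sklearn's jaccard_score and recall_score:

from sklearn.metrics import jaccard_score, recall_score

y_true = mask.reshape(-1)   # ground truth, flattened
y_pred = pred.reshape(-1)   # prediction, flattened

print(jaccard_score(y_true, y_pred))   # 0.333... -> matches the fixed code
print(recall_score(y_true, y_pred))    # 0.5      -> matches what the library reports as IoU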

@NielsRogge

Thanks for the detailed report. So if you compute the metric with ignore_index = 0, all pixels that are labeled 0 in the ground truth segmentation map (oftentimes, the "background" class) are excluded.

[Image: the prediction (left) and the ground truth segmentation map (right) from the example above]

Hence, in the figure above, the ground truth segmentation map is on the right, and due to ignore_index=0, we will only look at the right (yellow) side of that map. As we can see, half of that (50%) is predicted correctly. IoU is defined as TP / (TP + FP + FN) or also (area of overlap)/(area of union). In this figure, the area of overlap is 100x100 pixels, and the area of union is the yellow region, so 100x200 pixels. Hence IoU is calculated as (100x100)/(100x200) = 0.5. This is why our implementation (and the way I understand OpenMMLab also calculates it) returns 0.5.

If ignore_index were not set, then IoU would be (area of overlap)/(area of union) = (100x100 + 100x100)/(200x200) = 0.5.
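For concreteness, here is a rough trace of that masking step on the pred and mask arrays from the earlier snippet (a sketch of the effect only, not the actual library code path; the pred + label - intersect arithmetic for the union is an assumption about how the per-class areas are combined downstream):

keep = mask != 0                                # ignore_index=0: keep only pixels where the *label* is 1 (the right half)
pred_kept, label_kept = pred[keep], mask[keep]  # 20_000 pixels each

intersect = np.sum((pred_kept == 1) & (label_kept == 1))                    # 10_000 -> the bottom-right quadrant
union_kept = np.sum(pred_kept == 1) + np.sum(label_kept == 1) - intersect   # 10_000 + 20_000 - 10_000 = 20_000
print(intersect / union_kept)                                               # 0.5 -- the predicted pixels outside the label were masked away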

Intersection and union are defined between two sets of pixels: the prediction, and the label.

The intersection between prediction and label is the set of pixels that belong to both prediction and label. It is, indeed, the 100x100 square on the bottom right.

The union of prediction and label is the set of pixels that belong to either the prediction or the label. In other words, looking at both prediction and label, any pixel that is yellow in either one of them belongs to the union. It is therefore three quarters of the total image, i.e. three 100x100 squares.
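Computed directly on the pred and mask arrays from the snippet above (a quick sketch, not part of either implementation):

union = np.logical_or(pred == 1, mask == 1)
intersection = np.logical_and(pred == 1, mask == 1)
print(union.sum())                        # 30_000 -> three of the four 100x100 quadrants
print(intersection.sum())                 # 10_000 -> the bottom-right quadrant
print(intersection.sum() / union.sum())   # 0.333...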

Please refer to the definition of IoU / Jaccard index:

https://en.wikipedia.org/wiki/Jaccard_index

I believe this wiki diagram explains it quite well:

[Image: Wikipedia's "Intersection over Union" visual-equation diagram]

In other words, when looking at the union of two sets, elements from both sets need to be considered.

To take it to an extreme, if one set was the left half of the image, and the other set was the right half of the image, their union would be the whole image (and their intersection would be zero).

Here's an example of IoU defined correctly in the context of image segmentation, from the LearnOpenCV site:

https://learnopencv.com/intersection-over-union-iou-in-object-detection-and-segmentation/

Another thing to think about: union is commutative. In other words, the union of A and B is the same as the union of B and A. The way you explain it, it is not commutative. That should raise a red flag immediately.

I am not familiar with OpenMMLab but, based on your explanations, it looks like they are wrong w.r.t. simple notions of set theory and the definition of the Jaccard index.

ignore_index has nothing to do with the definition of the Jaccard index. I'm guessing it was meant, originally, to exclude some class from the calculation of the mean IoU, which would make sense when you're only interested in objects and you don't care about the background. Chopping arbitrary parts out of the label frame, based on the index value, makes no sense.

In other words, if there is some class you don't care about, simply exclude it from the mean and don't waste compute on it at all. Do not remove pixels of the label frame from consideration based on their value, or you will ignore key parts of the model's output during training/evaluation.
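A minimal sketch of that alternative, with hypothetical helpers (not the library's API): compute the IoU of every class over all pixels, then average only over the classes you care about.

import numpy as np

def per_class_iou(pred, label, num_labels):
    # IoU for every class, computed over all pixels (no pixel-level masking)
    ious = np.full(num_labels, np.nan)
    for c in range(num_labels):
        intersection = np.sum((pred == c) & (label == c))
        union = np.sum((pred == c) | (label == c))
        if union > 0:
            ious[c] = intersection / union
    return ious

def mean_iou_excluding(pred, label, num_labels, exclude=()):
    # classes you don't care about simply do not enter the mean
    ious = per_class_iou(pred, label, num_labels)
    keep = [c for c in range(num_labels) if c not in exclude]
    return np.nanmean(ious[keep])

# With the pred/mask example above: mean_iou_excluding(pred, mask, num_labels=2, exclude=(0,)) ≈ 0.333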

Could you try to compute mIoU using OpenMMLab's implementation and see whether you get the same result?

Made a quick notebook to illustrate usage: https://colab.research.google.com/drive/1d5bwM_yjC-BA0EvSpfaAYcrzL9geHi1o?usp=sharing. This will let us determine whether it's a mistake on my side or whether OpenMMLab effectively calculates it this way.

They have the same problem. ignore_index ignores chunks of both prediction frames and label frames based solely on the distribution of a pixel value in the label frame, which is bizarre. They create a mask for the mask. Ultimately, that leads to the same wrong value for IoU.

It's fine to use custom metrics to evaluate the performance of a model, but IoU has a specific definition, and what comes out of this computation is not IoU. It is literally recall (sensitivity).

Could you open an issue on mmsegmentation to flag this? mmsegmentation is regarded as the biggest research effort for semantic segmentation, lots of researchers (like the authors of SegFormer) are using it to develop new models.

They can potentially chime in on the discussion and give their reasoning for it.

Hey @FlorinAndrei and @NielsRogge, I came across this issue while looking at another issue from Florin over here, and I'm just leaving this comment to follow how it goes from here. Are we going to wait for @mmsegmentation to change their codebase and then port the fix here, or can we start working on it on our own?