Doubt about MarginConfidenceReason :-)

Question

dvsrepo opened this issue 3 years ago · comments

Hi Vincent,

Nice library! As mentioned a while ago on Twitter I'm doing a review to understand and compare different approaches to find label errors.

I'm playing with the AG News dataset, which we know contains a lot of errors from our own previous experiments with Rubrix (using the training loss and using cleanlab).

While playing with the different reasons, I'm having difficulty understanding the reasoning behind the MarginConfidenceReason. As far as I can tell, if the model is in doubt, the margin between the top two predicted labels should be small, and that could point to an ambiguous example and/or a label error. If I read the code and description correctly, MarginConfidenceReason is doing the opposite: it flags examples where the margin is large. I'd love to know the reasoning behind this to make sure I'm not missing something.
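To make the expectation concrete, here is a minimal sketch of the behaviour I had in mind. It is not the library's code: `margin_doubt` is a hypothetical helper, the probabilities are made up, and I'm assuming the input is an array of per-class probabilities as returned by a typical `predict_proba`.

```python
import numpy as np

def margin_doubt(proba, threshold=0.2):
    """Flag rows whose top-two class probabilities differ by less than `threshold`.

    Hypothetical helper for illustration only, not the library's implementation.
    """
    proba = np.asarray(proba, dtype=float)
    # Sort each row ascending; the last two columns hold the two largest probabilities.
    top_two = np.sort(proba, axis=1)[:, -2:]
    margin = top_two[:, 1] - top_two[:, 0]
    # A small margin means the model can barely separate its two best guesses.
    return margin < threshold

# Made-up probabilities for a four-class problem such as AG News.
proba = np.array([
    [0.90, 0.05, 0.03, 0.02],  # confident prediction, margin 0.85
    [0.40, 0.35, 0.15, 0.10],  # ambiguous prediction, margin 0.05
    [0.55, 0.30, 0.10, 0.05],  # fairly confident, margin 0.25
])
flags = margin_doubt(proba, threshold=0.2)
print(flags)  # [False  True False]
```

With the comparison reversed (`margin > threshold`), the first and third rows would be flagged instead of the second. On a reasonably accurate model that would select most of the data, which seems consistent with what I'm seeing.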

For context, using the MarginConfidenceReason with the AG News training set yields almost the entire dataset (117788 examples for the default threshold of 0.2, and 112995 for threshold=0.5). I guess this could become useful when there's overlap with other reasons, but I want to make sure about the reasoning :-).

vincent d warmerdam · Answer 1 · Tue Dec 07 2021 19:20:43 GMT+0800 (China Standard Time)

I think you may have spotted a bug. Lemme fix that real quick and make a new release.

vincent d warmerdam · Answer 2 · Tue Dec 07 2021 19:47:17 GMT+0800 (China Standard Time)

Just released version 0.1.3 with a fix. Thanks for reporting!