koaning / doubtlab

Doubt your data, find bad labels.

Home Page: https://koaning.github.io/doubtlab/

Doubt about MarginConfidenceReason :-)

dvsrepo opened this issue

Hi Vincent,

Nice library! As mentioned a while ago on Twitter, I'm doing a review to understand and compare different approaches to finding label errors.

I'm playing with the AG News dataset, which we know from our own previous experiments with Rubrix (using the training loss and cleanlab) contains a lot of errors.

While playing with the different reasons, I'm having difficulty understanding the reasoning behind the MarginConfidenceReason. As far as I can tell, if the model is in doubt, the margin between the top two predicted labels should be small, and that could point to an ambiguous example and/or a label error. If I read the code and description correctly, MarginConfidenceReason is doing the opposite, so I'd love to know the reasoning behind this to make sure I'm not missing something.

For context, using the MarginConfidenceReason with the AG News training set yields almost the entire dataset (117788 examples for the default threshold of 0.2, and 112995 for threshold=0.5). I guess this could become useful when there's overlap with other reasons, but I want to make sure about the reasoning :-).
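
To make the expectation concrete, here is a minimal NumPy sketch of the margin idea as described in this issue. It is not doubtlab's implementation: the helper name `margin_doubt` and the toy probabilities are hypothetical, and only the default threshold of 0.2 is taken from the report above.

```python
import numpy as np


def margin_doubt(proba, threshold=0.2):
    """Hypothetical helper (not part of doubtlab): flag rows whose
    margin between the two highest class probabilities is below threshold."""
    proba = np.asarray(proba, dtype=float)
    # Sort each row ascending; the last two columns hold the top-two probabilities.
    top_two = np.sort(proba, axis=1)[:, -2:]
    margin = top_two[:, 1] - top_two[:, 0]
    # A small margin means the model can barely separate its two best guesses.
    return margin < threshold


# Toy predicted probabilities for a four-class problem (made-up values).
proba = np.array([
    [0.90, 0.05, 0.03, 0.02],  # confident prediction, large margin
    [0.40, 0.35, 0.15, 0.10],  # ambiguous prediction, small margin
    [0.30, 0.30, 0.20, 0.20],  # tie between the top two, zero margin
])

flags = margin_doubt(proba)
print(flags)  # [False  True  True]
```

Flipping the comparison to `margin > threshold` would flag the confident rows instead, which for a reasonably well-fit model is likely to be most of the dataset.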

I think you may have spotted a bug. Lemme fix that real quick and make a new release.

Just released version 0.1.3 with a fix. Thanks for reporting!