koaning / doubtlab

Doubt your data, find bad labels.

Home Page: https://koaning.github.io/doubtlab/

Add staticmethods to reasons to prevent re-compute.

dvsrepo opened this issue

I really like the current design with reasons just being function calls.

However, when working with large datasets, or in use cases where you already have the predictions of a model, I wonder if you have thought about letting users pass either a sklearn model or pre-computed probas (for those Reasons where it makes sense). For threshold-based reasons and large datasets this could save time and compute, allow for faster iteration, and it would open up the possibility of using models beyond sklearn.

I understand that the design wouldn't be as clean as it is right now and might cause misalignments if users don't send the correct shapes/positions, but I wonder if you have considered this (or any other way to pass pre-computed predictions).

Just to illustrate what I mean (sorry about the dirty pseudo-code):

import numpy as np


class ProbaReason:

    def __init__(self, model=None, probas=None, max_proba=0.55):
        if model is None and probas is None:
            raise ValueError("You should pass at least a model or probas")
        self.model = model
        self.probas = probas
        self.max_proba = max_proba

    def __call__(self, X, y=None):
        # Reuse the precomputed probas if given, otherwise compute them from the model.
        probas = self.probas if self.probas is not None else self.model.predict_proba(X)
        result = probas.max(axis=1) <= self.max_proba
        return result.astype(np.float16)

By design, the currently implemented reasons that require a scikit-learn model to be passed don't retrain that model internally. So if you have a large dataset, you won't need to worry about retraining. If that isn't the case, you may have found a bug.

If you do have a large dataset, however, it's currently up to you to chunk it into smaller batches. This seems like good practice in general because you can look for items on a batch-by-batch basis.

Wouldn't the batch-by-batch approach be more pragmatic in your case?
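
To make that concrete, here's a minimal sketch of what such a batch loop could look like. It only uses the ensemble API shown in the examples further down; the batch size and the index bookkeeping are illustrative, not something the library does for you.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from doubtlab.ensemble import DoubtEnsemble
from doubtlab.reason import ProbaReason, WrongPredictionReason

# A pre-fitted model, just like in the examples further down.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1_000)
model.fit(X, y)

doubt = DoubtEnsemble(
    proba=ProbaReason(model=model),
    wrong_pred=WrongPredictionReason(model=model),
)

# Run the ensemble batch by batch and map the per-batch indices
# back to positions in the full dataset.
batch_size = 50  # illustrative; pick something larger for a real dataset
flagged = []
for start in range(0, len(X), batch_size):
    X_batch, y_batch = X[start:start + batch_size], y[start:start + batch_size]
    local_indices = doubt.get_indices(X_batch, y_batch)
    flagged.extend(start + np.asarray(local_indices))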

Part of the reasoning here is that it's more effective to have different models that offer different perspectives. Maybe one model uses count vectors while another one uses the universal sentence encoder. Both of these models would produce different probas, but where the reasons based on these proba values overlap ... that's where we might want to give priority.
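
For illustration, here's a hedged sketch of that kind of two-perspective ensemble. It uses tf-idf as a stand-in for the universal sentence encoder so the snippet stays scikit-learn only; the tiny dataset and the reason names are made up for the example.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

from doubtlab.ensemble import DoubtEnsemble
from doubtlab.reason import ProbaReason

# A toy text dataset, just to have something to fit on.
texts = np.array([
    "great product", "terrible service", "really great", "awful, terrible",
    "love it", "hate it", "not bad at all", "would not recommend",
])
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Two models that look at the same text from different angles.
model_counts = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1_000))
model_tfidf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1_000))
model_counts.fit(texts, labels)
model_tfidf.fit(texts, labels)

# Each model contributes its own proba-based reason; examples where both
# reasons fire deserve the most doubt.
doubt = DoubtEnsemble(
    proba_counts=ProbaReason(model=model_counts),
    proba_tfidf=ProbaReason(model=model_tfidf),
)
indices = doubt.get_indices(texts, labels)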

If I change the API to allow for precomputed proba values, I worry about human copy/paste errors. You'd need to keep many proba matrices around and track them manually. This is nearly the same as writing custom functions.

Then again ... you are on to something: some of these models may be run extra times, which could be prohibitive.

Lemme whip up a quick example so that we have something tangible to discuss.

This is the standard way of working now:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from doubtlab.ensemble import DoubtEnsemble
from doubtlab.reason import ProbaReason, WrongPredictionReason

# Let's say we have some dataset/model already
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1_000)
model.fit(X, y)

# Next we can add reasons for doubt. In this case we're saying
# that examples deserve another look if the associated proba values
# are low or if the model output doesn't match the associated label.
reasons = {
    'proba': ProbaReason(model=model),
    'wrong_pred': WrongPredictionReason(model=model)
}

# Pass these reasons to a doubtlab instance.
doubt = DoubtEnsemble(**reasons)

# Get the ordered indices of examples worth checking again
indices = doubt.get_indices(X, y)

This is what you could do with custom functions:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from doubtlab.ensemble import DoubtEnsemble
from doubtlab.reason import ProbaReason, WrongPredictionReason

# Let's say we have some dataset/model already
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1_000)
model.fit(X, y)

# Let's just pre-calc some stuff
probas = model.predict_proba(X)
preds = model.predict(X)

# Refer to the precalc stuff here. 
reasons = {
    'proba': lambda X, y: probas.max(axis=1) <= 0.4,
    'wrong_pred': lambda X, y: preds != y
}

# Pass these reasons to a doubtlab instance.
doubt = DoubtEnsemble(**reasons)
# Get the ordered indices of examples worth checking again
indices = doubt.get_indices(X, y)

Wouldn't the functional API already support it? What I'm worried about is that if we need to support probas we may also need to support preds and the API could get ugly real fast.

Would it perhaps help if we add classmethods to the reasons? That might suffice.

I'm leaning towards classmethods since these might also make the codebase easier to test and we could support a naming convention like from_probas() or from_preds(). You would need to be careful to ensure that the shapes of X and probas are kept aligned. Might need to check for that.
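
To have something concrete to react to, here's a rough sketch of what such a classmethod might look like; the class, the from_probas() name, and the shape check are hypothetical, not the library's actual API.

import numpy as np


class PrecomputedProbaReason:
    """Hypothetical sketch of a proba-based reason that can reuse precomputed values."""

    def __init__(self, model=None, probas=None, max_proba=0.55):
        self.model = model
        self.probas = probas
        self.max_proba = max_proba

    @classmethod
    def from_probas(cls, probas, max_proba=0.55):
        # No model needed; the reason reuses the precomputed matrix.
        return cls(model=None, probas=np.asarray(probas), max_proba=max_proba)

    def __call__(self, X, y=None):
        probas = self.probas if self.probas is not None else self.model.predict_proba(X)
        # Guard against the copy/paste misalignment mentioned above.
        if len(probas) != len(X):
            raise ValueError("probas and X must have the same number of rows")
        return (probas.max(axis=1) <= self.max_proba).astype(np.float16)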

Yes, I think classmethods would be a good design choice. In any case, I understand why you designed it this way, and indeed custom functions would be a way to go for reusing preds and probas, at the cost of having to reimplement the reasons, although they are quite compact already.
As for the batch computation, I agree that you could split the work like this and maybe even filter your dataset beforehand, but it would add complexity if you want to iterate on the thresholds. Also, batch processing works for local methods, but global methods like cleanlab benefit from having the full data available. I guess if you plan to add more global methods, the ability to instantiate reasons (via staticmethods) with precomputed values could make sense. Again, these are only some thoughts after playing with the library. Finally, as a disclaimer, this feature might also be useful for a potential integration of doubtlab with Rubrix :)

I'll certainly explore it for another release then.

Odds are that I'll also start thinking about adding support for spaCy. I jotted some ideas here: #4.

Cool! For NER, we have implemented several metrics and measurements, inspired by Explainaboard, which can be used both for predictions and annotations (labels). We don't yet provide disagreement measures (e.g., when entity spans disagree between prediction and annotation, when labels disagree, etc.), but we'll add those soon too. What we likely won't cover is disagreement between models (given our current data model design), so that'd be interesting to see.

These metrics are computed at the record level when you log data into Rubrix and can then be aggregated/queried, used with Pandas (via rb.load), etc.

Here's a quick example (showing the entity consistency metric, which is helpful for finding mentions which have different labels across the dataset):
https://vimeo.com/653199621

You can take a look here:
https://rubrix.readthedocs.io/en/stable/guides/metrics.html#1.-Rubrix-Metrics-for-NER-pipelines-predictions

This has now been added in v0.1.4. The docs reflect the change.