KulikDM / pythresh

Outlier Detection Thresholding

Home Page: https://pythresh.readthedocs.io/en/latest/?badge=latest

Use threshold/limit for incoming data

totifra opened this issue

Hey there,

I am wondering how I could use the different threshold methods for new incoming data after fitting to a training dataset. In the current implementation, each call of eval() re-computes the limit and re-sets the threshold (self.thresh_ = limit), hence I cannot use eval() for new incoming data. I could access self.thresh_ directly, but then I would need to know how the normalization was done for the training data, and this information is not stored in the BaseThresholder.
Am I missing something? Or is my use case not covered by this package?

(Otherwise I would just normalize the training data beforehand and reuse that normalizer for new incoming data. That might work.)

Thanks in advance!
totifra

Hey @totifra thanks for the great question! I'll also add this answer to the FAQ in the docs.

So there are a few ways to threshold new incoming data with respect to the training dataset. My suggested method is to compute the outlier likelihood scores of the incoming data with respect to the training data. This can be done with many of the outlier detection methods (e.g. using the decision_function method of a fitted PyOD model). It is important to note that not all outlier detection methods implement this functionality correctly, so it is best to check. The threshold method can then be called independently for both datasets with reasonable confidence that the new data is being thresholded with respect to the training dataset, simply based on the likelihood scores.
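A minimal sketch of this first approach might look like the following (the KNN detector and FILTER thresholder are just illustrative choices, and the randomly generated X_train/X_new arrays stand in for your own data):

```python
import numpy as np
from pyod.models.knn import KNN
from pythresh.thresholds.filter import FILTER

rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 2))   # stand-in training data
X_new = rng.normal(size=(50, 2))      # stand-in batch of incoming data

# Fit the detector on the training data; decision_function then scores
# the new data with respect to what was learned from the training data
clf = KNN()
clf.fit(X_train)
train_scores = clf.decision_scores_
new_scores = clf.decision_function(X_new)

# Call the thresholder independently on each set of likelihood scores
thres = FILTER()
train_labels = thres.eval(train_scores)
new_labels = thres.eval(new_scores)
```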

However, if this is not sufficient and you would like more control over the thresholding, you can also try the above-mentioned method with a few extra steps (a code sketch follows the list below).

  • Fit an outlier detection model to a training dataset
  • MinMax normalize the likelihood scores (you are correct about this)
  • Evaluate the normalized likelihood scores with a thresholding method
  • Get the threshold point from the fitted thresholder's .thresh_ attribute, as shown in https://pythresh.readthedocs.io/en/latest/example.html
  • Apply the decision function of the fitted outlier detection method to the new incoming data and get the likelihood scores
  • Normalize the new likelihood scores with the fitted MinMax from the training dataset
  • Threshold these new scores using the thresh_ value that you obtained earlier, like this: new_labels = cut(normalized_new_scores, thresh_value), where the cut function can be imported from pythresh.thresholds.thresh_utility
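
Putting those steps together, a rough sketch could look like this (again, the KNN detector and FILTER thresholder are illustrative stand-ins, as are the randomly generated X_train and X_new; swap in whatever detector and thresholder you actually use):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from pyod.models.knn import KNN
from pythresh.thresholds.filter import FILTER
from pythresh.thresholds.thresh_utility import cut

rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 2))   # stand-in training data
X_new = rng.normal(size=(1, 2))       # stand-in incoming sample(s)

# 1. Fit an outlier detection model to the training dataset
clf = KNN()
clf.fit(X_train)
train_scores = clf.decision_scores_

# 2. MinMax normalize the training likelihood scores and keep the scaler
scaler = MinMaxScaler()
train_norm = scaler.fit_transform(train_scores.reshape(-1, 1)).ravel()

# 3. + 4. Evaluate the normalized scores and store the threshold point
thres = FILTER()
train_labels = thres.eval(train_norm)
thresh_value = thres.thresh_

# 5. Score the new incoming data with the fitted detector
new_scores = clf.decision_function(X_new)

# 6. Normalize the new scores with the scaler fitted on the training scores
new_norm = scaler.transform(new_scores.reshape(-1, 1)).ravel()

# 7. Threshold the new scores against the stored threshold point
new_labels = cut(new_norm, thresh_value)
```

Since the last three steps only rely on the fitted detector, the fitted scaler, and the stored thresh_value, they can be run on each incoming sample (or batch) as it arrives.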

Note that if the training dataset is not meant to contain outliers but rather serves as a reference or baseline for new incoming data, the first method is probably the better option. If both the training and the new datasets are suspected of containing outliers and the data drift between the two is small, the second option should work well.

Hope this helps and works for you ;)

totifra commented

Hey @KulikDM,

thanks for your quick and detailed response.

The threshold method can then be called independently for both datasets with reasonable confidence that the new data is being thresholded with respect to the training dataset, simply based on the likelihood scores.

I hadn't thought about it in that way. Makes total sense! Thanks for enlightening me 😄! However, I think I have to stick to the second approach, since I will usually receive only one new sample at a time instead of a dataset of several samples.

Thanks for your help!