Update contamination without retraining
jesuinovieira opened this issue · comments
Given that "contamination is just a universal parameter across all detection algorithms. what it is doing is to simply rank all points by raw outlier scores and then mark the top %contamination as outliers", can you modify the contamination value after the training phase?
For instance, suppose you initially trained the classifier with a contamination value of 0.01 but now wish to increase it to 0.1 without retraining the model. Is such an update feasible?
Given your model is clf = KNN(), at any moment you can set:
clf.contamination = 0.01
However, the predicted output remains unchanged after this adjustment, as you can see below:
from pyod.models.iforest import IForest
from pyod.utils.data import generate_data
contamination = 0.5
X_train, X_test, _, _ = generate_data(
n_train=1000, n_test=500, n_features=5, contamination=contamination, random_state=42
)
# Train clf with default contamination (0.1)
clf = IForest()
clf.fit(X_train)
y_hat = clf.predict(X_test)
print(f"Contamination: {round(clf.contamination, 5)}")
print(f"Test: {(y_hat == 1).sum()} out of {len(X_test)}\n")
# Update contamination to 0.5 and predict again
clf.contamination = contamination
y_hat = clf.predict(X_test)
print(f"Contamination: {round(clf.contamination, 5)}")
print(f"Test: {(y_hat == 1).sum()} out of {len(X_test)}")
Contamination: 0.1
Test: 65 out of 500
Contamination: 0.5
Test: 65 out of 500
Upon inspecting the codebase, I wonder if I should invoke _process_decision_scores or a similar method to update clf.threshold_, which is utilized in predict()?
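From my reading of the codebase, the thresholding in _process_decision_scores appears to amount to taking a percentile of the training scores. Below is a minimal standalone paraphrase of that logic on synthetic scores (my sketch, not PyOD's actual code):

```python
import numpy as np

def process_decision_scores(decision_scores, contamination):
    """Sketch of the thresholding step: flag the top `contamination`
    fraction of training scores as outliers."""
    # The threshold is the (1 - contamination) percentile of the scores,
    # so roughly contamination * len(scores) points exceed it.
    threshold = np.percentile(decision_scores, 100 * (1 - contamination))
    labels = (decision_scores > threshold).astype(int)
    return threshold, labels

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)  # stand-in for clf.decision_scores_

t1, labels1 = process_decision_scores(scores, 0.01)
t2, labels2 = process_decision_scores(scores, 0.1)
# A larger contamination lowers the threshold, flagging more points.
assert t2 < t1 and labels2.sum() > labels1.sum()
```

If this matches what PyOD does internally, then updating clf.contamination alone cannot change predict(), because the cached clf.threshold_ is only recomputed when this step runs.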
The modifications made to the code seem to be producing the desired results. By inserting clf._process_decision_scores() immediately after updating the contamination parameter, the new output is as follows:
Contamination: 0.1
Test: 61 out of 500
Contamination: 0.5
Test: 244 out of 500
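For completeness, an alternative that avoids calling a private method would be to re-derive the threshold yourself from the stored training scores and apply it to the raw test scores. A numpy-only sketch, using synthetic arrays in place of a fitted detector (the assumption being that clf.decision_scores_ holds the training scores and clf.decision_function(X_test) returns raw scores for new data):

```python
import numpy as np

rng = np.random.default_rng(42)
train_scores = rng.normal(size=1000)  # would be clf.decision_scores_
test_scores = rng.normal(size=500)    # would be clf.decision_function(X_test)

def predict_with_contamination(test_scores, train_scores, contamination):
    # Recompute the threshold for the new contamination level without
    # touching private helpers or refitting the model.
    threshold = np.percentile(train_scores, 100 * (1 - contamination))
    return (test_scores > threshold).astype(int)

y_01 = predict_with_contamination(test_scores, train_scores, 0.1)
y_05 = predict_with_contamination(test_scores, train_scores, 0.5)
# A higher contamination should flag more test points as outliers.
assert y_01.sum() < y_05.sum()
```

This keeps the fitted model untouched and makes the dependence on contamination explicit.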
Could you kindly verify whether these modifications are safe and the approach is appropriate?
This appears to be good!