Update contamination without retraining
jesuinovieira opened this issue · comments
Given that "contamination is just a universal parameter across all detection algorithms. what it is doing is to simply rank all points by raw outlier scores and then mark the top %contamination as outliers", can you modify the contamination value after the training phase?
For instance, suppose you initially trained the classifier with a contamination value of 0.01 but now wish to increase it to 0.1 without retraining the model. Is such an update feasible?
Given your model is clf = KNN(), at any moment you can set:
clf.contamination = 0.01
However, the predicted output remains unchanged after this adjustment, as you can see below:
from pyod.models.iforest import IForest
from pyod.utils.data import generate_data
contamination = 0.5
X_train, X_test, _, _ = generate_data(
n_train=1000, n_test=500, n_features=5, contamination=contamination, random_state=42
)
# Train clf with default contamination (0.1)
clf = IForest()
clf.fit(X_train)
y_hat = clf.predict(X_test)
print(f"Contamination: {round(clf.contamination, 5)}")
print(f"Test: {(y_hat == 1).sum()} out of {len(X_test)}\n")
# Update contamination to 0.5 and predict again
clf.contamination = contamination
y_hat = clf.predict(X_test)
print(f"Contamination: {round(clf.contamination, 5)}")
print(f"Test: {(y_hat == 1).sum()} out of {len(X_test)}")
Contamination: 0.1
Test: 65 out of 500
Contamination: 0.5
Test: 65 out of 500
Upon inspecting the codebase, I wonder if I should invoke _process_decision_scores or a similar method to update clf.threshold_, which is utilized in predict()?
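From my reading of the codebase, the thresholding in _process_decision_scores appears to amount to taking a percentile of the training scores. Below is a minimal standalone paraphrase of that logic on synthetic scores (my sketch, not PyOD's actual code):

```python
import numpy as np

def process_decision_scores(decision_scores, contamination):
    """Sketch of the thresholding step: flag the top `contamination`
    fraction of training scores as outliers."""
    # The threshold is the (1 - contamination) percentile of the scores,
    # so roughly contamination * len(scores) points exceed it.
    threshold = np.percentile(decision_scores, 100 * (1 - contamination))
    labels = (decision_scores > threshold).astype(int)
    return threshold, labels

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)  # stand-in for clf.decision_scores_

t1, labels1 = process_decision_scores(scores, 0.01)
t2, labels2 = process_decision_scores(scores, 0.1)
# A larger contamination lowers the threshold, flagging more points.
assert t2 < t1 and labels2.sum() > labels1.sum()
```

If this matches what PyOD does internally, then updating clf.contamination alone cannot change predict(), because the cached clf.threshold_ is only recomputed when this step runs.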
The modifications made to the code seem to be producing the desired results. By inserting clf._process_decision_scores() immediately after updating the contamination parameter, the new output is as follows:
Contamination: 0.1
Test: 61 out of 500
Contamination: 0.5
Test: 244 out of 500
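For completeness, an alternative that avoids calling a private method would be to re-derive the threshold yourself from the stored training scores and apply it to the raw test scores. A numpy-only sketch, using synthetic arrays in place of a fitted detector (the assumption being that clf.decision_scores_ holds the training scores and clf.decision_function(X_test) returns raw scores for new data):

```python
import numpy as np

rng = np.random.default_rng(42)
train_scores = rng.normal(size=1000)  # would be clf.decision_scores_
test_scores = rng.normal(size=500)    # would be clf.decision_function(X_test)

def predict_with_contamination(test_scores, train_scores, contamination):
    # Recompute the threshold for the new contamination level without
    # touching private helpers or refitting the model.
    threshold = np.percentile(train_scores, 100 * (1 - contamination))
    return (test_scores > threshold).astype(int)

y_01 = predict_with_contamination(test_scores, train_scores, 0.1)
y_05 = predict_with_contamination(test_scores, train_scores, 0.5)
# A higher contamination should flag more test points as outliers.
assert y_01.sum() < y_05.sum()
```

This keeps the fitted model untouched and makes the dependence on contamination explicit.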
Could you kindly verify whether these modifications are safe and the approach is appropriate?
This appears to be good!