yzhao062 / anomaly-detection-resources

Anomaly detection related books, papers, videos, and toolboxes


Is there any possibility that an ordinary supervised model performs better than an outlier detection algorithm on this task?

Minqi824 opened this issue · comments

I have tried some of the outlier detection datasets (ODDS) from this website, such as the Annthyroid dataset (http://odds.cs.stonybrook.edu/annthyroid-dataset/).

However, when I compare some ordinary supervised models (e.g., SVM and Random Forest) against them, the results indicate that SVM and RF perform much better than anomaly detection algorithms such as OC-SVM and Isolation Forest.

I wonder what the reason for this strange result is, because theoretically the outlier detection algorithms should perform better on an outlier detection task. Could anyone help me figure out this problem? Thanks!
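For concreteness, this is roughly the comparison I mean, as a minimal sketch; the file name annthyroid.mat, the X/y keys, and the split/parameters are assumptions about the ODDS .mat format, not a fixed protocol:

```python
# Minimal sketch: supervised RF (uses labels) vs. unsupervised Isolation Forest
# (never sees labels) on an ODDS-style .mat file, scored with ROC-AUC.
from scipy.io import loadmat
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.metrics import roc_auc_score

data = loadmat("annthyroid.mat")          # assumed ODDS .mat file
X, y = data["X"], data["y"].ravel()       # y: 1 = outlier, 0 = normal

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)

# Supervised baseline: trained with the ground-truth labels.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
rf_scores = rf.predict_proba(X_test)[:, 1]

# Unsupervised detector: fit on the features only.
iso = IsolationForest(random_state=0).fit(X_train)
iso_scores = -iso.score_samples(X_test)   # higher = more anomalous

print("RF ROC-AUC:              ", roc_auc_score(y_test, rf_scores))
print("IsolationForest ROC-AUC: ", roc_auc_score(y_test, iso_scores))
```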

[Sorry for my bad English skills]
The way I see it, the difference between ordinary classification (on imbalanced data) and outlier detection comes down to supervised vs. unsupervised learning.

As far as I know, OC-SVM performs outlier detection without knowing the anomalies. Even though the ODDS data tells you which points are abnormal, in a real problem we usually do not know which samples are abnormal. If we do not know which data are abnormal, SVM and RF cannot even be used.

If the exact anomaly labels are given, the high performance of SVM looks reasonable to me.
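To make the supervised/unsupervised distinction concrete, here is a minimal sketch (synthetic data, only to illustrate the call signatures, not any real benchmark):

```python
# OneClassSVM is fit on (presumed) normal data with no labels at all;
# SVC cannot even be fit without a label vector y.
import numpy as np
from sklearn.svm import OneClassSVM, SVC

rng = np.random.RandomState(0)
X_normal = rng.normal(0, 1, size=(500, 2))        # unlabeled "normal" training data
X_new = np.vstack([rng.normal(0, 1, (10, 2)),     # new points to score
                   rng.normal(6, 1, (5, 2))])     # a few obvious outliers

# Unsupervised: no y anywhere in training.
ocsvm = OneClassSVM(nu=0.05).fit(X_normal)
print(ocsvm.predict(X_new))                       # +1 = inlier, -1 = outlier

# Supervised: needs a label for every training point.
# Without labels, SVC().fit(X, y) simply cannot be called.
```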


Thanks for your great answer!
I agree with your opinion, since many anomaly detection tasks may not have labels at all (or labeling them is expensive). This may be one of the reasons why we usually compare supervised learning models against supervised learning models, and anomaly detection algorithms against other detection algorithms.

Another confusion is why these supervised algorithms (like SVM and RF) perform well even on a highly imbalanced dataset (e.g., the Annthyroid dataset in ODDS, containing 7.42% positive samples). Intuitively speaking, an ordinary classification model might classify all samples as the majority (negative) class and fail to detect the anomalous samples, but again the empirical results indicate that my intuition may be wrong.
Could you please explain the above problems, or even try some models on the Annthyroid dataset? Thanks a lot!
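One way to check that intuition is to compare against an "always predict normal" baseline; here is a minimal sketch on synthetic data with roughly the Annthyroid imbalance (the numbers are illustrative, not results on the real dataset):

```python
# The majority-class baseline gets high accuracy but zero F1 on the positive
# class, so a non-trivial F1 from RF means it really separates the minority class.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.926], random_state=0)   # ~7.4% positives
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

for name, model in [("all-normal baseline", baseline), ("random forest", rf)]:
    pred = model.predict(X_te)
    print(name,
          "accuracy:", round(accuracy_score(y_te, pred), 3),
          "F1 (positive class):", round(f1_score(y_te, pred, zero_division=0), 3))
```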

Actually, I tried most of the datasets in ODDS (http://odds.cs.stonybrook.edu/annthyroid-dataset/) and uploaded the results to my GitHub repository (https://github.com/jmq19950824/Anomaly-Detection/blob/master/ODDs.ipynb).

The results indicate that even a binary classification algorithm (SVM here) can do well on the anomaly detection task. Can anyone explain this result?

A rule of thumb: if you have labels, using supervised models is preferred, even for anomaly detection.
Charu Aggarwal, Outlier Analysis, Second Edition, page 26:
[screenshot of the cited passage]

@yzhao062, great answer, thanks a lot. I notice that there is a sentence: "Supervised outlier detection is a (difficult) special case of the classification problem. The main characteristic of this problem is that the labels are extremely unbalanced in terms of relative presence. Since anomalies are far less common than normal points, it is possible for off-the-shelf classifiers to predict all test points as normal points and still achieve excellent accuracy."

I tried some supervised models (like Random Forest) on some extremely unbalanced datasets, such as the Credit Card Fraud Detection (CCFD) dataset on Kaggle (https://www.kaggle.com/mlg-ulb/creditcardfraud), where the positive samples make up only 0.172% of the whole dataset (i.e., extremely unbalanced).
However, Random Forest still performs well on this dataset (about 0.7-0.8 F1-score). Could you please explain these results? Thanks!
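For that level of imbalance, the setup worth sanity-checking is roughly the sketch below; it assumes the Kaggle CSV (creditcard.csv) with a Class column (1 = fraud), and the parameters are only illustrative. With 0.172% positives, minority-class F1 and PR-AUC are more informative than accuracy:

```python
# Supervised RF on an extremely imbalanced fraud dataset, with class re-weighting
# and threshold-free evaluation via average precision (PR-AUC).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score, f1_score

df = pd.read_csv("creditcard.csv")                  # assumed Kaggle CCFD file
X = df.drop(columns=["Class"]).values
y = df["Class"].values                              # 1 = fraud, 0 = normal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" up-weights the rare positive class during training.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                            n_jobs=-1, random_state=0).fit(X_tr, y_tr)

scores = rf.predict_proba(X_te)[:, 1]
print("PR-AUC (average precision):", average_precision_score(y_te, scores))
print("F1 at the default 0.5 threshold:", f1_score(y_te, rf.predict(X_te)))
```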