DeclanMcIntosh / InReaCh

How to prevent data leakage in your model?

wentaoxie7 opened this issue

Hi authors,

I am curious about the data preprocessing in your paper. You mentioned that you added some images from the MVTec test set into each class's training dataset, but you still evaluate on the entire test set. Since these test images are used in both training and testing, is there a possible data leakage? If not, how do you prevent it?

We present results both with and without overlapping training and test data. For previous methods, the test images added to the training data are effectively label noise, because those methods make a one-class classification assumption: they memorize the injected anomalies as nominal features and then misclassify them at test time, having already seen them during training. Our method makes no assumption about the class of the training data. Since it is fully unsupervised, training on some of the test data is really a test of whether our model can reject anomalies during training, which is necessary to show that the method is unsupervised. So the data leakage is intentional: it demonstrates that the filtering used in constructing our nominal model works effectively.

Notably, in anomaly detection the hardest unsupervised split would be to train on ALL the data and test on ALL the data. The splits we chose were a compromise with how the dataset was set up for supervised training.
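For concreteness, here is a minimal Python sketch of how such an overlapping split could be constructed. This is illustrative only, not the paper's actual code: the function name, the `contamination_fraction` parameter, and its default value are assumptions, though the directory layout follows the standard MVTec AD format.

```python
import random
from pathlib import Path

def build_overlapping_split(category_dir, contamination_fraction=0.1, seed=0):
    """Construct a training pool that deliberately overlaps with the test set.

    Assumes the standard MVTec AD layout:
      <category>/train/good/*.png and <category>/test/<defect_type>/*.png.
    `contamination_fraction` (a hypothetical knob, not a value from the paper)
    controls what share of the test images is mixed into training.
    """
    category_dir = Path(category_dir)
    train_paths = sorted((category_dir / "train" / "good").glob("*.png"))
    test_paths = sorted((category_dir / "test").glob("*/*.png"))

    rng = random.Random(seed)
    n_leak = int(len(test_paths) * contamination_fraction)
    leaked = rng.sample(test_paths, n_leak)

    # The training pool now contains unlabeled test images, including
    # anomalous ones; an unsupervised method must reject those on its own.
    train_pool = train_paths + leaked
    # Evaluation still runs on the *entire* test set, overlap included.
    return train_pool, test_paths

train_pool, test_set = build_overlapping_split("mvtec_ad/bottle")
print(f"{len(train_pool)} training images, {len(test_set)} test images")
```

The point of the overlap is exactly what the reply above describes: a one-class method would treat the leaked anomalies as nominal and fail on them at test time, whereas a genuinely unsupervised method should filter them out while building its nominal model.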