A method for detecting abnormal subsequences (discords) in data streams.
This work:
✅ Proposes an abnormal subsequence detection method
✅ Compares it with state-of-the-art (SOTA) abnormal subsequence detection methods in terms of latency and performance
Feel free to contact me at: xxxx@xx (this address will be updated after the review; if there is any problem, open an issue and I will reply)
*The final version of our paper, Drag-stream, is being submitted to the ICDM 2022 conference.
- Methods compared: Presentation of the methods we compared
- Datasets and their characteristics: Brief description of the datasets and the characteristics identified
- Description of the experimental protocol: How each method is tuned and evaluated
- Results: Presentation of the results obtained
- Reproducibility: Details on how to reproduce our tests
- References
Like most anomaly detection methods, the methods below produce an anomaly score for each incoming instance, indicating how likely the instance is to be an anomaly. A user-defined threshold then flags every instance whose anomaly score exceeds it as an anomaly. In the literature, data stream anomaly detection methods are mostly divided into statistical-based, tree-based, proximity-based and deep-learning-based approaches. We chose widely used and recommended approaches in each of these categories.
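The score-then-threshold step described above can be sketched as follows. This is a minimal illustration, not code from this repository: the function name `flag_anomalies` and the example values are hypothetical.

```python
def flag_anomalies(scores, threshold):
    """Return one boolean per instance: True if its anomaly score
    is strictly higher than the user-defined threshold."""
    return [score > threshold for score in scores]


# Hypothetical anomaly scores for five incoming instances.
scores = [0.10, 0.85, 0.40, 0.95, 0.20]
flags = flag_anomalies(scores, threshold=0.5)
print(flags)
```

Raising the threshold flags fewer instances, which tends to trade recall for precision.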
Methods:
- LAMP: A method for abnormal subsequence detection in data streams, inspired by the Matrix Profile
- [Drag-stream]: Our proposal for discord detection
- Matrix Profile: Time series abnormal subsequence detection
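To make the notion of a discord concrete: it is commonly defined as the subsequence whose distance to its nearest non-overlapping neighbour is the largest. The sketch below is a brute-force illustration of that definition only. It is not the Matrix Profile, LAMP or Drag-stream algorithm, and the names `znorm` and `top_discord` are hypothetical.

```python
import math


def znorm(seq):
    """Z-normalize a subsequence (zero mean, unit variance)."""
    mean = sum(seq) / len(seq)
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / len(seq))
    if std == 0:
        return [0.0] * len(seq)
    return [(x - mean) / std for x in seq]


def top_discord(series, m):
    """Return (start index, distance) of the length-m subsequence whose
    nearest non-overlapping neighbour is farthest away. Brute force, O(n^2)."""
    n = len(series) - m + 1
    subs = [znorm(series[i:i + m]) for i in range(n)]
    best_idx, best_dist = -1, -1.0
    for i in range(n):
        nearest = math.inf
        for j in range(n):
            if abs(i - j) < m:  # skip overlapping (trivial) matches
                continue
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(subs[i], subs[j])))
            nearest = min(nearest, d)
        if nearest != math.inf and nearest > best_dist:
            best_idx, best_dist = i, nearest
    return best_idx, best_dist


# Periodic toy signal with one injected spike at position 22.
series = [0, 1, 0, -1] * 10
series[22] = 5
idx, dist = top_discord(series, m=4)
print(idx, dist)
```

On this toy signal the reported subsequence contains the injected spike, because every other subsequence has an exact repeat elsewhere in the series.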
We selected datasets from diverse domains, mostly drawn from real-life problems.
Dataset | Domain | Size | Number Of Known Anomalies | Has Concept Drifts |
---|---|---|---|---|
stdb_308_1 | ECG | 5400 | 1 | no |
xmitdb_x108_1 | ECG | 5400 | 1 | yes |
mitdb_100_180_1 | ECG | 5400 | 1 | no |
chfdb_chf01_275_1 | ECG | 3751 | 1 | no |
ltstdb_20221_43_1 | ECG | 5400 | 1 | no |
mitdbx_108 | ECG | 16000 | 3 | yes |
qtdbsele0606 | ECG | 15000 | 1 | no |
chfdbchf15 | ECG | 15000 | 1 | no |
ann-gun | video recording | 11248 | 1 | no |
patient respiration | pneumology | 6500 | 1 | yes |
dutch power demand | power demand | 35040 | 4 | no |
gps trajectory | GPS | 17175 | 1 | no |
For each dataset, Bayesian optimization is performed to find the best hyperparameters (the hyperparameter search space of each method is detailed in the implementation details section (page 8) of the summary_of_the_experiment file). We then run each method with its best hyperparameters and record the execution time and the F1-score. Finally, we compute the latency, or response time, which is the average time needed to process one instance (latency = execution time on the dataset divided by the number of instances). The F1-score is used because it takes into account both the precision and the recall of each method.
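The two metrics of the protocol can be sketched as follows. This is a minimal illustration under assumptions: the F1-score is computed point-wise over boolean labels (how detections are matched to known anomalies in the actual experiments is not specified here), and the function names `f1_score` and `latency` are hypothetical.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall over boolean labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0  # no true positive: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def latency(execution_time, n_instances):
    """Average time needed to process one instance."""
    return execution_time / n_instances


# Hypothetical labels: 1 = anomaly, 0 = normal.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]
print(f1_score(y_true, y_pred))
print(latency(10.0, 5000))
```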
Dataset | DragStream F1-score | DragStream Params | LAMP F1-score | LAMP Params | Matrix Profile F1-score | Matrix Profile Params |
---|---|---|---|---|---|---|
stdb_308_1 | 0.19 | C=15, W=1330, r=8 | 0.22 | W=1350 | 0.069 | p=2 |
xmitdb_x108_1 | 0.24 | C=14, W=1256, r=2.5 | 0 | W=1350 | 0.554 | p=3 |
mitdb__100_180_1 | 0.5 | C=16, W=1236, r=4.5 | 0 | W=1350 | 0.5468 | p=3 |
chfdb_chf01_275_1 | 0.5 | C=17, W=751, r=2.5 | 0.09 | W=937 | 0.63 | p=3 |
ltstdb_20221_43_1 | 0.4 | C=19, W=440, r=3.0 | 0.1 | W=937 | 0.415 | p=1 |
mitdbx_108 | 0.48 | C=10, W=4479, r=3.5 | 0.285 | W=5400 | 0.821 | p=3 |
qtdbsele0606 | 0.01 | C=10, W=222, r=4.5 | 0.55 | W=3750 | 0.005 | p=1 |
chfdbchf15 | 0.5 | C=15, W=2915, r=1.5 | 0.067 | W=3750 | 0.81 | p=1 |
ann-gun | 0.36 | C=13, W=178, r=1.0 | 0.26 | W=2812 | 0.026 | p=3 |
patient respiration | 0.67 | C=14, W=1011, r=4.5 | 0.24 | W=1627 | 0.46 | p=3 |
dutch power demand | 0.56 | C=29, W=4433, r=2.0 | 0.1639 | W=8760 | 0.75 | p=5 |
gps trajectory | 0.286 | C=18, W=4210, r=8.5 | 0 | W=4293 | 0.08 | p=2 |
Mean |
Dataset | DragStream | LAMP | Matrix Profile |
---|---|---|---|
stdb_308_1 | 7.1 | 554 | 9.54 |
xmitdb_x108_1 | 7.1 | 442 | 6.33 |
mitdb__100_180_1 | 7.29 | 554 | 6.34 |
chfdb_chf01_275_1 | 1.75 | 443 | 2.91 |
ltstdb_20221_43_1 | 1.65 | 3.61 | 1.57 |
mitdbx_108 | 324 | 7162 | 322 |
qtdbsele0606 | 13.43 | 851 | 7.89 |
chfdbchf15 | 49.42 | 2535 | 47.41 |
ann-gun | 13.29 | 1364 | 25.86 |
patient respiration | 14.29 | 531 | 3.63 |
dutch power demand | 15.3 | 9981 | 1042 |
gps trajectory | 115 | 9302 | 206.27 |
Mean | 47.47 | 2810.31 | 140.15 |
Make sure you have Python 3.6 or later.
To install the requirements, run: pip install -r requirements.txt
On univariate datasets: **python test_discord.py**
The test results are written to the result folder. The result file contains:
- The execution time on the dataset
- The F1-score of each method
- The best hyperparameters for each dataset and each method
Note: Details on the characteristics of the datasets and the hyperparameters we found are summarized in the file [summary_of_the_experiment.pdf].
C.-C. M. Yeh, Y. Zhu, L. Ulanova, N. Begum, Y. Ding, H. A. Dau, D. F. Silva, A. Mueen, and E. Keogh, “Matrix Profile I: All pairs similarity joins for time series: A unifying view that includes motifs, discords and shapelets,” in 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 1317–1322, 2016.
Z. Zimmerman et al., "Matrix Profile XVIII: Time Series Mining in the Face of Fast Moving Streams using a Learned Approximate Matrix Profile," 2019 IEEE International Conference on Data Mining (ICDM), 2019, pp. 936-945, doi: 10.1109/ICDM.2019.00104.
T. Nakamura, M. Imamura, R. Mercer, and E. Keogh, “MERLIN: Parameter-free discovery of arbitrary length anomalies in massive time series archives,” in IEEE International Conference on Data Mining (ICDM), pp. 1190–1195, 2020.
P. M. Chau, B. M. Duc, and D. T. Anh, “Discord discovery in streaming time series based on an improved HOT SAX algorithm,” in Proceedings of the Ninth International Symposium on Information and Communication Technology, SoICT 2018, (New York, NY, USA), pp. 24–30, Association for Computing Machinery, 2018.
M. Salehi and L. Rashidi, “A survey on anomaly detection in evolving data,” ACM SIGKDD Explorations Newsletter, vol. 20, no. 1, pp. 13–23, 2018.
V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Computing Surveys, vol. 41, no. 3, 2009.