loglizer

Loglizer is a machine learning-based log analysis toolkit for automated anomaly detection. Logs are essential to the development and maintenance of many software systems. They record detailed runtime information that allows developers and support engineers to monitor their systems and diagnose anomalous behaviors and errors. Loglizer implements a set of automated log analysis techniques for anomaly detection.

🔭 If you use loglizer in your research for publication, please cite the following paper.

Shilin He, Jieming Zhu, Pinjia He, Michael R. Lyu. Experience Report: System Log Analysis for Anomaly Detection, IEEE International Symposium on Software Reliability Engineering (ISSRE), 2016. [Bibtex]

FAQ

We are refactoring our project so that the related pieces of loghub, logparser, and loglizer work together, but this will take some time. We receive many enquiries about the loglizer demo, especially about the data. The following is a quick reference for obtaining the structured input data before we release the next version.

Where is the data?

We have uploaded all the available log data to loghub. The raw logs and label information can be downloaded from the Zenodo link there. Note that the raw logs need to be parsed to generate the structured data for loglizer.

I cannot find 'rm_repeat_rawTFVector.txt' and 'rm_repeat_mlabel.txt'?

'rm_repeat_mlabel.txt' has been renamed to 'anomaly_labels.csv' in the loghub dataset. 'rm_repeat_rawTFVector.txt' contains the feature vectors, which are obtained by 1) parsing the HDFS log, 2) grouping log events into sequences by session window using blk_id, and 3) counting the frequency of each event within each session window. Each resulting feature vector is one row of 'rm_repeat_rawTFVector.txt'.
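The three steps above can be sketched as follows. This is a minimal illustration, not the code used to produce 'rm_repeat_rawTFVector.txt': the function name `session_count_vectors`, the `(event_id, content)` row layout, the block id pattern, and the toy rows are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed pattern for HDFS block ids such as "blk_123" or "blk_-123".
BLOCK_ID = re.compile(r"blk_-?\d+")


def session_count_vectors(rows):
    """Group parsed log events by block id and count events per session.

    rows: iterable of (event_id, content) pairs, a hypothetical layout for
    a parsed log. Returns (events, vectors), where events is the sorted
    list of event ids and vectors maps each block id to its count vector.
    """
    sessions = {}
    for event_id, content in rows:
        # A message may mention several blocks; count it once per block.
        for blk in sorted(set(BLOCK_ID.findall(content))):
            sessions.setdefault(blk, []).append(event_id)

    events = sorted({e for seq in sessions.values() for e in seq})
    vectors = {}
    for blk, seq in sessions.items():
        counts = Counter(seq)
        vectors[blk] = [counts[e] for e in events]
    return events, vectors


# Toy rows invented for illustration; they are not real HDFS log lines.
toy_rows = [
    ("E1", "open blk_1"),
    ("E2", "write blk_1"),
    ("E1", "open blk_-2"),
    ("E2", "write blk_1"),
    ("E3", "close blk_-2"),
]

events, vectors = session_count_vectors(toy_rows)
print(events)
for blk, vec in vectors.items():
    print(blk, vec)
```

Each value in `vectors` plays the role of one row of the feature file, with columns ordered as in `events`.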

Where to find BGL_MERGED.log and logTemplateMap.csv?

The BGL log can be downloaded from loghub. logTemplateMap.csv is available at https://github.com/logpai/logparser/tree/master/logs/BGL as "BGL_templates.csv".

Framework

The log analysis framework for anomaly detection usually comprises the following components:

Log collection: Logs are generated at runtime and aggregated into a centralized place with a data streaming pipeline, such as Flume and Kafka.
Log parsing: Logs are naturally unstructured. The goal of log parsing is to convert unstructured log messages into a sequence of structured events, based on which sophisticated machine learning models can be applied. The details of log parsing can be found at our logparser project.
Feature extraction: Structured logs can be sliced into separate log sequences using interval windows, sliding windows, or session windows. Each log sequence is then vectorized into a feature representation, for example an event count vector.
Anomaly detection: Anomaly detection models are trained to check whether a given feature vector is an anomaly or not.
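As an illustration of the feature extraction step, the sketch below slices timestamped events into sliding windows and turns each window into an event count vector. It is not loglizer's implementation: the function names `sliding_windows` and `count_vectors`, the `(timestamp, event_id)` input layout, and the toy data are assumptions. Setting the step size equal to the window size gives non-overlapping interval windows.

```python
from collections import Counter


def sliding_windows(events, window_size, step_size):
    """Slice events into time windows.

    events: list of (timestamp_seconds, event_id), sorted by timestamp
    (a hypothetical layout). Each window covers [start, start + window_size).
    """
    if step_size <= 0:
        raise ValueError("step_size must be positive")
    if not events:
        return []
    start, last = events[0][0], events[-1][0]
    windows = []
    while start <= last:
        end = start + window_size
        windows.append([e for t, e in events if start <= t < end])
        start += step_size
    return windows


def count_vectors(windows):
    """Turn each window into an event count vector over a shared vocabulary."""
    vocabulary = sorted({e for w in windows for e in w})
    vectors = []
    for w in windows:
        counts = Counter(w)
        vectors.append([counts[e] for e in vocabulary])
    return vocabulary, vectors


# Toy events invented for illustration.
toy_events = [(0, "E1"), (1, "E2"), (2, "E1"), (5, "E3")]

windows = sliding_windows(toy_events, window_size=3, step_size=2)
vocabulary, vectors = count_vectors(windows)
print(vocabulary)
for vec in vectors:
    print(vec)
```

The resulting vectors are the kind of input an anomaly detection model would be trained on.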

Models

Anomaly detection models currently available:

Model	Paper reference
Supervised models
LR	[EuroSys'10] Peter Bodík, Moises Goldszmidt, Armando Fox, Hans Andersen. Fingerprinting the Datacenter: Automated Classification of Performance Crises. [Berkeley, Microsoft, Cornell]
Decision Tree	[ICAC'04] Mike Chen, Alice X. Zheng, Jim Lloyd, Michael I. Jordan, Eric Brewer. Failure Diagnosis Using Decision Trees. [Berkeley, eBay]
SVM	[ICDM'07] Yinglung Liang, Yanyong Zhang, Hui Xiong, Ramendra Sahoo. Failure Prediction in IBM BlueGene/L Event Logs. [Rutgers University, IBM]
Unsupervised models
Clustering	[ICSE'16] Qingwei Lin, Hongyu Zhang, Jian-Guang Lou, Yu Zhang, Xuewei Chen. Log Clustering based Problem Identification for Online Service Systems. [Microsoft]
PCA	[SOSP'09] Wei Xu, Ling Huang, Armando Fox, David Patterson, Michael I. Jordan. Detecting Large-Scale System Problems by Mining Console Logs. [Berkeley, Intel]
Invariants Mining	[ATC'10] Jian-Guang Lou, Qiang Fu, Shengqi Yang, Ye Xu, Jiang Li. Mining Invariants from Console Logs for System Problem Detection. [Microsoft, BUPT, NJU]

Log data

We have released a variety of log datasets in loghub for research purposes. If you are interested in these datasets, please request the logs through the link.

Usage

(Under construction) Please follow the demo in the docs to get started.

Contributors

Shilin He, The Chinese University of Hong Kong
Jieming Zhu, The Chinese University of Hong Kong, currently at Huawei Noah's Ark Lab
Pinjia He, The Chinese University of Hong Kong, currently at ETH Zurich

Feedback

For any questions or feedback, please post to the issue page.

History

May 14, 2016: initial commit
Sep 21, 2017: update code and readme
March 21, 2018: rewrite most of the code and add detailed comments
Dec 15, 2018: restructure the repository with hands-on demo
