logpai / Drain3

A robust streaming log template miner based on the Drain algorithm

Chinese and English hybrid log template mining

0ptimista opened this issue

Is there a way to mine log templates from logs that mix Chinese and English? For example, this log:

[2023-04-21 10:44:52,281][work-request-pool-38][myservice][engine.db.service.OflLTransListService:57][save][INFO ] =>保存:{“Name":"张三","ltermNo":"NH36BILD","lvouchNo":"102755","orderId":"565165056025","oriAmount":50000,"positionInfo":"longitude=101.1.10937&latitude=23426.8491705&address=云南省丽江市&trans_ip=139.14.18.96","printMName":"小型","status":"0","Time":1682045092000},耗时: 300ms

The template Drain3 currently produces correctly identifies the IP address and most of the variables, but it treats the Chinese text as part of the template. For example, the template contains "耗时: 300ms" verbatim, rather than "耗时: NUMms".

@0ptimista Hello, this is a problem in the masking (preprocessing) phase. You should consider providing a specific or extended regex, e.g. one with an optional (ms)? suffix, for this case. The default regex will not mask the 300ms part because it is not a standalone NUM but a NUMms.

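Something along these lines should work (a rough, untested sketch: the pattern, the config file name, and the exact masking API names are from memory, so adjust them to your Drain3 version):

```python
# Rough sketch: extend Drain3's masking so "300ms" is masked rather than kept literal.
# The regex and the config file name are illustrative; adapt them to your setup.
from drain3 import TemplateMiner
from drain3.template_miner_config import TemplateMinerConfig
from drain3.masking import MaskingInstruction

config = TemplateMinerConfig()
config.load("drain3.ini")  # start from your existing config, if any

# Mask a number with an optional "ms" suffix so "300ms" becomes a NUM token.
config.masking_instructions.append(MaskingInstruction(r"\d+(ms)?", "NUM"))

template_miner = TemplateMiner(config=config)
result = template_miner.add_log_message("=>保存:{...},耗时: 300ms")
print(result["template_mined"])  # the 300ms token should now show up masked (e.g. as <NUM>)
```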

But indeed, the current algorithm implementation does not treat a token containing digits as a variable. As a result, two otherwise similar logs that differ only in a NUMms token get a lower similarity score, even though such tokens should not contribute significantly to the similarity calculation. It is mentioned in the original paper (DAGDrain) though, so this will be implemented in a future release as part of the dynamic similarity threshold feature (to replace the default 0.4).
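To illustrate the effect (a toy version of the token-by-token similarity idea, not the actual Drain3 code):

```python
# Toy illustration, not Drain3's implementation: token-by-token similarity
# between two same-length token sequences, where digit-bearing tokens are
# compared literally instead of being treated as wildcards.
def seq_similarity(tokens_a, tokens_b):
    matches = sum(1 for a, b in zip(tokens_a, tokens_b) if a == b)
    return matches / len(tokens_a)

a = ["保存", "耗时:", "300ms"]
b = ["保存", "耗时:", "450ms"]
print(seq_similarity(a, b))  # ~0.67: the 300ms/450ms token drags the score
                             # down even though only the number changed
```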

Thanks for replying!

I think I didn't state my question properly. The solution for 300ms is very straightforward. What I am trying to ask is: is there a way to keep certain Chinese phrases such as '保存' and '耗时' as part of the log template, since they are unlikely to change in the future, while a phrase like '张三' in "Name":"张三" is clearly a variable and should be masked?

@0ptimista If I understand correctly, I believe this comes down to Chinese word segmentation: unlike English, which can simply be split on blank spaces, Chinese text needs an extra layer of processing with a tool such as Jieba (https://github.com/fxsjy/jieba) to turn a continuous chunk of characters into a correct sequence of tokens.
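For example (a minimal illustration of the segmentation idea, assuming Jieba is installed; Drain3 does not do this for you):

```python
# Minimal illustration: pre-segment a mixed Chinese/English line with Jieba
# before handing it to Drain3. This is a sketch of the idea, not a Drain3 feature.
import jieba

line = '=>保存:{"Name":"张三"},耗时: 300ms'
segmented = " ".join(jieba.cut(line))
print(segmented)
# Chinese runs such as 保存 and 耗时 now arrive as separate, space-delimited
# tokens that Drain3 can compare like English words.
```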

BUT, there's a but: a full segmentation solution could hurt performance significantly (it is needed when you don't know in advance what will appear in the logs). Another approach, when you do know a dictionary of the words that can appear in the logs, is to implement an additional set of regexes that turn any known Chinese phrase into a token surrounded by blanks; the logs will then be clustered properly.
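Roughly like this (a hypothetical sketch of one way to apply the idea; the phrase list and helper name are purely illustrative, not part of Drain3):

```python
# Hypothetical sketch of the dictionary approach: pad known Chinese phrases
# with blanks so Drain3 sees them as standalone tokens it can keep literally
# in the template. Nothing here is part of Drain3 itself.
import re

KNOWN_PHRASES = ["保存", "耗时"]  # phrases that should stay literal in the template
PHRASE_RE = re.compile("|".join(map(re.escape, KNOWN_PHRASES)))

def pre_tokenize(line: str) -> str:
    # Surround each known phrase with spaces so it becomes its own token.
    return PHRASE_RE.sub(lambda m: f" {m.group(0)} ", line)

print(pre_tokenize('=>保存:{"Name":"张三"},耗时: 300ms'))
# feed the result to template_miner.add_log_message(...)
```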

Would you like to try it and provide an implementation?

Sure. I'm using Drain3 to mine my logs; if a solution solves my original problem, I'll post it here.

Thanks for helping!