logpai / Drain3

A robust streaming log template miner based on the Drain algorithm

`get_parameters_list` can return incorrect parameters

Impelon opened this issue · comments

I'm back again with another inconsistency.
Observe the following example:

>>> from drain3 import *
>>> parser = TemplateMiner()
>>> template = "<hdfs_uri>:<number>+<number>"
>>> content = "hdfs://msra-sa-41:9000/pageinput2.txt:671088640+134217728"
>>> parser.get_parameter_list(template, content)
['hdfs', '//msra-sa-41:9000/pageinput2.txt:671088640', '134217728']

Now of course this arises in a context where I use some custom masking patterns.
The expected parameter-list according to those masking patterns would be:
['hdfs://msra-sa-41:9000/pageinput2.txt', '671088640', '134217728']
but get_parameter_list does not take that into account.

I'll give another more concise example, to demonstrate why this fails:

>>> parser.config.masking_instructions = [masking.MaskingInstruction(r"\d+\.\d+", "float")]
>>> parser.get_parameter_list("<float>.<*>", "0.15.Test")
['0', '15.Test']

The problem is that the delimiter between these two parameters (.) is also part of the desired first parameter (0.15).
I gave it some thought, and I think this implies the problem can only occur with custom masking patterns:
Under normal circumstances Drain would not produce a template where two parameters are separated by a delimiter other than a space. Since a parameter can only be a single token, it cannot contain spaces, so the problem above does not occur.
(This might be a different story for extra_delimiters, but for the simple examples I can think of there shouldn't be any problems with that either.)
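
To make the failure mode concrete, here is a minimal standalone sketch. It does not call Drain3; the helper name extract_non_greedy is hypothetical and only approximates the idea of replacing every mask with a non-greedy (.*?) group, under the assumption that masks look like <name>.

```python
import re

# Matches a mask placeholder such as <float>, <number> or <*>.
MASK = re.compile(r"<[^<>]+>")


def extract_non_greedy(template, content):
    """Hypothetical helper: replace each mask with a lazy wildcard group."""
    # Escape the literal text between masks so "." and "+" are matched verbatim.
    literals = [re.escape(part) for part in MASK.split(template)]
    pattern = "(.*?)".join(literals)
    match = re.fullmatch(pattern, content)
    return list(match.groups()) if match else []


# The lazy first group stops at the first ".", which splits the float in two.
print(extract_non_greedy("<float>.<*>", "0.15.Test"))
# ['0', '15.Test']
```

The same helper reproduces the HDFS result from the first example, because the first lazy group stops at the first ":" in the URI.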


One solution would be to use the masking patterns to extract any parameters first and then apply the regular parameter extraction.
I'm working on a solution using this idea, but it's not ready yet, as it is a bit challenging to preserve the correct order.

Alternatively one could include the masking-pattern in the mask, e.g. <float|\d+\.\d+>.
Then one could use these patterns instead of (.*?):
https://github.com/IBM/Drain3/blob/6fd6117859f45560f0e576ffcbcc63863d65bdde/drain3/template_miner.py#L181
But this would mean that regexes need to be present in log-templates which is obviously less readable.
If the mask_with-attributes were unique across all MaskingInstruction-objects, one could simply use the mask to determine the required pattern, but at the moment users are free to assign multiple MaskingInstructions with the same mask_with-value.

Now since the masking patterns would need to be evaluated twice if you'd want to get the template of a log message and also the corresponding parameters, one could think about (optionally) including the parameters in the return-values of add_log_message(...) and match(...) directly. But that would also require changing multiple methods in drain.py so that would be more cumbersome.

@Impelon Many thanks for the detailed report and for your pull request.
I am not able to accept your PR #50 yet, as it seems to fail on the examples you provided yourself.
Instead, as a temporary solution, I attempted to improve the current code and now I believe it is able to handle the cases you provided. Please see PR #51 and the test cases I added and tell me what you think.

Basically, I changed the mask matching from non-greedy to greedy. I am aware that it is not a full solution, as counterexamples can be presented in which non-greedy matching is required. However, I believe that it's better suited for the common case.

BTW I tried to use the code from your PR and the first 2 test cases from test_get_param_list_direct() failed.

I think that it is not possible to provide a full solution for this without having Drain3 use the actual masking instructions when matching.

For example:

template = "<float>.<*>.<float>"
content = "0.15.Test.0.2"
params = template_miner.get_parameter_list(template, content)
expected_params = ["0.15", "Test", "0.2"]
self.assertListEqual(params, expected_params)

Unless Drain3 knows what a float is, it cannot extract parameters correctly.
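
As a rough illustration of this point (a sketch using only the standard re module, not Drain3's code), the snippet below compares a greedy wildcard with a matcher that is told the regex behind the float mask. The helper extract and the KNOWN_PATTERNS mapping are hypothetical names introduced for the example.

```python
import re

# Capturing group, so re.split alternates literal text and mask names.
MASK = re.compile(r"<([^<>]+)>")

# Assumption for this sketch: the regex behind the <float> mask is known.
KNOWN_PATTERNS = {"float": r"\d+\.\d+"}


def extract(template, content, group_for_mask):
    """Hypothetical helper: build one regex from the template and match it."""
    parts = MASK.split(template)  # literal, mask, literal, mask, ..., literal
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)  # literal text between masks
        else:
            pattern += "(" + group_for_mask(part) + ")"  # one group per mask
    match = re.fullmatch(pattern, content)
    return list(match.groups()) if match else []


template = "<float>.<*>.<float>"
content = "0.15.Test.0.2"

# Greedy wildcards: the first group swallows too much.
print(extract(template, content, lambda name: ".*"))
# ['0.15.Test', '0', '2']

# Pattern-aware: known masks use their own regex, others fall back to lazy.
print(extract(template, content, lambda name: KNOWN_PATTERNS.get(name, ".*?")))
# ['0.15', 'Test', '0.2']
```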

I agree with the two long-term solution approaches you mentioned yourself.
I think that the second one (extracting parameters while mining) is better as it should be more efficient not having to match regexes twice. If you can contribute either of those it would be extremely welcome.
For the first approach, the issue of non-unique mapping from a mask name, e.g. <NUM>, to a regex can be resolved using an or (|) operator in the mask matching regex, and users may choose unique mask names if they want to avoid that.
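
The or-joining idea could be sketched roughly as follows. This is an assumption-laden illustration, not the code that was later merged: the (regex, mask name) pairs in INSTRUCTIONS and the helpers build_mask_patterns and extract_with_masks are hypothetical stand-ins for masking instructions that share a mask name.

```python
import re
from collections import defaultdict

# Hypothetical (regex, mask name) pairs; two of them share the name NUM.
INSTRUCTIONS = [
    (r"0x[0-9a-fA-F]+", "NUM"),
    (r"\d+", "NUM"),
    (r"\d+\.\d+", "float"),
]

MASK = re.compile(r"<([^<>]+)>")


def build_mask_patterns(instructions):
    """Join all regexes that share a mask name with the | operator."""
    grouped = defaultdict(list)
    for regex, mask_name in instructions:
        grouped[mask_name].append("(?:" + regex + ")")
    return {name: "|".join(regexes) for name, regexes in grouped.items()}


def extract_with_masks(template, content, instructions):
    """Match the template, using the joined regex for every known mask."""
    patterns = build_mask_patterns(instructions)
    parts = MASK.split(template)  # literal, mask, literal, mask, ..., literal
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)
        else:
            # Unknown masks such as <*> fall back to a lazy wildcard.
            regex += "(" + patterns.get(part, ".*?") + ")"
    match = re.fullmatch(regex, content)
    return list(match.groups()) if match else []


print(extract_with_masks("<NUM> items at <NUM>", "42 items at 0x1F", INSTRUCTIONS))
# ['42', '0x1F']
```

Wrapping each alternative in a non-capturing group keeps one capturing group per mask, so the returned list still lines up with the masks in the template.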

@davidohana Thanks as always for the quick response!

> Basically, I changed the mask matching from non-greedy to greedy. I am aware that it is not a full solution, as counterexamples can be presented in which non-greedy matching is required. However, I believe that it's better suited for the common case.

I believe you are right in the sense that the solution in #51 may be better suited in the most common cases.
Unfortunately for my application I need to introduce quite a few complex masking-patterns, some of which do contain spaces.
I believe that will not work with the change from #51.

> I think that it is not possible to provide a full solution for this without having Drain3 use the actual masking instructions when matching.
> [...]
> Unless Drain3 knows what a float is, it cannot extract parameters correctly.

I agree, indeed this is what my proposal in #50 tries to do.
I've included your tests from #51 and also added the MaskingInstruction-objects required for the new method to work.

> BTW I tried to use the code from your PR and the first 2 test cases from test_get_param_list_direct() failed.

You are right; even with the correct MaskingInstruction-objects added, the method from 879593d failed to extract the correct parameters, because the temporary masks added did interfere with other masking-patterns.

Edit: I've run into multiple problems using the approach from #50 and decided to scrap it after all.
With your tests and feedback I've now been able to improve upon #50.
The method in 26399e4 is able to pass all new (and old) testcases with valid templates.
Please see my comments in #50 for more information on what changed.

> For the first approach, the issue of non-unique mapping from a mask name, e.g. <NUM>, to a regex can be resolved using an or (|) operator in the mask matching regex, and users may choose unique mask names if they want to avoid that.

I think this is also a good idea and I believe it to be a good alternative to the proposed changes in #50.
Either #50 or this approach would be a partial solution to the problem that works in most cases.
I agree with you that the best long-term solution is to extract parameters while mining.

@davidohana I apologize for the many revisions of #50.
The idea behind #50 was too fragile and led to confusing code.

I actually found another situation in which get_parameter_list performs poorly:

>>> parser.get_parameter_list("<memory:8>", "<memory:<number>>")
[]

I came to the conclusion that your proposed solution to use | to join multiple patterns with the same mask will be an easier and more elegant solution.
I've implemented this solution in #52, added the tests from #51 and a few new tests.
The great news is that this solution handles all test-cases and problems I found so far!

#52 merged, will be included in 0.9.9 release of Drain3.