Previously trained messages cannot always be matched
Impelon opened this issue · comments
Observe the following MWE (using the current version 0.9.7
):
>>> from drain3 import *
>>> parser = TemplateMiner()
>>> parser.add_log_message("training4Model start")
{'change_type': 'cluster_created', 'cluster_id': 1, 'cluster_size': 1, 'template_mined': 'training4Model start', 'cluster_count': 1}
>>> parser.add_log_message("loadModel start")
{'change_type': 'cluster_template_changed', 'cluster_id': 1, 'cluster_size': 2, 'template_mined': '<*> start', 'cluster_count': 1}
>>> parser.add_log_message("loadModel stop")
{'change_type': 'cluster_created', 'cluster_id': 2, 'cluster_size': 1, 'template_mined': 'loadModel stop', 'cluster_count': 2}
>>> parser.match("loadModel start") is not None # This message was seen during training, so it should match.
False
Since loadModel start
was previously passed in to parser.add_log_message
, it should be possible to match it later in the "interference" mode.
I would expect previously trained messages to always match later on.
As illustrated above, this cannot be guaranteed for arbitrary messages.
I've analyzed what is happening by looking through the source code and came to the following explanation:
- When
parser.match("loadModel start")
is called, there are 2 clusters present with the following templates:<*> start
: This is the cluster that includes the first occurrence ofloadModel start
.loadModel stop
- Thus when
tree_search
is called,loadModel
matches the prefix of 2nd template and 1st template is completely disregarded. (line 131) - Obviously
loadModel start
does NOT exactly matchloadModel stop
and as such no match is found. (line 134)
https://github.com/IBM/Drain3/blob/6b724e7651bca2ac418dcbad01602a523a70b678/drain3/drain.py#L121-L135
The reason the first template was modified to include a wildcard (which later prevents the log message from matching), is that training4Model
includes a number and thus will replaced by a wildcard if no other prefix matches.
See: https://github.com/IBM/Drain3/blob/6b724e7651bca2ac418dcbad01602a523a70b678/drain3/drain.py#L196-L199
The following example works perfectly fine therefore:
>>> from drain3 import *
>>> parser = TemplateMiner()
>>> parser.add_log_message("trainingModel start")
{'change_type': 'cluster_created', 'cluster_id': 1, 'cluster_size': 1, 'template_mined': 'trainingModel start', 'cluster_count': 1}
>>> parser.add_log_message("loadModel start")
{'change_type': 'cluster_created', 'cluster_id': 2, 'cluster_size': 1, 'template_mined': 'loadModel start', 'cluster_count': 2}
>>> parser.add_log_message("loadModel stop")
{'change_type': 'cluster_template_changed', 'cluster_id': 2, 'cluster_size': 2, 'template_mined': 'loadModel <*>', 'cluster_count': 2}
>>> parser.match("loadModel start") is not None # This message was seen during training, so it should match.
True
I'm not sure how this should be optimally handled.
Indeed an interesting edge case. I am not sure how acute is it, since another call to parser.add_log_message("loadModel start")
would fix it too, and mostly you get many logs of the same type so you can say that match()
is eventually consistent. But I guess it depends on the use case.
I committed two possible ways to solve it:
- Option to disable treating numeric tokens as parameters. This will make Drain treat
training4Model
as a non-parametric token liketrainingModel
. But this is only a patch because it might also occur if the masker decides to parameterize some token. Nevertheless, I thinkparametrize_numeric_tokens
should be disabled if you have good masking instruction coverage. - A more complete solution - added strategy to
match()
function, to fallback to full cluster search in case of no match. This may cost in performance of thematch()
but is good for a case in which match accuracy is super important. It's possible to implement a recursive tree_search (first try exact token match and fallback to wildcard match). This will be both accurate and more efficient than the linear full search, but I am unable to implement if now due to lack of time.
WDYT?
Thank you for the quick response and improvements made!
It's great to have more flexibility.
- Option to disable treating numeric tokens as parameters. [...] Nevertheless, I think
parametrize_numeric_tokens
should be disabled if you have good masking instruction coverage.
I believe this would indeed solve the problem in my case.
- A more complete solution - added strategy to
match()
function, to fallback to full cluster search in case of no match. This may cost in performance of thematch()
but is good for a case in which match accuracy is super important.
In my case I it is important that I don't modify existing templates after the training phase. The most I could do in case match()
fails is to add a new template. For me the new fallback
strategy would be a good fit, just as an extra precaution in case setting parametrize_numeric_tokens = False
does not resolve everything.
Just as an extra note, I think the current documentation for the always
strategy is a bit misleading:
https://github.com/IBM/Drain3/blob/8c1e94d74b682b38d18b8091480c58f762acd491/drain3/drain.py#L357-L359
As far as I understand it suffers from the same problem as the fallback
-strategy in the sense that a
non-optimal match with more wildcard parameters than necessary
can be found, and this is even more likely here since a restricted search was not run before.
Another thing to note is that fast_match
does not necessarily require a list of IDs, just a sequence.
So just passing self.id_to_cluster.keys()
will be more memory efficient and slightly faster because no list is created, but I suspect that there are few situations where this will actually makes a difference.
https://github.com/IBM/Drain3/blob/8c1e94d74b682b38d18b8091480c58f762acd491/drain3/drain.py#L373-L376
always
should always return the best match as opposed to fallback
, because fallback
will return the best match only among the clusters in the same leaf node, in case tree search found a match. always
compares all clusters with the same sequence length, starting from the root. Tried to clarify that in the docs.
Thanks for the comment regarding list
cast - removed it.
I see now I have a bug in always
, working on a fix.
Fixed.
always
should always return the best match as opposed tofallback
, becausefallback
will return the best match only among the clusters in the same leaf node, in case tree search found a match.always
compares all clusters with the same sequence length, starting from the root. Tried to clarify that in the docs.
Got it now! Thanks again for the quick fixes.
Changes are available at v0.9.8, released today.