mozilla-services / hindsight

Hindsight - light weight data processing skeleton

Investigate ways to reduce the analysis message matcher overhead

trink opened this issue

Some Approaches to Test

  • In-line all the matchers before performing any analysis
    • result: added complexity and required lua_sandbox API changes without demonstrating general benefit to most of the current use cases.
  • Hash router: analyze the matchers and create a hash table lookup for all matchers keying off a particular header/field, e.g. Logger ==
  • Tree router: analyze the matchers and create a hierarchy of matchers so that entire groups of matchers can be eliminated by a single match
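To make the hash router idea concrete, here is a minimal sketch in Python. It is not Hindsight or lua_sandbox code: the `HashRouter` class, its method names, and the dict-based message are all hypothetical, and it assumes each matcher can be reduced to an optional exact Logger value plus a residual predicate.

```python
from collections import defaultdict


class HashRouter:
    """Toy hash router (hypothetical, not the Hindsight implementation).

    Matchers that start with an exact `Logger == value` test are stored in a
    dict keyed by that value; everything else lands in a fallback list that
    is always scanned.
    """

    def __init__(self):
        self.by_logger = defaultdict(list)  # Logger value -> [(name, predicate)]
        self.fallback = []                  # matchers with no Logger equality

    def add(self, name, predicate, logger=None):
        # `predicate` is the rest of the matcher once the Logger test is removed.
        if logger is None:
            self.fallback.append((name, predicate))
        else:
            self.by_logger[logger].append((name, predicate))

    def route(self, msg):
        # One dict lookup selects the candidate set; matchers keyed on other
        # Logger values are never evaluated for this message.
        candidates = self.by_logger.get(msg.get("Logger"), []) + self.fallback
        return [name for name, predicate in candidates if predicate(msg)]
```

With this layout the cost per message depends on how many matchers share the message's Logger value plus the size of the fallback list, rather than on the total number of matchers.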

This is a work in progress; experimentation will continue as the schedule allows.

+1 :)
I have very long message matchers which are bottlenecks.

There may be some things you can do to optimize specific matchers (order of expressions and types of comparisons). If you can share some problem matchers I will take a look.

The goal of the remaining items above is to handle many matchers faster (by clustering), so the large matchers would have to share some conditional expressions that could fail the entire set fast (i.e., if they are relatively unique, this experimentation will not help).
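A rough sketch of the clustering idea, in Python for illustration only. It assumes every matcher is a conjunction of simple `field == value` tests and groups matchers by their first test; `build_tree`, `holds`, and `match` are hypothetical names, not part of Hindsight or lua_sandbox.

```python
from collections import defaultdict


def build_tree(matchers):
    """Group matchers by their first condition.

    matchers: list of (name, [(field, value), ...]); each matcher is treated
    as an AND of equality tests (a simplifying assumption for this sketch).
    """
    tree = defaultdict(list)
    for name, conds in matchers:
        head, rest = conds[0], conds[1:]
        tree[head].append((name, rest))
    return tree


def holds(cond, msg):
    field, value = cond
    return msg.get(field) == value


def match(tree, msg):
    """Return (matching names, number of condition checks performed)."""
    hits, checks = [], 0
    for head, group in tree.items():
        checks += 1
        if not holds(head, msg):
            continue  # one failed shared test eliminates the whole group
        for name, rest in group:
            ok = True
            for cond in rest:
                checks += 1
                if not holds(cond, msg):
                    ok = False
                    break
            if ok:
                hits.append(name)
    return hits, checks
```

If every matcher has a different first condition, each group holds a single matcher and nothing is saved, which is the "relatively unique" case described above.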

If I understand correctly, this optimization applies only to analysis plugins (with all threads sharing the same message_matcher)?
In my case I have an output plugin with a long message matcher string:
message_matcher = "(Type =~ '/AAAAA$' || Type =~ '/BBBBB$' || [ ... ] )"
For now I have solved the bottleneck by splitting it into multiple plugin instances.
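As an illustration of that workaround, the helper below (hypothetical, written in Python purely to generate the strings) distributes the alternatives of one long matcher across several shorter matcher strings, one per plugin instance. The `'/CCCCC$'` pattern in the usage example is a made-up placeholder, not one of the elided patterns from the original matcher.

```python
def split_matcher(patterns, instances, field="Type"):
    """Build one message_matcher string per plugin instance.

    patterns:  regex-style alternatives, e.g. ['/AAAAA$', '/BBBBB$']
    instances: number of plugin instances to spread them over
    Alternatives are assigned round-robin; empty chunks are dropped.
    """
    chunks = [patterns[i::instances] for i in range(instances)]
    return [
        "(" + " || ".join(f"{field} =~ '{p}'" for p in chunk) + ")"
        for chunk in chunks
        if chunk
    ]


if __name__ == "__main__":
    for matcher in split_matcher(["/AAAAA$", "/BBBBB$", "/CCCCC$"], 2):
        print(f'message_matcher = "{matcher}"')
```

Each generated string would go into its own plugin configuration, so every instance evaluates only a fraction of the alternatives.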

Yeah mozilla-services/lua_sandbox#213 is about all I can squeeze out of a single matcher.

mozilla-services/lua_sandbox#208 may be relevant if the string at the end is unique so you don't need to actually anchor it. Type =~ '/unique' is multiple times faster than Type =~ '/unique$'
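Dropping the anchor is only safe when the string cannot appear anywhere except at the end of the field. One way to check that against a sample of real Type values is sketched below in Python; `safe_to_drop_anchor` is a hypothetical helper and says nothing about how the matcher itself is implemented.

```python
def safe_to_drop_anchor(suffix, samples):
    """Return True if 'contains suffix' and 'ends with suffix' agree on every sample.

    When they agree on representative data, Type =~ '/unique' and
    Type =~ '/unique$' select the same messages for that data.
    """
    return all((suffix in s) == s.endswith(suffix) for s in samples)
```

A `False` result means at least one sample contains the string somewhere other than the end, so the unanchored matcher would accept messages the anchored one rejects.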

Yeah, I tested it and it's faster without the trailing "$".
That's surprising: one would usually expect the "$" version to be faster, since only the end of the string has to be checked rather than the whole string!