shinken-monitoring / mod-livestatus

Shinken module for presenting data through an MK/Livestatus compatible interface

Performance Issue

olivierHa opened this issue

A ticket to track my issue:

As already discussed on IRC, I'm seeing a "huge" performance issue with the newer livestatus and Shinken 2.4.

I've downgraded the livestatus module to the same commit "21ddfb567f8996a5efd94000104d5b0596c91174" as before (i.e. with Shinken 2.0.3) and it works well with Shinken 2.4. That leaves 4 pages of commits to review :)

I will try to cherry-pick commit by commit and see...

e9c08c7

This commit seems "safe" to use with Shinken 2.4 (no performance issue, and no modification needed to import it).

Let's continue :)

The last "working" commit is b9f2883

Despite its name, I think d61b659 is a performance regression.

Well, the commit is big :/. Ping @gst, he may be the best one to find what's wrong :)

Anyway, try to fill in more detail so that @gst can dig into this. What is the perf issue?
Do you have a sample query that takes too long to answer?

Like

echo -e "GET hosts\n\n" | nc localhost 50000

Livestatus/Thruk are unusable/unresponsive; queries take forever to respond (800 hosts/15k services).

Maybe a lock issue?

https://github.com/shinken-monitoring/mod-livestatus/blob/master/module/livestatus_client_thread.py#L65

Maybe this? On each LS request there is a copy of the broker db. It could be big and consume time, no?
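
Just to give an idea of what such a copy could cost, a rough illustration (fake data, not the actual broker db):

# Rough illustration only (fake dict, not the actual broker db): what a
# naive deep copy of a large structure can cost on every request.
import copy
import time

fake_db = {i: {"name": "host-%d" % i, "state": 0, "perf_data": [0.0] * 20}
           for i in range(20000)}

start = time.time()
snapshot = copy.deepcopy(fake_db)
print("deepcopy of %d entries took %.3fs" % (len(fake_db), time.time() - start))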

Hmm, so it's not a specific query that makes LS slow. I think the broker db is just the module instance, with nothing specific in it.

Did you manage to find a query that makes it slow?

IMO you should use the upstream version, because I think gst made some fixes after this, so we can debug it more easily.

I will test 1.4.1, but 1.4 isn't working. I don't see any changes between the two that could fix this issue, but why not :)

And for me every query is slow; it isn't specific to one query (like a query with history).

Hmm, the most efficient way may be to add some prints when parsing the query, etc., and just send the previous command and see where it is slow. Since you can reproduce it with every query, that's fine. But you have to be sure that you are the only one querying. I mean, if the LS is already choking under a huge amount of requests, it may not be that easy to find the "slow" part.
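
Something like this, where parse_query / execute_query / format_response are just placeholders for the real mod-livestatus functions (rough sketch only):

# Rough sketch only: the functions named in the comments are placeholders;
# the point is where to put the timing prints.
import time

def timed(label, func, *args, **kwargs):
    start = time.time()
    result = func(*args, **kwargs)
    print("%s took %.3fs" % (label, time.time() - start))
    return result

# in the request-handling path, for example:
# query = timed("parse", parse_query, raw_data)
# rows = timed("execute", execute_query, query)
# output = timed("format", format_response, rows)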

The upstream version is also very "slow".
To give you a concrete example, this query:

cat bin/check_services
GET services
Columns: last_state_change host_name description state acknowledged
Filter: state = 2
Filter: state = 3
Or: 2
Filter: acknowledged = 0
Filter: host_state = 0
Filter: in_check_period = 1
Filter: in_notification_period = 1
Filter: scheduled_downtime_depth = 0
Filter: notifications_enabled = 1

Using the upstream version:

time cat bin/check_services | nc 10.42.42.42 50000
[...]
real 0m15.909s
user 0m0.000s
sys 0m0.020s

Using commit "b9f2883e5f834e39634e02dca5d0bee69effc3b9":

time cat bin/check_services | nc 10.42.42.42 50000
[...]
real 0m0.950s
user 0m0.000s
sys 0m0.004s

You see the "performance" impact? :)

Great, thanks for the query and the commit/version & timing info :) that already gives a good starting point.
I'll first try to reproduce the issue and let you know..
What about your config? I saw somewhere about ~5k hosts and ~20k services.. is that it? Anything particular?
Also, what about the system load on the hosts/systems running the broker & scheduler when the query takes too long? And is it always reproducible, by the way?

I'll look into that more later though.

Indeed, 5k hosts/20k services, and yes always reproducible.

Did you try with and without threads on?

My first results show almost the same processing time in both versions, let's say 10-25% more for @master, but absolutely nothing like the ~1500% more in your result..

But I have everything running locally (and my laptop battery is nearly dead, I'm out at the moment, damn..) with 5k hosts & 20k services pointing to 127.0.0.1, the poller polling, etc. My system load was almost continuously > 3.5.. not the best conditions, but that tells me my config must not match yours closely enough (i.e. it's not only an effect of such a big config).

What is reassuring is that I got the same results in both versions; do you also get the same results in both versions?

Anyway: would it be possible to get your config, more or less? That would really ease my investigation, as I'd be sure I'm running the same config (or as close as possible).

What "tests" are you running on both instances ?
Could you try for example to launch 10 query (different or not) at the
same time ?
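
For example something quick like this (the host/port and the query are placeholders, adjust them to your setup):

# Fire 10 identical LQL queries in parallel; host/port and the query below
# are placeholders, adjust them to your setup.
import socket
import threading
import time

QUERY = "GET services\nColumns: host_name description state\n\n"

def run_query(host="10.42.42.42", port=50000):
    start = time.time()
    sock = socket.create_connection((host, port))
    sock.sendall(QUERY.encode())
    sock.shutdown(socket.SHUT_WR)   # tell livestatus we are done sending
    data = b""
    while True:
        chunk = sock.recv(65536)
        if not chunk:
            break
        data += chunk
    sock.close()
    print("%d bytes in %.2fs" % (len(data), time.time() - start))

threads = [threading.Thread(target=run_query) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()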

no "tests", I just let everything running, my scheduler.. schedules checks, the poller .. executes the checks, my broker receives the broks and process them.. and I then just execute your LS query the exact same way than you..

but I'll give another check in about 2 hour, when I'll have recovered my battery charger ;)

about your config, possible to get it more or less the same ?

@olivierHa

I gave it a second try (executed the query about 50 times in a row and averaged the results, which anyway were about the same for every execution, as hopefully expected) and got the same outcome: master is "only" 15-25% slower than b9f2883.. on my particular config (some services configured to succeed and others to fail). For info: the query produced up to 8k result rows (in both cases).
The processing time was on average 2, sometimes ~3 seconds, which, given the load (> 1-2, sometimes 3) generated by all the daemons running locally, isn't that bad..

I really need to know more about your config - at best to get a copy of it (sanitized if you wish, of course)..

A comment about the 15-25% difference I see for the master version vs b9f2883:

It can be explained relatively easily: the commit coming after b9f2883 (d61b659) effectively changed a lot of things.. (now I regret not having split it into at least 4-5 commits, but it won't be impossible to improve things).
One of the changes is how queries are actually executed. Basically, this is the kind of thing that happened before:

the function which parses the query calls the function which executes the query from a high-level viewpoint, which calls the function which sub-executes the query, which calls other sub-sub-functions, which ..

up to 3-4 or even 5 (or more) levels deep (depending on the query itself).

Each of these functions basically returned a list of the rows of the result. The caller eventually did some extra work/filtering on the returned list and then returned its own list result to its caller, etc., up to the top-level one, which finally emitted the resulting rows to the client.

What was not good before is that for queries which produced very numerous rows, the result was first entirely built in the livestatus process (blocking any other client from executing queries), which could take a lot of time.. that was a major problem.

What has already changed with commit d61b659 is that each (or most) of the functions involved in actually executing the queries no longer returns a list of resulting rows; they have been turned into generator functions..
So the response can in fact be "streamed".
But a generator comes at a small cost, and more generators in the pipe (which is the case here) come at a higher cost.

I may have turned too many of the list-returning functions into generators, or there may also be room for some simplification.

That doesn't explain the enormous difference you actually see, though.
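
To make the shape of that change concrete, a toy sketch (not the actual mod-livestatus code; the function names are made up):

# Toy sketch only, not the real module code: before, each layer built a
# complete list; after, each layer yields rows so the response can stream.

def filter_rows_as_list(rows, predicate):
    # old style: the whole filtered result is materialized in memory
    return [row for row in rows if predicate(row)]

def filter_rows_as_generator(rows, predicate):
    # new style: rows flow one by one, at a small per-row generator cost
    for row in rows:
        if predicate(row):
            yield row

# With several such stages stacked (parse -> execute -> sub-execute -> ...),
# the list version holds the full result before anything reaches the client,
# while the generator version streams but pays the generator overhead per row.
rows = ({"state": i % 4} for i in range(10))
for row in filter_rows_as_generator(rows, lambda r: r["state"] == 2):
    print(row)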

Thanks for the investigation.

Getting historical logs shouldn't be our main focus. If people want 3-month-old logs, they can set up a specific broker for that type of query.
From my point of view, we should focus on being fast on the "popular" queries:
hosts / services status
Livestatus must be fast on such queries.

Is it possible to revert only the generator part, to see if it comes from that?

Is it possible to revert only the generator part, to see if it comes from that?

It comes from that ;) well, at least the difference I see on my side, not the huge one you get, which must be something else.

I will give it a check.. but I have also begun working on a branch to improve that part, not by undoing things; the goal is to improve the global/average-case performance.

Seems I've got something OK.. I'll be able to retry the "heavy" queries to benchmark the result..

@olivierHa: hold on; I've got something which seems very good now (well, quite a bit better ;) )

If you want to test it quickly, it's here: https://github.com/shinken-monitoring/mod-livestatus/tree/enh_process_response_in_batch
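
If the branch name isn't clear: the idea is to process the response in batches instead of row by row. A minimal sketch of that idea (not the actual branch code):

# Rough sketch of the batching idea only, not the actual branch code:
# consume the row generator in chunks so the per-row overhead is amortized.

def in_batches(row_generator, batch_size=500):
    batch = []
    for row in row_generator:
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Hypothetical usage in the response path:
# for batch in in_batches(execute_query(query)):
#     client_socket.sendall(format_rows(batch))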

I will test that tomorrow. I'll keep you informed.

I'll probably make a second (much cleaner) branch though..

Do you want me to try this branch? Or should I wait for the new one?

You can already give it a test.. even though the Travis build failed on it.. I'll have to check.

For info: I just pushed 2 more (fix) commits on the branch, which now passes the Travis build completely..

Hurrah!!! ;) but damn, I'll have to double or triple check it..

@olivierHa
No test yet?
If/when you have time for that, could you also try this branch: https://github.com/shinken-monitoring/mod-livestatus/tree/enh_perf ?
Thanks.

You didn't see my IRC messages :(

Indeed, it is a lot better!