shinken-monitoring / mod-livestatus

Shinken module for presenting data through an MK/Livestatus compatible interface

Performance Issue

olivierHa opened this issue

A ticket to track my issue:

As already discussed on IRC, I'm seeing a "huge" performance issue with the newer livestatus and Shinken 2.4.

I've downgraded the livestatus module to the same commit "21ddfb567f8996a5efd94000104d5b0596c91174" as before (i.e. with Shinken 2.0.3) and it works well with Shinken 2.4. That leaves 4 pages of commits to review :)

I will try to cherry-pick commit by commit and see...

e9c08c7

This commit seems "safe" to use with Shinken 2.4 (no performance issue, and no modification needed to import it).

Let's continue :)

The last "working" commit is b9f2883

Despite its name, I think d61b659 is a performance regression.

Well, the commit is big :/. Ping @gst, he may be the best one to find what's wrong :)

Anyway, try to fill in more detail so that @gst can dig into this. What is the perf issue?
Do you have a sample query that takes too long to answer?

Like

echo -e "GET hosts\n\n" | nc localhost 50000

Livestatus/Thruk are unusable/unresponsive; queries take forever to respond (800 hosts/15k services).

Maybe a lock issue?

https://github.com/shinken-monitoring/mod-livestatus/blob/master/module/livestatus_client_thread.py#L65

Maybe this? On each LS request there is a copy of the broker db. It could be big and consume time, no?
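
Just to give an idea of what such a copy could cost, a rough illustration (fake data, not the actual broker db):

# Rough illustration only (fake dict, not the actual broker db): what a
# naive deep copy of a large structure can cost on every request.
import copy
import time

fake_db = {i: {"name": "host-%d" % i, "state": 0, "perf_data": [0.0] * 20}
           for i in range(20000)}

start = time.time()
snapshot = copy.deepcopy(fake_db)
print("deepcopy of %d entries took %.3fs" % (len(fake_db), time.time() - start))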

Hmm, so it's not a specific query that makes LS slow. I think the broker db is just the module instance, with nothing specific in it.

Did you manage to find a query that makes it slow?

IMO you should use the upstream version, because I think gst made some fixes after this, so we can debug it more easily.

I will test 1.4.1, but 1.4 isn't working. I don't see any changes between the two that could fix this issue, but why not :)

And for me every query is slow; it isn't specific to one query (like a query with history).

Hmm, the most efficient way may be to add some prints when parsing the query, etc., and just send the previous command and see where it is slow. Since you can reproduce it with every query, that's fine. But you have to be sure that you are the only one querying. I mean, if the LS is already choking under a huge amount of requests, it may not be that easy to find the "slow" part.
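
Something like this, where parse_query / execute_query / format_response are just placeholders for the real mod-livestatus functions (rough sketch only):

# Rough sketch only: the functions named in the comments are placeholders;
# the point is where to put the timing prints.
import time

def timed(label, func, *args, **kwargs):
    start = time.time()
    result = func(*args, **kwargs)
    print("%s took %.3fs" % (label, time.time() - start))
    return result

# in the request-handling path, for example:
# query = timed("parse", parse_query, raw_data)
# rows = timed("execute", execute_query, query)
# output = timed("format", format_response, rows)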

The upstream version is also very "slow".
To give you a concrete example, this query:

cat bin/check_services
GET services
Columns: last_state_change host_name description state acknowledged
Filter: state = 2
Filter: state = 3
Or: 2
Filter: acknowledged = 0
Filter: host_state = 0
Filter: in_check_period = 1
Filter: in_notification_period = 1
Filter: scheduled_downtime_depth = 0
Filter: notifications_enabled = 1

Using the upstream version:

time cat bin/check_services | nc 10.42.42.42 50000
[...]
real 0m15.909s
user 0m0.000s
sys 0m0.020s

Using commit "b9f2883e5f834e39634e02dca5d0bee69effc3b9":

time cat bin/check_services | nc 10.42.42.42 50000
[...]
real 0m0.950s
user 0m0.000s
sys 0m0.004s

You see the "performance" impact? :)

Great, thanks for the query and the commit/version & timing info :) that already gives a good starting point.
I'll first try to reproduce the issue and let you know..
What about your config? I saw somewhere about ~5k hosts and ~20k services.. is that it? Anything particular?
Also, what about the system load on the hosts/systems running the broker & scheduler when the query takes too long? And is it always reproducible, by the way?

I'll look into that more later though.

Indeed, 5k hosts/20k services, and yes always reproducible.

Did you try with and without threads on?

My first results show almost the same processing time in both versions, let's say 10-25% more for @master, but absolutely nothing like the ~1500% more in your result..

But I have everything running locally (and my laptop battery is nearly dead, I'm out at the moment, damn..) with 5k hosts & 20k services pointing to 127.0.0.1, the poller polling, etc. My system load was almost continuously > 3.5.. not the best conditions, but that tells me my config must not match yours closely enough (i.e. it's not only an effect of such a big config).

What is reassuring is that I got the same results in both versions; do you also get the same results in both versions?

Anyway: would it be possible to get your config, more or less? That would really ease my investigation, as I'd be sure I'm running the same config (or as close as possible).

What "tests" are you running on both instances ?
Could you try for example to launch 10 query (different or not) at the
same time ?
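
For example something quick like this (the host/port and the query are placeholders, adjust them to your setup):

# Fire 10 identical LQL queries in parallel; host/port and the query below
# are placeholders, adjust them to your setup.
import socket
import threading
import time

QUERY = "GET services\nColumns: host_name description state\n\n"

def run_query(host="10.42.42.42", port=50000):
    start = time.time()
    sock = socket.create_connection((host, port))
    sock.sendall(QUERY.encode())
    sock.shutdown(socket.SHUT_WR)   # tell livestatus we are done sending
    data = b""
    while True:
        chunk = sock.recv(65536)
        if not chunk:
            break
        data += chunk
    sock.close()
    print("%d bytes in %.2fs" % (len(data), time.time() - start))

threads = [threading.Thread(target=run_query) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()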

no "tests", I just let everything running, my scheduler.. schedules checks, the poller .. executes the checks, my broker receives the broks and process them.. and I then just execute your LS query the exact same way than you..

but I'll give another check in about 2 hour, when I'll have recovered my battery charger ;)

about your config, possible to get it more or less the same ?

@olivierHa

I gave it a second try (executed the query about 50 times in a row and averaged the results, which anyway were about the same for every execution, as hopefully expected) and got the same outcome: master is "only" 15-25% slower than b9f2883.. on my particular config (some services configured to succeed and others to fail). For info: the query produced up to 8k result rows (in both cases).
The processing time was on average 2, sometimes ~3 seconds, which, given the load (> 1-2, sometimes 3) generated by all the daemons running locally, isn't that bad..

I really need to know more about your config - at best to get a copy of it (sanitized if you wish, of course)..

A comment about the 15-25% difference I see for the master version vs b9f2883:

It can be explained relatively easily: the commit coming after b9f2883 (d61b659) effectively changed a lot of things.. (now I regret not having split it into at least 4-5 commits, but it won't be impossible to improve things).
One of the changes is how queries are actually executed. Basically, this is the kind of thing that happened before:

the function which parses the query calls the function which executes the query from a high-level viewpoint, which calls the function which sub-executes the query, which calls other sub-sub-functions, which ..

up to 3-4 or even 5 (or more) levels deep (depending on the query itself).

Each of these functions basically returned a list of the rows of the result. The caller eventually did some extra work/filtering on the returned list and then returned its own list result to its caller, etc., up to the top-level one, which finally emitted the resulting rows to the client.

What was not good before is that for queries which produced very numerous rows, the result was first entirely built in the livestatus process (blocking any other client from executing queries), which could take a lot of time.. that was a major problem.

What has already changed with commit d61b659 is that each (or most) of the functions involved in actually executing the queries no longer returns a list of resulting rows; they have been turned into generator functions..
So the response can in fact be "streamed".
But a generator comes at a small cost, and more generators in the pipe (which is the case here) come at a higher cost.

I may have turned too many of the list-returning functions into generators, or there may also be room for some simplification.

That doesn't explain the enormous difference you actually see, though.
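
To make the shape of that change concrete, a toy sketch (not the actual mod-livestatus code; the function names are made up):

# Toy sketch only, not the real module code: before, each layer built a
# complete list; after, each layer yields rows so the response can stream.

def filter_rows_as_list(rows, predicate):
    # old style: the whole filtered result is materialized in memory
    return [row for row in rows if predicate(row)]

def filter_rows_as_generator(rows, predicate):
    # new style: rows flow one by one, at a small per-row generator cost
    for row in rows:
        if predicate(row):
            yield row

# With several such stages stacked (parse -> execute -> sub-execute -> ...),
# the list version holds the full result before anything reaches the client,
# while the generator version streams but pays the generator overhead per row.
rows = ({"state": i % 4} for i in range(10))
for row in filter_rows_as_generator(rows, lambda r: r["state"] == 2):
    print(row)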

Thanks for the investigation.

Getting historical logs shouldn't be our main focus. If people want 3-month-old logs, they can set up a specific broker for that type of query.
From my point of view, we should focus on being fast on the "popular" queries:
hosts / services status
Livestatus must be fast on such queries.

Is it possible to revert only the generator part, to see if it comes from that?

Is it possible to revert only the generator part, to see if it comes from that?

It comes from that ;) well, at least the difference I see on my side, not the huge one you get, which must be something else.

I will give it a check.. but I have also begun working on a branch to improve that part, not by undoing things; the goal is to improve the global/average-case performance.

Seems I've got something OK.. I'll be able to retry the "heavy" queries to benchmark the result..

@olivierHa: hold on; I've got something which seems very good now (well, quite a bit better ;) )

If you want to test it quickly, it's here: https://github.com/shinken-monitoring/mod-livestatus/tree/enh_process_response_in_batch
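
If the branch name isn't clear: the idea is to process the response in batches instead of row by row. A minimal sketch of that idea (not the actual branch code):

# Rough sketch of the batching idea only, not the actual branch code:
# consume the row generator in chunks so the per-row overhead is amortized.

def in_batches(row_generator, batch_size=500):
    batch = []
    for row in row_generator:
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Hypothetical usage in the response path:
# for batch in in_batches(execute_query(query)):
#     client_socket.sendall(format_rows(batch))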

I will test that tomorrow. I'll keep you informed.

I'll probably make a second (much cleaner) branch though..

Do you want me to try this branch? Or should I wait for the new one?

You can already give it a test.. even though the Travis build failed on it.. I'll have to check.

For info: I just pushed 2 more (fix) commits on the branch, which now passes the Travis build completely..

Hurrah!!! ;) but damn, I'll have to double or triple check it..

@olivierHa
No test yet?
If/when you have time for that, could you also try this branch: https://github.com/shinken-monitoring/mod-livestatus/tree/enh_perf ?
Thanks.

You didn't see my IRC messages :(

Indeed, it is a lot better!