shinken-monitoring / mod-livestatus

Shinken module for presenting data with a MK/Livestatus compatible interface

[Livestatus] Unexpected error after an arbiter reload

fullmetalucard opened this issue

Hi,

For a few days now, we have been encountering a new problem related to Livestatus.
We're running version 2.4.1 under Debian 8 with the following architecture:

  • 1 shinken master (with no poller running on it)
  • 4 pollers
  • 4 realms
  • Thruk 2.0 connected to 5 Livestatus endpoints (the master and the 4 realms).

The fact is that when we launch an arbiter reload, the broker goes haywire because of the Livestatus module. The Thruk interface becomes unusable although Livestatus still seems to be up.
Here is an example of the traceback in brokerd.log:

[1447026298] ERROR: [broker-1] [Livestatus] Unexpected error during process of request 'GET services\nColumns: accept_passive_checks acknowledged action_url action_url_expanded active_checks_enabled check_command check_interval check_options check_period check_type checks_enabled comments current_attempt current_notification_number description event_handler event_handler_enabled custom_variable_names custom_variable_values execution_time first_notification_delay flap_detection_enabled groups has_been_checked high_flap_threshold host_acknowledged host_action_url_expanded host_active_checks_enabled host_address host_alias host_checks_enabled host_check_type host_latency host_plugin_output host_perf_data host_current_attempt host_check_command host_comments host_groups host_has_been_checked host_icon_image_expanded host_icon_image_alt host_is_executing host_is_flapping host_name host_notes_url_expanded host_notifications_enabled host_scheduled_downtime_depth host_state host_accept_passive_checks host_last_state_change icon_image icon_image_alt icon_image_expanded is_executing is_flapping last_check last_notification last_state_change latency long_plugin_output low_flap_threshold max_check_attempts next_check notes notes_expanded notes_url notes_url_expanded notification_interval notification_period notifications_enabled obsess_over_service percent_state_change perf_data plugin_output process_performance_data retry_interval scheduled_downtime_depth state state_type modified_attributes_list last_time_critical last_time_ok last_time_unknown last_time_warning display_name host_display_name host_custom_variable_names host_custom_variable_values in_check_period in_notification_period host_parents\nFilter: host_has_been_checked = 0\nFilter: host_has_been_checked = 1\nFilter: host_state = 0\nAnd: 2\nOr: 2\nFilter: host_scheduled_downtime_depth = 0\nFilter: host_acknowledged = 0\nAnd: 2\nFilter: has_been_checked = 1\nFilter: state = 1\nAnd: 2\nFilter: has_been_checked = 1\nFilter: state = 3\nAnd: 2\nFilter: has_been_checked = 1\nFilter: state = 2\nAnd: 2\nOr: 3\nFilter: scheduled_downtime_depth = 0\nFilter: acknowledged = 0\nAnd: 2\nAnd: 4\nOutputFormat: json\nResponseHeader: fixed16\n\n' : 115536
[1447026298] ERROR: [broker-1] [Livestatus] Back trace of this exception: Traceback (most recent call last):
  File "/var/lib/shinken/modules/livestatus/livestatus_obj.py", line 74, in handle_request
    return self.handle_request_and_fail(data)
  File "/var/lib/shinken/modules/livestatus/livestatus_obj.py", line 135, in handle_request_and_fail
    output, keepalive = query.process_query()
  File "/var/lib/shinken/modules/livestatus/livestatus_query.py", line 283, in process_query
    return self.response.respond()
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 147, in respond
    responselength = 1 + self.get_response_len() # 1 for the final '\n'
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 142, in get_response_len
    if isinstance(rsp, LiveStatusListResponse)
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 83, in total_len
    for generated_data in value:
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 278, in make_live_data_generator
    for value in self.make_live_data_generator2(result, columns, aliases):
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 224, in make_live_data_generator2
    item = next(result)
  File "/var/lib/shinken/modules/livestatus/livestatus_query.py", line 46, in gen_filtered
    for val in values:
  File "/var/lib/shinken/modules/livestatus/livestatus_regenerator.py", line 125, in itersorted
    yield self.items[id]
KeyError: 115536
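
For context, the traceback ends in itersorted() in livestatus_regenerator.py, which yields the regenerated objects by id. The following is only a simplified reconstruction, not the module's actual code (the _sorted_ids attribute is a stand-in for whatever precomputed ordering the real module keeps), to show why a stale id turns into the KeyError above after a reload:

# Simplified reconstruction, NOT the actual module code: the regenerator keeps
# a dict of live objects keyed by id plus a precomputed display order. If a
# reload removes id 115536 from self.items while the ordering still references
# it, the lookup below raises KeyError: 115536.
class RegeneratedItems(object):
    def __init__(self):
        self.items = {}          # id -> regenerated host/service object
        self._sorted_ids = []    # ids in display order, built at load time

    def itersorted(self):
        for obj_id in self._sorted_ids:
            yield self.items[obj_id]  # KeyError if the id has been deleted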

The only workaround we found consists of restarting the broker each time we want to reload the arbiter (and this workaround leads to significant memory leaks...).
So as not to replace one problem with another, we searched and found that our issue could be related to issue #47.

We tried issuing the GET requests manually; when everything is fine, Livestatus answers correctly:

echo -e "GET hosts\n\n" | netcat localhost 50000

(this also works for queries about contacts, services, etc.)
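
For what it's worth, the same manual check can be scripted. This is only a minimal sketch; the host and port come from the netcat example above, and the livestatus_query helper itself is hypothetical:

# Hypothetical helper reproducing the manual netcat check from Python.
import socket

def livestatus_query(query, host="localhost", port=50000):
    """Send a Livestatus query and return the raw response as text."""
    sock = socket.create_connection((host, port))
    try:
        # A Livestatus request is terminated by an empty line.
        sock.sendall((query.rstrip("\n") + "\n\n").encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal end of request
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks).decode("utf-8")
    finally:
        sock.close()

print(livestatus_query("GET hosts\nColumns: name state\nOutputFormat: json"))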

Another thing we noticed is that the error seems to occur when Livestatus is queried frequently by Thruk: we never see these errors during the night or on weekends. So it might be related to the number of users/operators connected to Thruk.

Any help would be appreciated,

Regards

Hello, we still have this annoying problem. It gets worse as the infrastructure monitors more and more hosts.

Regards,

I don't know about this specific issue, but why is restarting the broker leading to memory issues?
I am currently restarting the broker(s) each time I restart the arbiter, and I don't see any memory leaks so far.

Could you be more specific?

Hi, it's clearly a bug related to big infrastructures.
We noticed this occurs only on brokers that manage a lot of checks.

To complete the information about our workaround, we made a shinken_reload alias that does the following (a rough script sketch follows the list):

  • config check
  • arbiter reload
  • wait 120 seconds
  • broker restart
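
A rough Python equivalent of that alias; the exact commands, especially the config-check invocation, are assumptions and may need adapting to your installation:

#!/usr/bin/env python
# Hypothetical sketch of the shinken_reload wrapper described above.
import subprocess
import sys
import time

def run(cmd):
    print("+ " + " ".join(cmd))
    return subprocess.call(cmd)

def main():
    # 1. config check (command assumed; adapt to your setup)
    if run(["shinken-arbiter", "-v", "-c", "/etc/shinken/shinken.cfg"]) != 0:
        sys.exit("configuration check failed, not reloading")
    # 2. arbiter reload
    run(["/etc/init.d/shinken-arbiter", "reload"])
    # 3. give the arbiter time to re-dispatch the configuration to the satellites
    time.sleep(120)
    # 4. broker restart, to work around the Livestatus KeyError
    run(["/etc/init.d/shinken-broker", "restart"])

if __name__ == "__main__":
    main()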

With this workaround, the platform seems more stable.
Our guess is that, by default, the arbiter is not fully ready to dispatch orders right after the config check, so we have to wait until it can communicate successfully with the broker.

I should also point out that our Shinken master isn't especially heavily loaded (load average = 2 on 8 PPC processors).
Stats: more than 600 hosts / 7,000 checks.

Regards


Hi,

I have EXACTLY the same architecture and the same problem.
I'm now seeing it on a second site (not PPC, but 20 realms).

What should I do to debug this (a Python debugger or something else)?

Anyway, thanks; Shinken is the best solution.

Some logs when the bug occurs.

Here:

  File "/var/lib/shinken/modules/livestatus/livestatus_obj.py", line 74, in handle_request
    return self.handle_request_and_fail(data)
  File "/var/lib/shinken/modules/livestatus/livestatus_obj.py", line 135, in handle_request_and_fail
    output, keepalive = query.process_query()
  File "/var/lib/shinken/modules/livestatus/livestatus_query.py", line 283, in process_query
    return self.response.respond()
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 147, in respond
    responselength = 1 + self.get_response_len() # 1 for the final '\n'
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 142, in get_response_len
    if isinstance(rsp, LiveStatusListResponse)
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 83, in total_len
    for generated_data in value:
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 278, in make_live_data_generator
    for value in self.make_live_data_generator2(result, columns, aliases):
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 224, in make_live_data_generator2
    item = next(result)
  File "/var/lib/shinken/modules/livestatus/livestatus_query.py", line 46, in gen_filtered
    for val in values:
  File "/var/lib/shinken/modules/livestatus/livestatus_regenerator.py", line 125, in itersorted
    yield self.items[id]

And here:

[1454934157] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934157] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934157] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934157] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934161] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934161] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934161] INFO: [broker-1] Connection OK to the scheduler scheduler-1
[1454934164] ERROR: [broker-1] [Livestatus Query] Error: 'Contactgroups' object has no attribute '__itersorted__'
[1454934165] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934165] ERROR: [broker-1] [Livestatus Query] Received a line of input which i can't handle: 'quit'
[1454934165] ERROR: [broker-1] [Livestatus Query] Received a line of input which i can't handle: 'exit'
[1454934165] WARNING: [broker-1] [Livestatus Query Metainfo] Received a line of input which i can't handle: 'quit'
[1454934165] WARNING: [broker-1] [Livestatus Query Metainfo] Received a line of input which i can't handle: 'exit'
[1454934173] ERROR: [broker-1] [Livestatus Query] Error: 'Contacts' object has no attribute '__itersorted__'
[1454934173] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934177] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934177] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934177] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934183] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934183] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934184] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934186] INFO: [broker-1] The module None is asking me to get all initial data from the scheduler 0
[1454934186] INFO: [broker-1] The module npcdmod is asking me to get all initial data from the scheduler 0
[1454934190] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934190] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934192] INFO: [broker-1] Connection OK to the scheduler scheduler-1
[1454934194] ERROR: [broker-1] [Livestatus Query] Error: 'Contactgroups' object has no attribute '__itersorted__'
[1454934195] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934198] ERROR: [broker-1] [Livestatus Query] Error: 'Contacts' object has no attribute '__itersorted__'
[1454934198] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934198] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934201] ERROR: [broker-1] [Livestatus Query] Error: 'Contactgroups' object has no attribute '__itersorted__'
[1454934204] ERROR: [broker-1] [Livestatus Query] Error: 'Contacts' object has no attribute '__itersorted__'

Hi, I knew I was not the only one ;)
I was also sure it wasn't related to the PPC architecture. We still have this annoying problem.
Our workaround only works intermittently. Our guess is that when Livestatus is constantly accessed (by Thruk sessions) during a reload, it goes haywire.
We are also wondering why updates to Livestatus seem to have been abandoned: no commits since last September and still a lot of open issues. Is it for commercial reasons?
Many people use Thruk/Livestatus and are not familiar with, or ready for, migrating their interface to WebUI.
@olivierHa: you reported a memory issue too (#63). As far as I know, the broker launches the Livestatus module, so the problem you reported could be linked. I insist that this really seems related to big infrastructures with many realms and hosts, because we don't have this problem on smaller infrastructures with the same OS and Shinken versions.

Well, the only thing I'm sure of is that something has to be done about Livestatus.

Thanks in advance for your help and patience.

Regards,

Hi,

I have a fresh install of Shinken (2.4.2) with WebUI2 and Livestatus for use with check_mk.
I have had the same problem since I activated Livestatus (steadily increasing RAM consumption).
Yesterday I disabled Livestatus in the broker configuration (but kept WebUI2 activated).

The graph below shows the problem:

[graph]

I need Livestatus for check_mk and NagVis.

How can I fix it?

Does anybody have an idea?
The problem is still there and it is a real blocker for us.

[graph160217]

I've seen similar behavior on one of our Shinken instances. It is reliably triggered by /etc/init.d/shinken-arbiter reload.

For now, I've applied an ugly patch that only treats the symptoms:

diff -u /usr/local/lib/python2.7/dist-packages/shinken/misc/regenerator.py.old /usr/local/lib/python2.7/dist-packages/shinken/misc/regenerator.py
--- /usr/local/lib/python2.7/dist-packages/shinken/misc/regenerator.py.old  2016-03-09 17:39:57.874430134 +0000
+++ /usr/local/lib/python2.7/dist-packages/shinken/misc/regenerator.py  2016-03-09 17:39:12.920622557 +0000
@@ -503,7 +503,7 @@
         # Clean hosts from hosts and hostgroups
         for h in to_del_h:
             safe_print("Deleting", h.get_name())
-            del self.hosts[h.id]
+            #del self.hosts[h.id]

         # Now clean all hostgroups too
         for hg in self.hostgroups:
@@ -514,7 +514,7 @@

         for s in to_del_srv:
             safe_print("Deleting", s.get_full_name())
-            del self.services[s.id]
+            #del self.services[s.id]

         # Now clean service groups
         for sg in self.servicegroups:

This is by no means a fix, so I'm not submitting a PR. I'm also still checking the installation itself.
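
An alternative, equally unofficial idea would be to make the lookup in itersorted() tolerant of ids that disappeared during a reload, instead of keeping the deleted objects around. A hedged sketch, assuming the same simplified structure reconstructed earlier in this thread (not a tested patch against the real module):

# Defensive variant of itersorted(): skip ids whose objects were removed by a
# reload instead of raising KeyError. _sorted_ids is the hypothetical ordering
# list used in the reconstruction above.
def itersorted(self):
    for obj_id in self._sorted_ids:
        obj = self.items.get(obj_id)
        if obj is not None:
            yield obj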


Could you tell me which version of CherryPy you are running?
pip list | grep Cherry

Hi, we're on CherryPy 3.8.0.

Any updates?
This issue is getting very annoying now that we have passed 10,000 monitored services.


Upgrade to 2.4.3, man.

We're already on 2.4.3, man.
The issue still exists.

Hi all,
I've got the same issue. Is there a fix for that? I'm on 2.4.3 too.


Same here; I would be grateful for a fix.

Hello,

I've got the same issue on a professional project and it's very embarrassing with our customer. We're monitoring about 5K hosts and 20K services!
The workaround rarely works.
Is there any hope of a fix?

Thanks in advance for your help!

+1 Hello all, same here.

I've had the same bug for a month.