lukas-vlcek / bigdesk

Live charts and statistics for Elasticsearch cluster.

/_cluster/state scales poorly

robottaway opened this issue

Hi, loving the plugin! We started using it about a month ago and it has allowed my team to get a real-time view of what our nodes are doing. I'm hoping I can help fix an issue we have with using it on larger clusters.

On our larger clusters we have many customers and many indexes (~100). Calling "_cluster/state" is a guaranteed way to kill a browser: it is currently over 11 megabytes of data. Chrome kills the page, and other browsers outright crash.

Would it be possible to use the filters on the cluster state API to reduce the size while still maintaining functionality?

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-state.html

If I turn on the filter that excludes metadata, the size drops to 665 KB. It's much faster and would probably scale to clusters much larger than ours. Not sure whether that's doable.
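
For illustration, this is the kind of call I mean, using the filter_* query parameters from the cluster state API linked above (localhost:9200 is just a placeholder for one of our nodes, and the exact parameter names are worth double-checking against the docs for your version):

  curl -s 'http://localhost:9200/_cluster/state' | wc -c                       # full state, ~11 MB for us
  curl -s 'http://localhost:9200/_cluster/state?filter_metadata=true' | wc -c  # metadata filtered out, ~665 KB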

I don't want to preemptively fork and code. If you think this is something that warrants attention let me know. I will be extremely happy to add my coding efforts.

Hi, I have not tested Bigdesk on a sizeable cluster myself, so I am glad you report issues like this. Such feedback is very useful. 11 MB for a single _cluster/state response is really a lot for web browsers. If we are not using all the info then let's try to cut the size down by filtering. Agreed. Let me see what we can do about this.

It would also help me if you can share some of the following:

  • version of Elasticsearch that you need to be supported
  • number of nodes and indices per node (just my curiosity)
  • typical Bigdesk refresh rate and history window size

Hi Lukas, I'm heading out to lunch with co-workers, but when I get back I will gather this information for you... and thank you for such a fast response!

Versions range from 0.90.2 to 0.90.5.

The largest cluster has ~120 indexes (one per client on our platform), with sizes ranging from under 1 MB to ~100 MB (only one is near that size).

The refresh rate and history size are always left at the defaults, 2 seconds and 5 minutes I believe.

I should note we have a fair amount of custom settings and mappings per index. I think this might add a fair amount to the size of the cluster state resource.

[Screenshot: node resource usage, 2013-10-14 11:01 AM]

The above is the resource usage of a node; we only have one Bigdesk open and no other traffic. The instance is an EC2 m2.4xlarge. It seems we're using up quite a bit of resources just trying to keep Bigdesk going! I imagine it's the previously mentioned problem causing such CPU usage.

I spent last week upgrading our clusters. We are on 0.90.5 across the board now. I can help test and code.

It could be. I think we can really try to downsize the amount of data Bigdesk pulls now.

Fixed by #41
This will be part of the v2.2.2 release. I will release it within a day.

I have just released version 2.2.2.
Can you test it, please?
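
In case it helps, this is a sketch of the usual site-plugin install route I have in mind (the exact plugin manager syntax is worth double-checking against the Bigdesk README for your ES version):

  bin/plugin -remove bigdesk
  bin/plugin -install lukas-vlcek/bigdesk/2.2.2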

On it, thanks!

When I load the page it's actually responsive now. After running for 10+ minutes it becomes super slow and eventually crashes. I'll dig into that. For fun, here is the "experimental cluster diagram":

[Screenshot: experimental cluster diagram, 2013-10-14 4:16 PM]

Do I need to have this version of Bigdesk on every node? I only installed it on a sneak node that is out of discovery, on the edge of the cluster, where only I am using it for Bigdesk viewing. Maybe that is the trouble?

Attaching a screen grab of the Chrome dev tools network output. There are some 10+ MB responses being returned; I'm thinking it's because some of these nodes are still on the old version of Bigdesk.

[Screenshot: Chrome dev tools network output, 2013-10-14 4:20 PM]

OK, I think I accidentally installed the old 2.2.0 version again. Let me try this again.

Got the correct 2.2.2 version, and it looks a lot better. Seeing 1.6 MB of data on the _status/_all endpoint; I wonder if further trimming of that endpoint would help? Otherwise it's much snappier right away. I'm going to run it for a while.
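
For reference, this is roughly how I'm sizing it up; the host and index names are just placeholders, and I'm assuming the indices status API also accepts a comma-separated list of indices rather than everything at once:

  curl -s 'http://localhost:9200/_status' | wc -c                      # status for all indices, ~1.6 MB here
  curl -s 'http://localhost:9200/client-a,client-b/_status' | wc -c    # status for a subset of indices only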

[Screenshot: 2013-10-14 4:55 PM]

Re #40 (comment) - we are getting into abstract art territory with clusters this big. Soon we can open a gallery!

Re #40 (comment) - you only need to have this plugin on the single node that you connect your browser to. Or you do not need to install Bigdesk on any of the nodes at all: you can download (or git clone) Bigdesk to your filesystem, open index.html, and point it to the URL of one of the cluster node endpoints. You don't even have to download Bigdesk - just run it from the web and select the correct version from http://bigdesk.org/v/.

For instance you can open the following URL in your browser:

http://bigdesk.org/v/master/?endpoint=http://localhost:9200&connect=true&refresh=5000#cluster

Assuming that

  • you want to use the Bigdesk master version (in your case you can replace it with .../v/2.2.2/...)
  • the ES node is running on http://localhost:9200 (change that to the correct ES endpoint base)
  • you want to use a 5 sec refresh interval instead of the default 2 sec
  • you want Bigdesk to auto-connect (so you do not need to click the connect button explicitly)
  • you want to switch directly to the cluster tab

There are even more URL params that you can use.
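
Putting that together for your case, it might look like this (the host name here is just a placeholder for one of your node endpoints):

http://bigdesk.org/v/2.2.2/?endpoint=http://your-es-node:9200&connect=true&refresh=5000#cluster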

We can try to find more opportunities for cutting the data. I am also wondering whether it would make sense to open a ticket in Elasticsearch and ask for an option to compress the REST endpoint output. Transferring 1.6 MB over HTTP from an ES REST endpoint is still a lot. Given that this is just plain text data, compression could help a lot IMO.
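
For what it's worth, a quick way to check whether a node will already gzip responses when the client asks for it (the host below is a placeholder, and the http.compression node setting mentioned in the comment is an assumption worth verifying against the HTTP module docs):

  # If the response headers include "Content-Encoding: gzip", the node compresses when asked.
  # (Assumption: this is governed by an http.compression setting in elasticsearch.yml, off by default.)
  curl -s -D - -o /dev/null -H 'Accept-Encoding: gzip' 'http://localhost:9200/_cluster/state'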

Those diagrams get pretty cool looking. We have 2 large clusters, one with 150+ indices and another with 200+. ES has been really stable for us.

Would it help if I got you a copy of that 1.6 MB file? I can probably do that early tomorrow when I get to work.