lukas-vlcek / bigdesk

Live charts and statistics for Elasticsearch cluster.

/_cluster/state scales poorly

robottaway opened this issue

Hi, loving the plugin! We started using it about a month ago and it has allowed my team to get a real-time view of what our nodes are doing. I'm hoping I can help fix an issue we have with using it on larger clusters.

On our larger clusters we have many customers and many indexes (~100). Calling "_cluster/state" is a guaranteed way to kill a browser: it is currently over 11 megabytes of data. Chrome kills the page, and other browsers outright crash.

Would it be possible to use the filters on the cluster state API to reduce the size while still maintaining functionality?

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-state.html

If I turn on the filter that excludes metadata, the size drops to 665 KB. It's much faster and would probably scale to clusters much larger than ours. Not sure whether that's doable.
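
For illustration, this is the kind of call I mean, using the filter_* query parameters from the cluster state API linked above (localhost:9200 is just a placeholder for one of our nodes, and the exact parameter names are worth double-checking against the docs for your version):

  curl -s 'http://localhost:9200/_cluster/state' | wc -c                       # full state, ~11 MB for us
  curl -s 'http://localhost:9200/_cluster/state?filter_metadata=true' | wc -c  # metadata filtered out, ~665 KB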

I don't want to preemptively fork and code. If you think this is something that warrants attention let me know. I will be extremely happy to add my coding efforts.

Hi, I have not tested Bigdesk on a sizeable cluster myself, so I am glad you report issues like this. Such feedback is very useful. 11 MB for a single _cluster/state response is really a lot for web browsers. If we are not using all the info then let's try to cut the size down by filtering. Agreed. Let me see what we can do about this.

It would also help me if you can share some of the following:

  • version of Elasticsearch that you need to be supported
  • number of nodes and indices per node (just my curiosity)
  • typical Bigdesk refresh rate and history window size

Hi Lukas, I'm heading out to lunch with co-workers, but when I get back I will gather this information for you... and thank you for such a fast response!

Versions range from 0.90.2 to 0.90.5.

The largest cluster has ~120 indexes (one per client on our platform), with sizes ranging from under 1 MB to ~100 MB (only one is near that size).

The refresh rate and history size are always left at the defaults, 2 seconds and 5 minutes I believe.

I should note we have a fair amount of custom settings and mappings per index. I think this might add a fair amount to the size of the cluster state resource.

[Screenshot: node resource usage, 2013-10-14 11:01 AM]

The above is the resource usage of a node; we only have one Bigdesk open and no other traffic. The instance is an EC2 m2.4xlarge. It seems we're using up quite a bit of resources just trying to keep Bigdesk going! I imagine it's the previously mentioned problem causing such CPU usage.

I spent last week upgrading our clusters. We are on 0.90.5 across the board now. I can help test and code.

It could be. I think we can really try to downsize the amount of data Bigdesk pulls now.

Fixed by #41
This will be part of the v2.2.2 release. I will release it within a day.

I have just released version 2.2.2.
Can you test it, please?
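
In case it helps, this is a sketch of the usual site-plugin install route I have in mind (the exact plugin manager syntax is worth double-checking against the Bigdesk README for your ES version):

  bin/plugin -remove bigdesk
  bin/plugin -install lukas-vlcek/bigdesk/2.2.2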

On it, thanks!

When I load the page it's actually responsive now. After running for 10+ minutes it becomes super slow and eventually crashes. I'll dig into that. For fun, here is the "experimental cluster diagram":

[Screenshot: experimental cluster diagram, 2013-10-14 4:16 PM]

Do I need to have this version of Bigdesk on every node? I only installed it on a sneak node that is out of discovery, on the edge of the cluster, where only I am using it for Bigdesk viewing. Maybe that is the trouble?

Attaching a screen grab of the Chrome dev tools network output. There are some 10+ MB responses being returned; I'm thinking it's because some of these nodes are still on the old version of Bigdesk.

[Screenshot: Chrome dev tools network output, 2013-10-14 4:20 PM]

OK, I think I accidentally installed the old 2.2.0 version again. Let me try this again.

Got the correct 2.2.2 version, and it looks a lot better. Seeing 1.6 MB of data on the _status/_all endpoint; I wonder if further trimming of that endpoint would help? Otherwise it's much snappier right away. I'm going to run it for a while.
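
For reference, this is roughly how I'm sizing it up; the host and index names are just placeholders, and I'm assuming the indices status API also accepts a comma-separated list of indices rather than everything at once:

  curl -s 'http://localhost:9200/_status' | wc -c                      # status for all indices, ~1.6 MB here
  curl -s 'http://localhost:9200/client-a,client-b/_status' | wc -c    # status for a subset of indices only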

[Screenshot: 2013-10-14 4:55 PM]

Re #40 (comment) - we are getting into abstract art territory with clusters this big. Soon we can open a gallery!

Re #40 (comment) - you only need to have this plugin on the single node that you connect your browser to. Or you do not need to install Bigdesk on any of the nodes at all: you can download (or git clone) Bigdesk to your filesystem, open index.html, and point it to the URL of one of the cluster node endpoints. You don't even have to download Bigdesk - just run it from the web and select the correct version from http://bigdesk.org/v/.

For instance you can open the following URL in your browser:

http://bigdesk.org/v/master/?endpoint=http://localhost:9200&connect=true&refresh=5000#cluster

Assuming that

  • you want to use the Bigdesk master version (in your case you can replace it with .../v/2.2.2/...)
  • the ES node is running on http://localhost:9200 (change that to the correct ES endpoint base)
  • you want to use a 5 sec refresh interval instead of the default 2 sec
  • you want Bigdesk to auto-connect (so you do not need to click the connect button explicitly)
  • you want to switch directly to the cluster tab

There are even more URL params that you can use.
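
Putting that together for your case, it might look like this (the host name here is just a placeholder for one of your node endpoints):

http://bigdesk.org/v/2.2.2/?endpoint=http://your-es-node:9200&connect=true&refresh=5000#cluster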

We can try to find more opportunities for cutting the data. I am also wondering whether it would make sense to open a ticket in Elasticsearch and ask for an option to compress the REST endpoint output. Transferring 1.6 MB over HTTP from an ES REST endpoint is still a lot. Given that this is just plain text data, compression could help a lot IMO.
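
For what it's worth, a quick way to check whether a node will already gzip responses when the client asks for it (the host below is a placeholder, and the http.compression node setting mentioned in the comment is an assumption worth verifying against the HTTP module docs):

  # If the response headers include "Content-Encoding: gzip", the node compresses when asked.
  # (Assumption: this is governed by an http.compression setting in elasticsearch.yml, off by default.)
  curl -s -D - -o /dev/null -H 'Accept-Encoding: gzip' 'http://localhost:9200/_cluster/state'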

Those diagrams get pretty cool looking. We have 2 large clusters, one with 150+ indices and another with 200+. ES has been really stable for us.

Would it help if I got you a copy of that 1.6 MB file? I can probably do that early tomorrow when I get to work.