giantswarm / aws-operator

Manages Kubernetes clusters running on AWS (before Cluster API)

Home Page: https://www.giantswarm.io/

Idle Timeout for Elastic Load Balancers should be configurable

hobti01 opened this issue

When using Helm/Tiller, deployments may take longer than the 60-second default timeout when communicating with the API Server. While the precise issue is related to the API Server, it is logical to allow configuration of all ELBs.
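
For context, the idle timeout on a Classic ELB is a per-load-balancer attribute. A minimal sketch of how it can be changed, assuming aws-sdk-go v1 and a placeholder load balancer name (not the operator's actual code):

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elb"
)

func main() {
	// Set up a session and a Classic ELB client from the default credential chain.
	sess := session.Must(session.NewSession())
	svc := elb.New(sess)

	// Raise the connection idle timeout from the 60-second default to 300 seconds.
	_, err := svc.ModifyLoadBalancerAttributes(&elb.ModifyLoadBalancerAttributesInput{
		LoadBalancerName: aws.String("example-cluster-api"), // placeholder name
		LoadBalancerAttributes: &elb.LoadBalancerAttributes{
			ConnectionSettings: &elb.ConnectionSettings{
				IdleTimeout: aws.Int64(300),
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```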

@hobti01 Thanks so much for raising this and the PRs!

Yes, I can see how the timeout could affect Helm deployments. I'll discuss this with the team and get back to you.

Ping @puja108 @fgimenez

commented

Looks like this also caused Calico (confd) issues. If the cluster was scaled, new workers could not properly join BGP peers, because confd missed etcd events and did not reconfigure bird. We don't see this issue in on-prem guest clusters, because the load balancer there doesn't have this kind of timeout, or has a really long one (e.g. multiple hours).

commented

etcd uses a different load balancer, not the one used for ingress, so I'll create a separate issue.

The aws-operator change is deployed. We still need to set the timeout values in the cluster custom object; see the sketch below. I'd propose:

  • API - 300 seconds
  • Etcd - 3600 seconds
  • Ingress - 300 seconds

Setting etcd to 3600 secs will resolve #464 raised by Roman.
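
As a rough sketch of how these values might be modelled in the cluster custom object spec, assuming Go types similar to the ones aws-operator uses. The field names here are hypothetical and the real spec may differ:

```go
// Hypothetical idle timeout settings on the cluster custom object.
// Field names are illustrative only; the actual aws-operator spec may differ.
type ELBIdleTimeouts struct {
	API     int `json:"api"`     // seconds, proposed 300
	Etcd    int `json:"etcd"`    // seconds, proposed 3600
	Ingress int `json:"ingress"` // seconds, proposed 300
}
```

A zero value could fall back to the AWS default of 60 seconds, so existing custom objects that don't set these fields keep today's behaviour.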

@r7vme @calvix Are you OK with these values?

@hobti01 Is 300 secs high enough for apiserver to resolve your Helm problems?

commented

Etcd 3600 is OK, but I'm not sure about the others.

The AWS idle timeout is a last resort for dropping stuck connections (there is also kernel TCP handling and application-level logic). On one hand, a short idle timeout can protect us from some attacks. On the other hand, the API has a lot of functionality that uses long-lived connections (e.g. watches, logs, execs).

I've searched Google for "best practices" for the k8s API. The only thing I found is that DEIS recommends using 1200 sec. So from my side I think it makes sense to start with 1200 sec for API and Ingress as well.

@r7vme Thanks, OK, let's go with 1200 for API and Ingress. I'll update kubernetesd to set these values.

@rossf7 Deploying the Elastic Stack with master and data nodes exceeds 300 seconds. Right now we are using a Helm timeout of 600 and an API server timeout of 900, which is OK. We'd be happy with defaults of 1200.

The current aws-operator seems to keep resetting the timeout to 60 seconds ;)

@hobti01 Yes, I'm afraid it will keep resetting to 60 secs because the idle timeouts are not set in the cluster custom object.

Once the kubernetesd change is made the timeouts will be set for new clusters. I'll check if we can set the timeouts for existing clusters.
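
For anyone who wants to check what an existing cluster's ELB is currently set to, a small sketch under the same assumptions as above (aws-sdk-go v1, placeholder load balancer name):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elb"
)

func main() {
	svc := elb.New(session.Must(session.NewSession()))

	// Read back the attributes of one load balancer and print its idle timeout.
	out, err := svc.DescribeLoadBalancerAttributes(&elb.DescribeLoadBalancerAttributesInput{
		LoadBalancerName: aws.String("example-cluster-api"), // placeholder name
	})
	if err != nil {
		log.Fatal(err)
	}
	if cs := out.LoadBalancerAttributes.ConnectionSettings; cs != nil {
		fmt.Printf("idle timeout: %d seconds\n", aws.Int64Value(cs.IdleTimeout))
	}
}
```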