Idle timeout for etcd should be at least 1 hour

Question

r7vme opened this issue 7 years ago

Many services "watch" etcd, so connections to etcd are expected to stay open rather than be dropped after 60 seconds (the current ELB idle timeout).

I see two solutions:

1. increase the idle timeout to 1 hour
2. switch from the ELB to a DNS CNAME that points directly to the master node
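
As a rough sketch of the first option, assuming the etcd endpoint sits behind a Classic ELB and is managed with boto3: the helper below only builds the request, and the load balancer name `guest-cluster-etcd` and the function name are hypothetical. The 3600 second ceiling is the 60 minute maximum mentioned in this thread.

```python
# Sketch only: builds the request for raising a Classic ELB idle timeout.
# The load balancer name and the helper name are hypothetical.

MAX_IDLE_TIMEOUT_SECONDS = 3600  # 60 min, the maximum discussed in this thread


def idle_timeout_request(load_balancer_name: str, idle_timeout_seconds: int) -> dict:
    """Return keyword arguments for boto3's ELB modify_load_balancer_attributes."""
    if not 1 <= idle_timeout_seconds <= MAX_IDLE_TIMEOUT_SECONDS:
        raise ValueError(
            f"idle timeout must be between 1 and {MAX_IDLE_TIMEOUT_SECONDS} seconds"
        )
    return {
        "LoadBalancerName": load_balancer_name,
        "LoadBalancerAttributes": {
            "ConnectionSettings": {"IdleTimeout": idle_timeout_seconds},
        },
    }


request = idle_timeout_request("guest-cluster-etcd", 3600)
print(request)

# With boto3 installed and AWS credentials configured, this could be applied as:
#   import boto3
#   boto3.client("elb").modify_load_balancer_attributes(**request)
```

Keeping the request construction separate from the AWS call makes the value easy to check before anything is changed on a live cluster.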

This timeout is probably the root cause of the Calico/confd issue where new nodes cannot join Calico because existing nodes missed events from etcd. https://github.com/giantswarm/giantswarm/issues/1687#issuecomment-328551514

That happens in customer guest clusters periodically.

cc: @teemow @puja108 @rossf7

calvix · Answer 1 · Wed Oct 18 2017 22:57:44 GMT+0800 (China Standard Time)

2 is not possible atm (we discussed this on sig-updates), so let's start with 1. It should be a super simple hack.

Puja · Answer 2 · Wed Oct 18 2017 23:06:54 GMT+0800 (China Standard Time)

There are some PRs by Tim from IC Consult that will make these timeouts configurable through the TPR:
https://github.com/giantswarm/awstpr/pull/45/files

Ross Fairbanks · Answer 3 · Wed Oct 18 2017 23:15:35 GMT+0800 (China Standard Time)

Yes, the change from Tim @ IC Consult is to use the same timeout for all 3 ELBs. If there is no need to have separate values then I think we can go with that.

Puja · Answer 4 · Wed Oct 18 2017 23:25:19 GMT+0800 (China Standard Time)

Oh, I thought these were separate values. I would vote for separate values, as I'm not so sure we want to just increase to the maximum (60 min) for all ELBs.

Ross Fairbanks · Answer 5 · Wed Oct 18 2017 23:28:43 GMT+0800 (China Standard Time)

The awstpr change does have separate timeouts. Ignore me on this!
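
To illustrate what separate values could look like, here is a minimal sketch. It is not the awstpr schema: the field names `etcd`, `api` and `ingress` are assumptions about which three ELBs are meant, and the 60 second default is taken from the current timeout quoted at the top of this issue.

```python
from dataclasses import asdict, dataclass

DEFAULT_SECONDS = 60   # current ELB idle timeout quoted in this issue
MAX_SECONDS = 3600     # 60 min, the maximum mentioned above


@dataclass(frozen=True)
class IdleTimeouts:
    """Per-ELB idle timeouts in seconds. Field names are hypothetical."""

    etcd: int = MAX_SECONDS        # long-lived watches need the long timeout
    api: int = DEFAULT_SECONDS
    ingress: int = DEFAULT_SECONDS

    def __post_init__(self) -> None:
        # Reject values outside the range assumed for these load balancers.
        for name, seconds in asdict(self).items():
            if not 1 <= seconds <= MAX_SECONDS:
                raise ValueError(f"{name}: {seconds} is outside 1..{MAX_SECONDS}")


timeouts = IdleTimeouts()
print(asdict(timeouts))
```

With separate fields, only the etcd ELB gets the 1 hour value and the other two keep whatever default is chosen for them.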

Ross Fairbanks · Answer 6 · Wed Oct 25 2017 17:37:46 GMT+0800 (China Standard Time)

Idle timeout set to 3600 secs in #445