sky-uk / feed

Nginx based Kubernetes ingress controller for AWS

Consider upstreaming

bprashanth opened this issue

Since feed is based on nginx, we already have an nginx backend (https://github.com/kubernetes/ingress/tree/master/controllers/nginx), and we're working on a generic controller with pluggable backends (https://github.com/kubernetes/ingress/blob/master/core/pkg/ingress/doc.go#L19), this seems like a good opportunity to collaborate, for greater justice.

Even if you don't feel it's worth your time right now, it might make sense to streamline how feed and the standard nginx backend handle nginx-specific features.

@balooo: Hello and thank you for your time today :-)

This sounds good to me. I'll talk to the team and see what they think. We are going live in production very soon, so we'll have to stick with this project for at least the short term. We still have some work planned for feed that we need to do very soon (moving to client-go, using endpoints instead of services); after that we can look into switching to upstream.
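
For illustration only, a minimal sketch of what "moving to client-go and watching Endpoints" could look like, using client-go's shared informers; the resync period, handler body, and overall structure are assumptions for this sketch, not feed's actual code:

```go
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// In-cluster config; clientcmd could be used instead when running outside the cluster.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// A shared informer keeps a local cache and fires handlers on add/update/delete,
	// rather than re-listing Endpoints from the apiserver on every change.
	factory := informers.NewSharedInformerFactory(clientset, 5*time.Minute)
	endpointsInformer := factory.Core().V1().Endpoints().Informer()

	endpointsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			ep := newObj.(*v1.Endpoints)
			// A real controller would diff addresses and update the proxy config here.
			fmt.Printf("endpoints %s/%s changed\n", ep.Namespace, ep.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, endpointsInformer.HasSynced)
	select {} // block forever; the informer keeps running
}
```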

FYI, the upstream interface should give you something like a snapshot of URLs + services + endpoints from its local cache on each update notification from the apiserver, so you can choose whether to go straight to the service VIP, service DNS name, service NodePort, or the endpoints themselves.
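
For illustration, a hypothetical Go shape for that per-update snapshot; these type and field names are invented here and are not the actual upstream interface:

```go
package ingress

// BackendTarget lists the ways a controller could reach a service:
// VIP, DNS name, NodePort, or the pod endpoints directly.
type BackendTarget struct {
	ServiceVIP string   // cluster IP of the Service
	ServiceDNS string   // e.g. "my-svc.my-ns.svc.cluster.local"
	NodePort   int32    // 0 if the Service exposes no NodePort
	Endpoints  []string // pod ip:port pairs, if the controller wants to bypass the Service
}

// Rule maps an ingress host/path to a backend target.
type Rule struct {
	Host    string
	Path    string
	Backend BackendTarget
}

// Snapshot is what the controller would receive from the local cache on each
// apiserver update notification; it then picks which target type to route to.
type Snapshot struct {
	Rules []Rule
}
```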

@bprashanth We're probably going to back away from adding endpoints to nginx. The problem is that reloading nginx frequently (as would happen with endpoint updates) is quite problematic:

  • When keepalive connections are involved, clients will frequently get stale connections, because nginx simply closes any connections it considers idle on a reload/graceful stop, without waiting for the keepalive_timeout.
  • Memory usage can balloon very quickly under load, as old nginx workers stick around after a reload.

So I think using endpoints with nginx is a bad idea. Ideally, the problem of persistent connections + endpoint updates needs to be solved at the kube-proxy layer.

I'd expect both sides to follow the TCP protocol? As long as nginx sends a FIN and does the normal close handshake, the client should tear down the connection, keepalive or not, unless there's a stateful NAT in between that disagrees with TCP timeouts?

Maybe you can describe what you see on a tcpdump in a controlled environment (or is this too hard to hit in testing)?

We're using an ELB in front of nginx, and we see ELB 504s returned under load when nginx does a reload. It isn't very many though, because ingress updates are pretty rare. As the ELB is a black box, I can't see what it's doing.

I will try to duplicate this and capture logs to understand it better.

I think something else may be going wrong, so I'll investigate further. Could be an ELB bug. I think somewhere in the ELB docs it says the ELB needs to close the backend connections. Akamai has a similar restriction.

@bprashanth tcpdump shows the nginx host sending tons of TCP resets to the ELB when a reload occurs. These translate directly into 504s returned from the ELB. I can also reproduce it by hitting nginx directly with wrk using HTTP/1.1 keepalives: I get connection resets. So far it seems nginx reload isn't as graceful as one would hope. I would expect it to close its connections gracefully, but it appears nginx simply closes the socket when idle rather than waiting for the connection to be terminated on the ELB/client side.
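
For reference, a rough Go stand-in for that wrk test: hammer nginx over HTTP/1.1 keepalive connections and count connection resets while a reload is triggered in another shell. The target URL, duration, and timeout are assumptions:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

func main() {
	target := "http://localhost:8080/" // assumed nginx address
	var ok, resets, otherErrs int

	// The default transport reuses idle keepalive connections between requests,
	// which is exactly what exposes the behaviour on reload.
	client := &http.Client{Timeout: 2 * time.Second}

	deadline := time.Now().Add(30 * time.Second)
	for time.Now().Before(deadline) {
		resp, err := client.Get(target)
		if err != nil {
			if strings.Contains(err.Error(), "connection reset") {
				resets++
			} else {
				otherErrs++
			}
			continue
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		ok++
	}
	fmt.Printf("ok=%d resets=%d other_errors=%d\n", ok, resets, otherErrs)
}
```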

We send a HUP to the master process. I'm pretty sure this is all expected: HTTP/1.1 (RFC 2616) says clients should be able to handle asynchronous (non-graceful) closes. Otherwise nginx workers would have to wait for some grace period to be sure the connections close.
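
For illustration, a minimal sketch of that reload path: read the nginx master pid and send it SIGHUP. The pidfile location is an assumption:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

func reloadNginx(pidFile string) error {
	data, err := os.ReadFile(pidFile)
	if err != nil {
		return fmt.Errorf("read pidfile: %w", err)
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return fmt.Errorf("parse pid: %w", err)
	}
	// SIGHUP tells the master to re-read config, spawn new workers and ask old
	// workers to stop gracefully (which, per the discussion above, still closes
	// idle keepalive connections straight away).
	return syscall.Kill(pid, syscall.SIGHUP)
}

func main() {
	if err := reloadNginx("/var/run/nginx.pid"); err != nil { // assumed pidfile path
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```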

I spoke with AWS support and they suggested I use the new Application Load Balancers (ALBs) instead. I tried one out, and indeed it works fine with nginx reloads - not a single 504. Apparently ALBs don't reuse persistent connections; instead it looks like they batch multiple requests through a connection and then close it. That seems to solve the problem for us, so with an ALB I can start updating endpoints with confidence :).

RST is still kind of rude; I get the feeling there's a more obvious FIN/fin-wait-timeout option, but I'd have to dig into the nginx config a bit. Great that you have it working. Unfortunately Service.Type=LoadBalancer doesn't set up an ALB, so maybe this is how we should finally manifest Ingress on AWS?

Right now it's an ELB created by Service.Type=LB in front of the nginx ingress controller. Is setting up an ALB just like setting up an ELB plus one last step to create something like a URL map? Or are all the resources in the pipeline completely different (as is the case on GCE)?

An ALB is very similar to an ELB, it seems; I didn't have to set up any URL mapping when I did it through the console. Unfortunately it is a different API, so I'll have to modify feed to use it.
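
For a rough idea of how different that API is, here is a hedged sketch using aws-sdk-go's elbv2 package to register an instance with an existing ALB target group; the region, target group ARN, and instance ID are placeholders, and this is not feed's actual implementation:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-west-1")}))
	svc := elbv2.New(sess)

	// ALBs route to target groups rather than directly to instances,
	// which is the main structural difference from classic ELBs.
	_, err := svc.RegisterTargets(&elbv2.RegisterTargetsInput{
		TargetGroupArn: aws.String("arn:aws:elasticloadbalancing:...:targetgroup/ingress/abc123"), // placeholder
		Targets: []*elbv2.TargetDescription{
			{Id: aws.String("i-0123456789abcdef0")}, // placeholder instance ID
		},
	})
	if err != nil {
		fmt.Println("register targets:", err)
		return
	}
	fmt.Println("registered ingress instance with ALB target group")
}
```

So besides the new API (elbv2 instead of elb), the controller also has to manage the extra target group and listener rule resources that ALBs introduce.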

If you find a way to change nginx's behaviour, please let me know. I spent a lot of time looking at config options and its source code, and it seems to do as I describe - it closes the socket immediately.

Apparently you can improve things by issuing a USR2 followed by WINCH, which will actually wait for connections to drain, but USR2 doesn't seem to work in containers.
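
For completeness, a hedged sketch of that USR2 + WINCH sequence, assuming nginx's default pidfile locations (on USR2, nginx renames the old master's pidfile to nginx.pid.oldbin); as noted above, this approach has problems inside containers:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
	"time"
)

// signalFromPidFile sends sig to the process whose pid is stored in pidFile.
func signalFromPidFile(pidFile string, sig syscall.Signal) error {
	data, err := os.ReadFile(pidFile)
	if err != nil {
		return err
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return err
	}
	return syscall.Kill(pid, sig)
}

func main() {
	// USR2 starts a new master (and new workers) alongside the old one.
	if err := signalFromPidFile("/var/run/nginx.pid", syscall.SIGUSR2); err != nil {
		fmt.Fprintln(os.Stderr, "USR2:", err)
		os.Exit(1)
	}
	time.Sleep(2 * time.Second) // crude wait for the new master to come up

	// WINCH asks the old master's workers to exit gracefully once their
	// connections have drained; the old master's pid is now in nginx.pid.oldbin.
	if err := signalFromPidFile("/var/run/nginx.pid.oldbin", syscall.SIGWINCH); err != nil {
		fmt.Fprintln(os.Stderr, "WINCH:", err)
		os.Exit(1)
	}
}
```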

I think the ALB is layer 7 only, so you couldn't do TCP load balancing with it.

I only tested with a layer 7 ELB, so for me the ALB was a drop-in replacement.

Ideally creating an Ingress on AWS should just work, no surprises, which apparently isn't the case right now since people are shoving a Service of Type=LB/ELB in front.

What is the alternative?

An ingress controller that spins up an ALB, like the GCE one? You can also tier them, so you have ALB -> nginx ingress. Or maybe we can find a way to set up a catch-all ALB instead of an ELB via annotations on the Service (not sure how similar they are in setup).

Just an update:

  1. We've pretty much given up on nginx as an ingress proxy. It's very difficult to get seamless reloads in nginx - it will frequently drop requests in proportion to how often you reload. After much research and spiking, we've come to the conclusion that haproxy is the superior, and only, viable solution for use as k8s ingress. We can reload many times a second without any dropped requests and no noticeable performance problems, unlike nginx. With https://www.haproxy.com/blog/truly-seamless-reloads-with-haproxy-no-more-hacks/ we don't even need to hack around the small issues in haproxy reloads.
  2. ALBs are also out, because we discovered that ALBs don't support connection draining. That means there is no way to do a zero-downtime redeployment of the ingress component. It's a huge shortcoming in ALBs, and we had to work with AWS before we got to the bottom of the problem. It isn't mentioned in the AWS docs - they told us they've raised an internal ticket to update the documentation. In the meantime, we will probably go back to ELBs. ELB + haproxy should give us truly zero downtime in the face of even heavy ingress updates.

Update #2 - as ingress doesn't reload very often, and the percentage of dropped requests is quite low, we haven't prioritised haproxy work. Otherwise the nginx proxy works great.