sky-uk / feed

Nginx based Kubernetes ingress controller for AWS

Consider upstreaming

bprashanth opened this issue

Since feed is based on nginx, we already have an nginx backend (https://github.com/kubernetes/ingress/tree/master/controllers/nginx), and we're working on a generic controller with pluggable backends (https://github.com/kubernetes/ingress/blob/master/core/pkg/ingress/doc.go#L19), this seems like a good opportunity to collaborate, for greater justice.

Even if you don't feel it's worth your time right now, it might make sense to streamline how feed and the standard nginx backend handle nginx-specific features.

@balooo: Hello and thank you for your time today :-)

This sounds good to me. I'll talk to the team and see what they think. We are going live in production very soon, so we'll have to stick with this project for at least the short term. We still have some work planned for feed that we need to do very soon (moving to client-go, using endpoints instead of services); after that we can look into switching to upstream.
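
For illustration only, a minimal sketch of what "moving to client-go and watching Endpoints" could look like, using client-go's shared informers; the resync period, handler body, and overall structure are assumptions for this sketch, not feed's actual code:

```go
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// In-cluster config; clientcmd could be used instead when running outside the cluster.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// A shared informer keeps a local cache and fires handlers on add/update/delete,
	// rather than re-listing Endpoints from the apiserver on every change.
	factory := informers.NewSharedInformerFactory(clientset, 5*time.Minute)
	endpointsInformer := factory.Core().V1().Endpoints().Informer()

	endpointsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			ep := newObj.(*v1.Endpoints)
			// A real controller would diff addresses and update the proxy config here.
			fmt.Printf("endpoints %s/%s changed\n", ep.Namespace, ep.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, endpointsInformer.HasSynced)
	select {} // block forever; the informer keeps running
}
```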

FYI, the upstream interface should give you something like a snapshot of URLs + services + endpoints from its local cache on each update notification from the apiserver, so you can choose whether to go straight to the service VIP, service DNS name, service NodePort, or the endpoints themselves.
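
For illustration, a hypothetical Go shape for that per-update snapshot; these type and field names are invented here and are not the actual upstream interface:

```go
package ingress

// BackendTarget lists the ways a controller could reach a service:
// VIP, DNS name, NodePort, or the pod endpoints directly.
type BackendTarget struct {
	ServiceVIP string   // cluster IP of the Service
	ServiceDNS string   // e.g. "my-svc.my-ns.svc.cluster.local"
	NodePort   int32    // 0 if the Service exposes no NodePort
	Endpoints  []string // pod ip:port pairs, if the controller wants to bypass the Service
}

// Rule maps an ingress host/path to a backend target.
type Rule struct {
	Host    string
	Path    string
	Backend BackendTarget
}

// Snapshot is what the controller would receive from the local cache on each
// apiserver update notification; it then picks which target type to route to.
type Snapshot struct {
	Rules []Rule
}
```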

@bprashanth We're probably going to back away from adding endpoints to nginx. The problem is that reloading nginx frequently (as would happen with endpoint updates) is quite problematic:

  • When keepalive connections are involved, clients will frequently get stale connections, because nginx simply closes any connections it considers idle on a reload/graceful stop, without waiting for the keepalive_timeout.
  • Memory usage can balloon very quickly under load, as old nginx workers stick around after a reload.

So I think using endpoints with nginx is a bad idea. Ideally, the problem of persistent connections + endpoint updates needs to be solved at the kube-proxy layer.

I'd expect both sides to follow the TCP protocol? As long as nginx sends a FIN and does the normal close handshake, the client should tear down the connection, keepalive or not, unless there's a stateful NAT in between that disagrees with TCP timeouts?

Maybe you can describe what you see on a tcpdump in a controlled environment (or is this too hard to hit in testing)?

We're using an ELB in front of nginx, and we see ELB 504s returned under load when nginx does a reload. It isn't very many though, because ingress updates are pretty rare. As the ELB is a black box, I can't see what it's doing.

I will try to duplicate this and capture logs to understand it better.

I think something else may be going wrong, so I'll investigate further. Could be an ELB bug. I think somewhere in the ELB docs it says the ELB needs to close the backend connections. Akamai has a similar restriction.

@bprashanth tcpdump shows the nginx host sending tons of TCP resets to the ELB when a reload occurs. These translate directly into 504s returned from the ELB. I can also reproduce it by hitting nginx directly with wrk using HTTP/1.1 keepalives: I get connection resets. So far it seems nginx reload isn't as graceful as one would hope. I would expect it to close its connections gracefully, but it appears nginx simply closes the socket when idle rather than waiting for the connection to be terminated on the ELB/client side.
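
For reference, a rough Go stand-in for that wrk test: hammer nginx over HTTP/1.1 keepalive connections and count connection resets while a reload is triggered in another shell. The target URL, duration, and timeout are assumptions:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

func main() {
	target := "http://localhost:8080/" // assumed nginx address
	var ok, resets, otherErrs int

	// The default transport reuses idle keepalive connections between requests,
	// which is exactly what exposes the behaviour on reload.
	client := &http.Client{Timeout: 2 * time.Second}

	deadline := time.Now().Add(30 * time.Second)
	for time.Now().Before(deadline) {
		resp, err := client.Get(target)
		if err != nil {
			if strings.Contains(err.Error(), "connection reset") {
				resets++
			} else {
				otherErrs++
			}
			continue
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		ok++
	}
	fmt.Printf("ok=%d resets=%d other_errors=%d\n", ok, resets, otherErrs)
}
```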

We send a HUP to the master process. I'm pretty sure this is all expected: HTTP/1.1 (RFC 2616) says clients should be able to handle asynchronous (non-graceful) closes. Otherwise nginx workers would have to wait for some grace period to be sure the connections close.
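
For illustration, a minimal sketch of that reload path: read the nginx master pid and send it SIGHUP. The pidfile location is an assumption:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

func reloadNginx(pidFile string) error {
	data, err := os.ReadFile(pidFile)
	if err != nil {
		return fmt.Errorf("read pidfile: %w", err)
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return fmt.Errorf("parse pid: %w", err)
	}
	// SIGHUP tells the master to re-read config, spawn new workers and ask old
	// workers to stop gracefully (which, per the discussion above, still closes
	// idle keepalive connections straight away).
	return syscall.Kill(pid, syscall.SIGHUP)
}

func main() {
	if err := reloadNginx("/var/run/nginx.pid"); err != nil { // assumed pidfile path
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```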

I spoke with AWS support and they suggested I use the new Application Load Balancers (ALBs) instead. I tried one out, and indeed it works fine with nginx reloads - not a single 504. Apparently ALBs don't reuse persistent connections; instead it looks like they batch multiple requests through a connection and then close it. That seems to solve the problem for us, so with an ALB I can start updating endpoints with confidence :).

RST is still kind of rude; I get the feeling there's a more obvious FIN/fin-wait-timeout option, but I'd have to dig into the nginx config a bit. Great that you have it working. Unfortunately Service.Type=LoadBalancer doesn't set up an ALB, so maybe this is how we should finally manifest Ingress on AWS?

Right now it's an ELB created by Service.Type=LB in front of the nginx ingress controller. Is setting up an ALB just like setting up an ELB plus one last step to create something like a URL map? Or are all the resources in the pipeline completely different (as is the case on GCE)?

An ALB is very similar to an ELB, it seems; I didn't have to set up any URL mapping when I did it through the console. Unfortunately it is a different API, so I'll have to modify feed to use it.
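
For a rough idea of how different that API is, here is a hedged sketch using aws-sdk-go's elbv2 package to register an instance with an existing ALB target group; the region, target group ARN, and instance ID are placeholders, and this is not feed's actual implementation:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-west-1")}))
	svc := elbv2.New(sess)

	// ALBs route to target groups rather than directly to instances,
	// which is the main structural difference from classic ELBs.
	_, err := svc.RegisterTargets(&elbv2.RegisterTargetsInput{
		TargetGroupArn: aws.String("arn:aws:elasticloadbalancing:...:targetgroup/ingress/abc123"), // placeholder
		Targets: []*elbv2.TargetDescription{
			{Id: aws.String("i-0123456789abcdef0")}, // placeholder instance ID
		},
	})
	if err != nil {
		fmt.Println("register targets:", err)
		return
	}
	fmt.Println("registered ingress instance with ALB target group")
}
```

So besides the new API (elbv2 instead of elb), the controller also has to manage the extra target group and listener rule resources that ALBs introduce.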

If you find a way to change nginx's behaviour, please let me know. I spent a lot of time looking at config options and its source code, and it seems to do as I describe - it closes the socket immediately.

Apparently you can improve things by issuing a USR2 followed by WINCH, which will actually wait for connections to drain, but USR2 doesn't seem to work in containers.
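
For completeness, a hedged sketch of that USR2 + WINCH sequence, assuming nginx's default pidfile locations (on USR2, nginx renames the old master's pidfile to nginx.pid.oldbin); as noted above, this approach has problems inside containers:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
	"time"
)

// signalFromPidFile sends sig to the process whose pid is stored in pidFile.
func signalFromPidFile(pidFile string, sig syscall.Signal) error {
	data, err := os.ReadFile(pidFile)
	if err != nil {
		return err
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return err
	}
	return syscall.Kill(pid, sig)
}

func main() {
	// USR2 starts a new master (and new workers) alongside the old one.
	if err := signalFromPidFile("/var/run/nginx.pid", syscall.SIGUSR2); err != nil {
		fmt.Fprintln(os.Stderr, "USR2:", err)
		os.Exit(1)
	}
	time.Sleep(2 * time.Second) // crude wait for the new master to come up

	// WINCH asks the old master's workers to exit gracefully once their
	// connections have drained; the old master's pid is now in nginx.pid.oldbin.
	if err := signalFromPidFile("/var/run/nginx.pid.oldbin", syscall.SIGWINCH); err != nil {
		fmt.Fprintln(os.Stderr, "WINCH:", err)
		os.Exit(1)
	}
}
```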

I think the ALB is layer 7 only, so you couldn't do TCP load balancing with it.

I only tested with a layer 7 ELB, so for me the ALB was a drop-in replacement.

Ideally creating an Ingress on AWS should just work, no surprises, which apparently isn't the case right now since people are shoving a Service of Type=LB/ELB in front.

What is the alternative?

An ingress controller that spins up an ALB, like the GCE one? You can also tier them, so you have ALB -> nginx ingress. Or maybe we can find a way to set up a catch-all ALB instead of an ELB via annotations on the Service (not sure how similar they are in setup).

Just an update:

  1. We've pretty much given up on nginx as an ingress proxy. It's very difficult to get seamless reloads in nginx - it will frequently drop requests in proportion to how often you reload. After much research and spiking, we've come to the conclusion that haproxy is the superior, and only, viable solution for use as k8s ingress. We can reload many times a second without any dropped requests and no noticeable performance problems, unlike nginx. With https://www.haproxy.com/blog/truly-seamless-reloads-with-haproxy-no-more-hacks/ we don't even need to hack around the small issues in haproxy reloads.
  2. ALBs are also out, because we discovered that ALBs don't support connection draining. That means there is no way to do a zero-downtime redeployment of the ingress component. It's a huge shortcoming in ALBs, and we had to work with AWS before we got to the bottom of the problem. It isn't mentioned in the AWS docs - they told us they've raised an internal ticket to update the documentation. In the meantime, we will probably go back to ELBs. ELB + haproxy should give us truly zero downtime in the face of even heavy ingress updates.

Update #2 - as ingress doesn't reload very often, and the percentage of dropped requests is quite low, we haven't prioritised haproxy work. Otherwise the nginx proxy works great.