kubernetes-retired / contrib

[EOL] This is a place for various components in the Kubernetes ecosystem that aren't part of the Kubernetes core.

[ingress/controllers/nginx] Use Service Virtual IP instead of maintaining Pod list

edouardKaiser opened this issue · comments

Is there a way to tell the NGINX Ingress controller to use the Service Virtual IP address instead of maintaining the Pod IP addresses in the upstream configuration?

I couldn't find it. If not, I think it would be good, because with the current situation, when we scale down a service, the Ingress controller does not work in harmony with the Replication Controller of the service.

That means some requests to the Ingress Controller will fail while waiting for the Ingress Controller to be updated.

If we use the Service Virtual IP address, we can let kube-proxy do its job in harmony with the replication controller, and we get seamless downscaling.

I guess it has been implemented that way for session stickiness. But for applications that don't need that, it could be a good option.
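
(For context, a minimal sketch of the kind of setup in question; all names and ports are illustrative, not taken from the original report. The Ingress backend references a Service, and the controller can either resolve that Service to its VIP or expand it to the pod Endpoints.)

  apiVersion: v1
  kind: Service
  metadata:
    name: my-app                  # illustrative name
  spec:
    selector:
      app: my-app
    ports:
    - port: 80
      targetPort: 8080
  ---
  apiVersion: extensions/v1beta1  # Ingress API group of that era
  kind: Ingress
  metadata:
    name: my-app
  spec:
    rules:
    - host: my-app.example.com
      http:
        paths:
        - path: /
          backend:
            serviceName: my-app
            servicePort: 80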

when we scale down a service, the Ingress controller does not work in harmony with the Replication Controller of the service.

What do you mean?
After a change in the number of replicas in an rc, it takes a couple of seconds to receive the update from the API server.

In the case of scaling down the number of replicas, you need to tune the upstream check defaults (see the docs).

Besides this, I'm testing the lua-upstream-nginx-module to avoid reloads and be able to add/remove servers in an upstream.
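
(For reference, the upstream check defaults mentioned above are tuned through the controller's ConfigMap. A rough sketch; the ConfigMap name and the exact key names are assumptions and may differ between controller versions, so double-check them against the linked docs:)

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: nginx-load-balancer-conf   # assumption: whatever ConfigMap the controller is started with
  data:
    upstream-max-fails: "0"          # maps to nginx max_fails for each upstream server
    upstream-fail-timeout: "0"       # maps to nginx fail_timeout for each upstream server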

Ok, I'll try to explain with another example:

When you update a Deployment resource (like changing the Docker image), depending on your configuration (rollingUpdate strategy, max surge, max unavailable), the Deployment controller will bring down some pods and create new ones. All of this happens in a way that causes no downtime, if you use the Service VIP to communicate with the pods.

Because, first, when it wants to kill a pod, it removes the pod IP address from the service to avoid any new connections, and it follows the termination grace period of the pod to drain the existing connections. Meanwhile, it also creates a new pod with the new Docker image, waits for the pod to be ready, and adds the pod behind the service VIP.

By maintaining the pod list yourself in the Ingress Controller, at a certain point during a Deployment resource update some requests will be redirected to pods which are shutting down, because the Ingress Controller does not know a RollingUpdate Deployment is happening. It will know maybe one second later. But for services with a lot of connections/sec, that's potentially a lot of failing requests.

I personally don't want to tune the upstream to handle this scenario. Kubernetes is already doing an amazing job updating pods with no downtime, but only if you use the Service VIP.

Did I miss something? If it's still not clear, or there is something I'm clearly not understanding, please don't hesitate to say so.
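
(As a concrete illustration of the rollingUpdate knobs mentioned above; names, image, and numbers are made up:)

  apiVersion: extensions/v1beta1    # Deployments lived in this API group at the time
  kind: Deployment
  metadata:
    name: my-app
  spec:
    replicas: 3
    strategy:
      type: RollingUpdate
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 0           # never remove a pod before its replacement is Ready
    template:
      metadata:
        labels:
          app: my-app
      spec:
        containers:
        - name: my-app
          image: myorg/my-app:v2    # changing this triggers the rolling update described above
          ports:
          - containerPort: 8080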

The NGINX Ingress Controller (https://github.com/nginxinc/kubernetes-ingress) used to use the service VIP, but they recently changed to a system like yours (a pod list in the upstream).

Before they changed this behaviour, I did some tests. I was continuously spamming requests to the Ingress Controller (5/sec). Meanwhile, I updated the Deployment resource backing those requests (new Docker image):

  • Kubernetes/Contrib: you can clearly see some requests failing at the time of the update
  • NGINX Inc controller: it looks like nothing happened behind the scenes, a perfect deployment with no downtime (all because Kubernetes does a great job behind the scenes)

@edouardKaiser how are you testing this? Are the requests GETs or POSTs? Can you provide a description of the testing scenario?

I personally don't want to tune the upstream to handle this scenario.

I understand that, but your request contradicts what other users have requested (control over the upstream checks). It's hard to find a balance in the configuration that satisfies all the user scenarios.

I understand some people might want to tweak the upstream configuration, but on the other hand Kubernetes does a better job at managing deployments without downtime thanks to the concept of communicating with pods through the Service VIP.

To reproduce, I just used the Chrome app Postman and its Runner feature (you can specify some requests to run against a particular endpoint, with a number of iterations, a delay, and so on). While the runner was running, I updated the Deployment resource and watched the behaviour of the runner.

When a GET request fails, NGINX automatically passes the request to the next server. But for non-idempotent methods like POST it does not (and I think that's the right behaviour), and then we have failures.

But for non-idempotent methods like POST it does not

This is a documented scenario: https://github.com/kubernetes/contrib/tree/master/ingress/controllers/nginx#retries-in-no-idempotent-methods
NGINX changed this behavior in 1.9.13.

Please add the option retry-non-idempotent=true in the nginx configmap to restore the old behavior
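(For anyone landing here later, that option goes into the controller's ConfigMap, roughly like this; the ConfigMap name below is an assumption and must match whatever ConfigMap the controller is configured to watch:)

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: nginx-load-balancer-conf   # assumption: the ConfigMap the controller watches
  data:
    retry-non-idempotent: "true"     # retry POST/PUT/etc. on the next upstream again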

But it doesn't change the root of the problem: the Ingress Controller and the Deployment Controller don't work together.

Your pod might have accepted the connection and started to process it, but what the Ingress Controller does not know is that this pod is going to get killed the next second by the Deployment Controller.

  • For the Deployment Controller it's fine: it did its job, removed the pod from the service, and waited for the termination grace period.
  • For the Ingress Controller it's not fine: your connection will suddenly be aborted because the pod died. If you're lucky, the pod got removed by the Ingress Controller before any request got in. If you're not, you'll experience some failures. If it's a POST request, most of the time you really don't want to retry. If it's a GET request but the pod gets killed in the middle of transferring a response, NGINX won't retry either.

I know this is not a perfect world and we need to embrace failure, but here we have a way to potentially avoid that failure by using the Service VIP.

I'm not saying it should be the default behaviour, but an option to use Service VIP instead of Pod endpoint would be awesome.

I'm with @edouardKaiser because:

  • You (as a DevOps or operations person) cannot guarantee that the application developer will follow best practices regarding keeping or dropping connections when the app receives a SIGTERM, for example. However, if services were used, the responsibility for site reliability would fall completely on a concrete person/team.
  • Although retry-non-idempotent can mitigate some problems, others could arise, and choosing to retry those requests is not an option in many cases.

IMO the controller should expose a parameter or something to choose between final endpoints and services; that would cover all the use cases.

I couldn't have explained it better. Thanks @glerchundi

If you go through the service VIP you can't ever do session affinity. It
also incurs some overhead, such as conntrack entries for iptables DNAT. I
think this is not ideal.

To answer the questions about "coordination" this is what readiness and
grace period are for. What is supposed to happen is:

  1. RC creates 5 pods A, B, C, D, E
  2. All 5 pods become ready
  3. Endpoints controller adds all 5 pods to the Endpoints structure
  4. Ingress controller sees the Endpoints update
  5. ... serving ...
  6. RC deletes 2 pods (scaling down)
  7. Pods D and E are marked unready
  8. Kubelet notifies pods D and E
  9. Endpoints controller sees the readiness change and removes D and E from Endpoints
  10. Ingress controller sees the Endpoints update and removes D and E
  11. Termination grace period ends
  12. Kubelet kills pods D and E

It is possible that your ingress controller falls so far behind that the grace period expires before it has a chance to remove endpoints, but that is the nature of distributed systems. It's equally possible that kube-proxy falls behind; they literally use the same API.

I do understand this is not ideal for everyone, this is why I was talking about an option for this behaviour.

But I don't see how it is better for anyone? It buys you literally nothing, but you lose the potential for affinity and incur a performance hit for no actual increase in stability.

Correct me if I'm wrong, I probably misunderstood something in the pod termination flow:

When scaling down, the pod is removed from the endpoints list for the service and, at the same time, a TERM signal is sent.

So, for me, at this exact moment, there is an open window: potentially, this pod (which is shutting down gracefully) might still get some requests forwarded by the nginx ingress-controller (just the time it takes for the ingress-controller to notice the changes, update and reload the conf).

Correct me if I'm wrong, I probably misunderstood something in the pod termination flow:

When scaling down, the pod is removed from the endpoints list for the service and, at the same time, a TERM signal is sent.

Pedantically, "at the same time" has no real meaning. It happens asynchronously: it might happen before, after, or at the same time.

So, for me, at this exact moment, there is an open window: potentially, this pod (which is shutting down gracefully) might still get some requests forwarded by the nginx ingress-controller (just the time it takes for the ingress-controller to notice the changes, update and reload the conf).

The pod can take as long as it needs to shut down. Typically
O(seconds) is sufficient time to finish or cleanly terminate open
connections and ensure no new connections arrive. So, for example,
you could request 1 minute grace, keep accepting connections for max(5
seconds since last connection, 30 seconds), drain any open
connections, and then terminate.

Note that the exact same thing can happen with the service VIP. kube-proxy is just an API watcher. It could happen that kube-proxy sees the pod delete after kubelet does, in which case it would still be routing service VIP traffic to the pod that had already been signalled. There's literally no difference. That's my main point :)
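
(The grace period described above is requested per pod via terminationGracePeriodSeconds; a minimal sketch with illustrative values:)

  apiVersion: v1
  kind: Pod
  metadata:
    name: my-app
  spec:
    terminationGracePeriodSeconds: 60   # kubelet waits this long after SIGTERM before sending SIGKILL
    containers:
    - name: my-app
      image: myorg/my-app:v1            # illustrative image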

True, "at the same time" doesn't mean that much here, it's more like those operations are triggered in parallel.

I wanted to point out that possibility because I ran some tests before opening this issue (continuously sending requests to an endpoint backed by multiple pods while scaling down). When the ingress-controller was using the VIP, the downscaling happened more smoothly (no failures, no requests passed to the next server by nginx), contrary to when the ingress-controller maintains the endpoint list (I noticed some requests failing during that short time window and being passed to the next server for GET, PUT, etc.).

I'm surprised the same thing can happen with the service VIP. I assumed the kubelet would start the shutdown only once the pod was removed from the iptables entries, but I was wrong.

So your point is: I got lucky during my tests, because depending on the timing, I might have ended up with the same situation even with the Service VIP.

I'm surprised the same thing can happen with the service VIP. I assumed the kubelet would start the shutdown only once the pod was removed from the iptables entries, but I was wrong.

Nope. kube-proxy is replaceable, so we can't really couple except to the API.

So your point is: I got lucky during my tests, because depending on the timing, I might have ended up with the same situation even with the Service VIP.

I'd say you got UNlucky - it's always better to see the errors :)

If termination doesn't work as I described (roughly; I may have gotten some details wrong), we should figure out why.

Thanks for the explanation Tim, I guess I can close this one.

Not to impose too much, but since this is a rather frequent topic, I wonder
if you want to write a doc or an example or something? A way to
demonstrate the end-to-end config for this? I've been meaning to do it,
but it means so much more when non-core-team people document stuff (less
bad assumptions :).

I'll send you a tshirt...

Happy to write something.

Were you thinking about updating the README of the Ingress Controllers (https://github.com/kubernetes/contrib/tree/master/ingress/controllers/nginx)?

We could add a new paragraph about the choice of using the endpoint list instead of the service VIP (advantages like upstream tuning, session affinity, etc.) and show that there is no guarantee of synchronisation even when using the service VIP.

@thockin thanks for the explanation, it's crystal clear now.

I'm glad I have a better understanding of how it works; it makes sense if you think about kube-proxy as just an API watcher.

But to be honest, now I'm kind of stuck. Some of our applications don't handle SIGTERM very well (no graceful stop). Even if the application is in the middle of a request, a SIGTERM shuts the app down immediately.

Using Kubernetes, I'm not sure how to deploy without downtime now. My initial understanding was this flow when scaling down or deploying a new version:

  1. Remove the pod from the endpoint list
  2. Wait for the terminationGracePeriod (to wait for any request in progress to finish)
  3. Then shutdown with SIGTERM

We need to rethink how we deploy, or see if we can adapt our applications to handle SIGTERM.

wrt writing something, I was thinking a doc or a blog post or even an
example with yaml and a README

You also have labels at your disposal.

If you make your Service select app=myapp,active=true, then you can start
all your pods with that set of labels. When you want to do your own
termination, you can remove the active=true label from the pod, which
will update the Endpoints object, and that will stop sending traffic. Wait
however long you think you need, then delete the pod.

Or you could teach your apps to handle SIGTERM.

Or you could make an argument for a configurable signal rather than SIGTERM
(if you can make a good argument)

Or ... ? other ideas welcome
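
(A rough sketch of the label-based approach, with all names illustrative: the Service selects on an extra active label, and removing that label from a pod, e.g. with kubectl label pod my-app-1 active-, drops it from Endpoints without touching the pod itself.)

  apiVersion: v1
  kind: Service
  metadata:
    name: my-app
  spec:
    selector:
      app: my-app
      active: "true"      # pods drop out of Endpoints as soon as this label is removed
    ports:
    - port: 80
      targetPort: 8080
  ---
  apiVersion: v1
  kind: Pod
  metadata:
    name: my-app-1
    labels:
      app: my-app
      active: "true"      # remove this label to drain, wait, then delete the pod
  spec:
    containers:
    - name: my-app
      image: myorg/my-app:v1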

Thanks for the advice, I tend to forget how powerful labels can be.

Regarding writing something, I can write a blog post to explain why using an endpoint list is better. But I'm not sure what kind of example (YAML) you are talking about.

I guess there's not much YAML to write up. :) I just want to see something
that I can point the next person who asks this at and say "read this"

No worries Tim, I'll keep you posted.

Fantastic!!

I just created this blog entry:

http://onelineatatime.io/ingress-controller-forget-about-the-service/

I hope it will help some people. Feel free to tell me if there is anything wrong, anything that I could do to improve this entry.

Cheers,

Great post!!

Small nit:

1. Replication Controller deletes 1 pod
2. Pod is marked unready and shows up as Terminating
3. TERM signal is sent
4. Pod is removed from endpoints
5. Pod receives SIGKILL after grace period
6. Kube-proxy detects the change of the endpoints and update iptables

should probably be:

1. Replication Controller deletes 1 pod
2. Pod is marked as Terminating
3. Kubelet observes that change and sends SIGTERM
4. Endpoint controller observes the change and removes the pod from Endpoints
5. Kube-proxy observes the Endpoints change and updates iptables
6. Pod receives SIGKILL after grace period

3 and 4 happen roughly in parallel. 3 and 5 are async to each other, so it's just as likely that 5 happens first as vice versa. Your Ingress controller would be step 4.1: "Ingress controller observes the change and updates the proxy" :)

Thanks Tim, I will update it!

If you make your Service select app=myapp,active=true, then you can start all your pods with that set of labels. When you want to do your own termination, you can remove the active=true label from the pod, which will update the Endpoints object, and that will stop sending traffic. Wait
however long you think you need, then delete the pod.

I was wondering if the above approach could potentially be built into Kubernetes directly. The benefit I see is that people won't need to create custom routines which effectively bypass all standard tooling (e.g., kubectl scale / delete).

If labels aren't the right thing for this case, I could also think of a more low-level implementation: introduce a new state called Deactivating that precedes Terminating and serves as a trigger for the Endpoint controller to remove a pod from rotation. After (yet another) configurable grace period, the state would switch to Terminating and cause the kubelet to SIGTERM the pod as usual.

@thockin would that be something worth pursuing, or rather out of the question?

I'm very wary of adding another way of doing the same thing as a core
feature. For the most part, graceful termination should do the right thing
for most people.

I could maybe see extending DeploymentStrategy to offer blue-green rather
than rolling, but that's not really this.

@thockin If I understand correctly, the way to allow for a non-interruptive transition using graceful termination is to have a SIGTERM handler that (in the most simplistic use case) just delays termination for a safe amount of time.

Is there a way to reuse such a handler across various applications, possibly through a sidecar container? Otherwise, I see the problem that the handler must be implemented and integrated for each and every application (at least per language/technology) over and over again. For third-party applications, it may even be impossible to embed a handler directly.

@thockin If I understand correctly, the way to allow for a non-interruptive transition using graceful termination is to have a SIGTERM handler that (in the most simplistic use case) just delays termination for a safe amount of time.

There's no use handling SIGTERM from a sidecar if the main app dies upon receiving it. It doesn't "just" delay: it notifies the app that its end-of-life is near and that it should wrap up and exit soon, or otherwise be prepared.

Is there a way to reuse such a handler across various applications, possibly through a sidecar container? Otherwise, I see the problem that the handler must be implemented and integrated for each and every application (at least per language/technology) over and over again. For third-party applications, it may even be impossible to embed a handler directly.

The problem is that "handling" SIGTERM is really app-specific. Even
if you just catch it and ignore it, that's a decision we shouldn't
make for you.

Now, we have a proposal in flight for more more generalized
notifications, including HTTP, so maybe we can eventually say that,
rather than SIGTERM being hardcoded, that is merely the default
handler, but someone could override that. But that spec is not fully
formed yet, and I don't want to build on it just yet.

I'm open to ideas, but I don't see a clean way to handle this. Maybe
a pod-level field that says "don't send me SIGTERM, but pretend you
did" ? That's sort of ugly..

What I meant by delaying termination is that a custom SIGTERM handler could keep the container alive (i.e., time.Sleep(reasonablePeriod)) long enough until the point in time where it's safe to believe that the endpoint controller has taken the pod out of rotation so that requests won't hit an already dead pod. I don't think this is an ideal approach for reasons I have mentioned -- my assumption was that this is what you meant when you said "graceful termination [as a core feature] should do the right thing for most people". Maybe I misunderstood you; if so, I'd be glad for some clarification.

To repeat my intention: I'm looking for the best way to prevent request drops when scaling events / rolling-upgrades occur (as the OP described) without straying too far away from what standard tooling (namely kubectl) gives. My (naive) assessment is that the Kubernetes control plane is best suited for doing the necessary coordinative effort.

Do you possibly have any issue/PR numbers to share as far as that generalized notification proposal is concerned?

You should fail your readiness probe when you receive a SIGTERM. The nginx controller will health check endpoint readiness every 1s and avoid sending requests. Set the termination grace period to something high and keep nginx (or whatever webserver you're running in your endpoint pod) alive till existing connections drain. Is this enough? (I haven't read through the previous conversation, so apologies if this was already rejected as a solution.)

It sounds like what you're really asking for is to use the service VIP in the nginx config and cut out the race condition that springs from going kubelet readiness -> apiserver -> endpoints -> kube-proxy. We've discussed various ways to achieve this (kubernetes/kubernetes#28442), but right now the easiest way is to health check endpoints from the ingress controller.
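
(A minimal sketch of the approach described here; the path, port, and timings are assumptions: fail the readiness probe once SIGTERM arrives and give the pod a generous grace period so existing connections can drain.)

  apiVersion: v1
  kind: Pod
  metadata:
    name: my-app
  spec:
    terminationGracePeriodSeconds: 120   # plenty of time to drain open connections
    containers:
    - name: my-app
      image: myorg/my-app:v1
      readinessProbe:
        httpGet:
          path: /healthz                 # the app starts failing this after it catches SIGTERM
          port: 8080
        periodSeconds: 1                 # probed frequently so unreadiness is noticed quickly
        failureThreshold: 1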

What I meant by delaying termination is that a custom SIGTERM handler could keep the container alive (i.e., time.Sleep(reasonablePeriod)) long enough until the point in time where it's safe to believe that the endpoint controller has taken the pod out of rotation so that requests won't hit an already dead pod. I don't think this is an ideal approach for reasons I have mentioned -- my assumption was that this is what you meant when you said "graceful termination [as a core feature] should do the right thing for most people". Maybe I misunderstood you; if so, I'd be glad for some clarification.

I'm a little confused, I guess. What you're describing IS the core
functionality. When a pod is deleted, we notify it and wait at least
grace-period seconds before killing it. During that time window (30
seconds by default), the Pod is considered "terminating" and will be
removed from any Services by a controller (async). The Pod itself
only needs to catch the SIGTERM, and start failing any readiness
probes. Assuming nothing is totally borked in the cluster, the
load-balancers should stop sending traffic and the pod will be OK to
terminate within the grace period. This is, truthfully, a race and a
bit of wishful thinking. If something is borked in the cluster, it
is possible that load-balancers don't remove pods "in time" and when
the pod dies it kills live connections.

The alternative is that we never kill a pod while any service has the
pod in its LB set. In the event of brokenness we trade a hung
rolling update for the above-described early termination. Checking
which Services a pod is in is hard, and we just don't do that today.
Besides that, it's an unbounded problem. Today it is Services, but it
is also Ingresses. But ingresses are somewhat opaque in this regard,
so we can't actually check. And there may be arbitrary other
frontends to a Pod. I don't think the problem is solvable this way.

So you're saying that waiting some amount of time is not "ideal", and
I am agreeing. But I think it is less bad than the other solutions.

The conversation turned to "but my app doesn't handle SIGTERM", to which I proposed a hacky labels-based solution. It probably works, but it is just mimicking graceful termination.

To repeat my intention: I'm looking for the best way to prevent request drops when scaling events / rolling-upgrades occur (as the OP described) without straying too far away from what standard tooling (namely kubectl) gives. My (naive) assessment is that the Kubernetes control plane is best suited for doing the necessary coordinative effort.

Graceful termination. This is the kubernetes control plane doing the
coordinative effort. "wait some time" is never a satisfying answer,
but in practice it is often sufficient. In this case, the failures
that would cause "wait" to misbehave probably cause worse failures if
you try to close the loop.

Do you have any issue/PR numbers to share as far as that generalized notification proposal is concerned?

kubernetes/kubernetes#26884

@thockin @bprashanth sorry for not getting back on this one earlier. I did intend to follow up on your responses.

First, thanks for providing more details to the matter.

I'm fine with accepting the fact that graceful termination involves some time-based behavior which also provides a means to set upper bounds in case things start to break. My concerns are more about the happy path and the circumstance that presumably a lot of applications running on Kubernetes will have no particular graceful termination needs but still want the necessary coordination between the shutdown procedure and load-balancing adjustments to take place. As discussed, these applications need to go through the process of implementing a signal handler just to switch off the readiness probe deliberately.

To add a bit of motivation on my end: we plan to migrate a number of applications to Kubernetes where the vast majority of them serve short-lived requests only and have no particular needs with regard to graceful termination. When we want to take instances down in our current infrastructure, we just remove them from LB rotation and make sure in-flight requests are given enough time to finish. Moving to Kubernetes, we'll have to ask every application owner to implement and test a custom signal handler, and in the case of closed third-party applications resort to complicated workarounds with workflows/tooling separate from the existing standards. My impression is that this represents an undesirable coupling between the applications running on Kubernetes and an implementation detail of the load-balancing/routing part of the cluster manager.

That's why I think having a separate mechanism exclusively implemented in the control plane could contribute to simplifying running applications on Kubernetes by removing some of the lifecycle management boilerplate. To elaborate a bit on my previous idea: Instead of making each application fail its readiness probe, make Kubernetes do that "externally" and forcefully once it has decided to take a pod down, and add another grace period (possibly specified with the readiness probe) to give the system sufficient time for the change in readiness to propagate. This way, custom signal handlers for the purpose of graceful termination become limited in scope to just that: Giving an application the opportunity to execute any application-specific logic necessary to shut down cleanly, while all the load balancing coordination happens outside and up front. I'm naively hopeful that by reusing existing primitives of Kubernetes like readiness probes, timeouts, and adjustments to load-balancing, we can avoid dealing with the kind of hard problems that you have mentioned (checking which services a pod is in, unbound number of service frontends).

I'm wondering if it might be helpful to open a separate proposal issue and discuss some of these things in more detail. Please let me know if you think it's worthwhile carrying on.

Thanks.

Sorry for pinging an old thread, but I'm struggling a tad to find concrete answers in the core docs on this, and this is the best thread I've found so far explaining what's going on... so can I clarify a few things in relation to termination? If this should be posted somewhere else, please let me know.

A: The termination cycle:

  1. Before a pod is terminated, it enters a "terminating" phase that lasts the duration of a configurable grace period (which by default is 30 seconds). Once this period concludes, it enters a "terminated" phase while resources are cleaned up and is then eventually deleted (or does it just get deleted after the terminating phase?).
  2. As soon as the pod enters the "terminating" phase, each container is sent a SIGTERM or a custom command (if the preStop lifecycle hook is configured on a container), and at the same time the pod immediately advertises an "unready" state.
  3. The rules in 1 and 2 are followed regardless of the reason for termination, i.e. if the pod is exceeding its max memory/CPU usage, the node is OOM, etc. In no case will the Pod be SIGKILL'd without first entering the "terminating" phase with the grace period.
  4. Services, ingress etc. will see the change of the pod to the "unready" state via a subscription to the state store and start removing the pod from their own configurations. Because this entire process is async, the pod may still get some additional traffic for a few seconds after it has received a SIGTERM.
  5. An ingress and/or service will not by default (and cannot otherwise be configured to) retry its request with another pod if it receives some kind of pod-is-terminating (or other) status code or response.

B: Handling the termination

  • If you want 100% throughput, with no dropped traffic it is not recommended that containers actually terminate themselves after being sent a SIGTERM command. Instead, they should clean up what they can and stick around to handle any remaining requests that may trickle through until the grace period expires or at least for the period of time that you guess it would take for services / ingress etc. to update their configurations.
  • If you're happy to drop some traffic, you can instead actually terminate your processes / containers when you get your SIGTERM signal or have them start failing their liveness probes. Even if marked as "essential", containers will not be restarted if the pod is in a terminating phase. If all containers stop prior to the grace period expiring, the pod immediately enters the "terminated" phase.

@ababkov From what I can see, your description is pretty much correct. (Others are needed to fully judge though.)

Two remarks:

Re: A.3.: I'd expect an OOM'ing container to receive a SIGKILL right away in the conventional Linux sense. For sure it exits with a 137 code, which traditionally represents a fatal error yielding signal n where n = 137 - 128 = 9 = SIGKILL.
Re: A.5.: Ingress sends traffic to the endpoints directly, which means that in principle it's possible for it to retry requests to other pods. Whether that's an actual default or something that can be configured depends on the concrete Ingress controller used. AFAIK, both Nginx and Traefik support retrying.
As far as Services are concerned, you need to distinguish between the two available modes: In userspace mode, requests can be retried. In iptables mode (the current default), they cannot. (There's a ticket to try out IPVS as a third option which, as far as I understood, would bring the benefits of the two modes together: high performance while also being able to retry requests.)

Here's a recommendable read to better understand Ingresses and Services: http://containerops.org/2017/01/30/kubernetes-services-and-ingress-under-x-ray/

Thanks very much for your reply @timoreimann - re A.5, I'll watch the IPVS item. Also, the post you linked is really good - it helped me get a better understanding of kube-proxy - had I not spent days gradually coming to the realisation of how services and ingress work, it probably would have helped with that as well.

Re A.3 - is your explanation based on a pod that's gone over its memory allocation, or a node that is out of memory and killing pods so it can continue running? An immediate SIGKILL might be a little frustrating if you're trying to ensure your apps have a healthy shutdown phase.

If I could get a few more opinions on my post above from one or two others, and/or some links to relevant documentation where you think these scenarios are covered in detail (understanding that I have done quite a lot of research before coming here), that would be great.

I know I can just experiment with this myself and "see what happens", but if there's some way to shortcut that and learn from others and/or the core team, that would be awesome.

@ababkov re: A.3: Both: There's an OOM killer for the global case (node running out of memory) and the local one (memory cgroup exceeding its limit). See also this comment by @thockin on a Docker issue.

I think that if you run into a situation where the OOM killer has selected a target process, it's already too late for graceful termination: after all, the (global or local) system took this step in order to avoid failure on a greater scale. If "memory-offending" processes were given an additional opportunity to delay this measure arbitrarily, the consequences could be far worse.
I'd argue that it's better to prevent an OOM situation from occurring in the first place by monitoring/alerting on memory consumption continuously, so that you still have enough time to react.

While doing a bit of googling, I ran across kubelet soft eviction thresholds. With these, you can define an upper threshold and tell kubelet to shut down pods gracefully in time before a hard limit is reached. From what I can tell though, the eviction policies operate on a node level, so it won't help in the case where a single pod exceeds its memory limit.
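
(For reference, soft eviction is configured on the kubelet itself; the values below are purely illustrative, and the exact flag set should be checked against the kubelet docs for your version:)

  # kubelet flags (illustrative values)
  --eviction-soft=memory.available<500Mi              # soft threshold that triggers graceful eviction
  --eviction-soft-grace-period=memory.available=1m30s # how long the threshold must be exceeded
  --eviction-max-pod-grace-period=60                  # cap on the pod grace period used when evicting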

Again, chances are there's something I'm missing. Since this issue has been closed quite a while ago, my suggestion to hear a few more voices would be to create a new issue / mailing list thread / StackOverflow question. I'd be curious to hear what others have to say, so if you decide to follow my advice please leave a reference behind. :-)

@timoreimann the soft eviction thresholds add another piece to the puzzle - thanks for linking.

Happy to open another topic - I'm still new to the project, but I presume I'd open this in this repo in particular?

Topic would be along the lines of trying to get some clarity in place around the nature of the termination lifecycle in every possible scenario that a pod can be terminated.

@ababkov I'd say that if the final goal is to contribute the information you will gain back to Kubernetes project (supposedly in form of better documentation), an issue in the main kubernetes/kubernetes repo seems in order.

OTOH, if this is "only" about getting your questions answered, StackOverflow is probably the better place to ask.

Up to you. :-)

@timoreimann more than happy to contribute - thanks for your help mate.

It is very odd to me that in order to not drop any traffic, SIGTERM should not actually terminate my app, but instead let it hang around for a bit (until the Ingress updates its configuration)? If I want actual 100% uptime during this period, it's not possible with default k8s? I really would rather not drop traffic if I can help it, and testing with ab definitely shows 502s with an nginx ingress controller.

I think this kind of issue should be prioritized. Otherwise I can try something like that label-based solution mentioned earlier, but then it feels like re-inventing the wheel and seems quite complex.

@thockin What I am saying is that after SIGTERM is sent, the Ingress still sends traffic to my dying pods for a few seconds, which then causes the end users to see 502 Gateway errors (using for example an Nginx ingress controller). A few people in this thread have mentioned something similar. I don't know of any workarounds, or how to implement that "labels" hack mentioned earlier.

How do I get a zero-downtime deploy?

Step 3. Kubelet sends SIGTERM to the Pod. Let's say the pod is running Gunicorn, which, upon receiving a SIGTERM, stops receiving new connections, and gracefully stops; finishing its current requests up until a 30-second timeout.

During this time, since all of 3 is async, Ingress is still sending some traffic to this Gunicorn, which is now refusing all connections. The nginx Ingress controller I am using then returns a 502 Bad Gateway.

So then, what is the point of the SIGTERM? My preStop could literally just have a bash -c sleep 60?

I'm not sure how to make it so that gunicorn basically ignores the SIGTERM. I set up a hack with a readinessProbe that just tries to cat a file, and in the preStop, I delete the file and sleep 20 seconds. I can see that the pod sticks around for a while, but I still drop some requests. Should I sleep longer? Would it have to do with an Http Keep-Alive? I'm kind of lost as to the best way to implement this.

I added a simple preStop hook:

        lifecycle:
          preStop:
            exec:
              # A hack to try to get 0% downtime during deploys. This should 
              # help ensure k8s eventually stops giving this node traffic.
              command: ["sh", "-c", "sleep 75"]

I watch my pods when I do an apply and deploy a new container version. Since I have maxUnavailable: 0 my pod doesn't start terminating until another pod is ready. Then, it starts terminating and the countdown timer above starts. Around this time is when I get the 502s:

192.168.64.1 - - [19/Mar/2017:12:54:56 +0000] "POST /healthz/ HTTP/1.0" 200 8 "-" "ApacheBench/2.3" "-"
2017/03/19 12:54:56 [notice] 1321#1321: signal process started
2017/03/19 12:54:56 [error] 1322#1322: *127179 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.64.1, server: vm.example.org, request: "POST /healthz/ HTTP/1.0", upstream: "http://172.17.0.4:8000/healthz/", host: "vm.example.org"
192.168.64.1 - - [19/Mar/2017:12:54:56 +0000] "POST /healthz/ HTTP/1.0" 502 173 "-" "ApacheBench/2.3" "-"
192.168.64.1 - - [19/Mar/2017:12:54:56 +0000] "POST /healthz/ HTTP/1.0" 200 8 "-" "ApacheBench/2.3" "-"

How come I get a connection refused? My container didn't get a sigterm until 75 seconds later (I saw that being logged), which was way after these 502s.

By the way, I think I figured out how to do this; thanks for your help in showing me the rough sequence of events. I think what was happening is that between the new container coming up and it being ready to accept requests, a small amount of time passes, and nginx was directing requests to the new container too early. I added a hack. My readinessProbe now looks like:

        readinessProbe:
          exec:
            command:
            - cat
            - /opt/myapp/test_requirements.txt
          successThreshold: 1
          failureThreshold: 2
          periodSeconds: 5
          initialDelaySeconds: 5   # Important so that the container doesn't get traffic too soon

and the preStop looks like:

        lifecycle:
          preStop:
            exec:
              # A hack to try to get 0% downtime during deploys. This should 
              # help ensure k8s eventually stops giving this node traffic.
              command: ["sh", "-c", "rm /opt/myapp/test_requirements.txt && sleep 20"]

This can probably be tweaked a bit; for example, I should be using a real readinessProbe that hits my API, and then on the preStop, run a script to tell my app to start failing the probe, rather than deleting a file. The sleep is required too, so that the old container doesn't die too quickly before the Ingress stops directing traffic to it. The numbers are probably too high but this works, I ran a bunch of tests, and no dropped requests or "connection refused" logs in nginx, which means I think it's doing the right thing. Thanks!
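
(For completeness, the "real probe" variant described above might look roughly like this; the /healthz path and the /opt/myapp/shutdown.sh script are hypothetical placeholders:)

        readinessProbe:
          httpGet:
            path: /healthz            # probe the API itself instead of cat'ing a file
            port: 8000
          periodSeconds: 5
          failureThreshold: 2
          initialDelaySeconds: 5
        lifecycle:
          preStop:
            exec:
              # hypothetical script that tells the app to start failing /healthz,
              # then waits for the Ingress to stop sending traffic before exiting
              command: ["sh", "-c", "/opt/myapp/shutdown.sh && sleep 20"]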

What's wrong with the current process:
The main issue is that SIGTERM basically means... finish what you're doing, don't start any new work and gracefully exit.

At the moment it's instead: finish what you're doing, continue accepting and processing new work for 0-10 seconds (or however long the load balancers take to update), and somehow work out when an appropriate time might be for you to gracefully exit before you're sent a SIGKILL.

I think the ideal situation here would be to:

  1. have each load balancing service register itself as listening to a container.
  2. modify the termination flow to first put the containers into an unready state.
  3. after each load balancing service sees the unready state of the containers and removes the containers from its pool, it should also remove its "I'm listening to this container" marker.
  4. When all markers are gone (or after some max configurable amount of time), continue with the termination cycle as per usual.

All of that being said, the above is probably overly intricate and not a great idea...
Perhaps the best we can do is:

  1. Educate people better about how this works in the documentation (it really needs a page that diagrams all the different ways a container can be terminated, how termination state flows and potential repercussions).
  2. Add something to the config to allow for the equivalent of what @domino14 is doing above: delaying the SIGTERM + subsequent SIGKILL by a load balancer grace period. E.g. go into an immediate unready state 10 seconds before the SIGTERM is attempted and then continue the lifecycle as expected.

@ababkov I think your proposed ideal is very similar to what I described above a few months ago. Effectively, it's request draining, and I believe it's not too uncommon to have that with other networking components. I'd still think that an implementation could primarily reuse already existing concepts (state transitions/observations, LB rotation changes, timers, etc.) and be fairly loosely coupled. It would be "yet another waiting layer" on top (or in place) of the existing SIGTERM logic.

We are currently transitioning 100+ services over to Kubernetes. Having to ask every service owner to provide the kind of SIGTERM procedure needed by Kubernetes is quite an effort, especially for 3rd party software which we need to wrap in custom scripts running pre-stop hooks.

At this stage, opening a separate feature request to discuss the matter seems worthwhile to me.

@timoreimann I agree though I'd ask that you take the lead on that given that you're in the process of transitioning and can probably more clearly explain the use case. I fully understand your current situation (we're about to be in the same position) and am happy to contribute.

The only thing I'm not fully sure of from a proposal standpoint is whether the feature should/can try to address anything beyond a preliminary "draining" status that lasts for a fixed, configurable period of time. The alternative, of course, would be implementing a solution where things that route traffic to the container register themselves as listeners (via state updates) and acknowledge (via state updates) when a container has gone into draining status. Once all acknowledgements have been logged, the container transitions to terminating status and everything proceeds as usual.

@timoreimann we have the same issue, we have to ask every service owner to implement a proper SIGTERM handler to make sure deployment is transparent to the users.

It's true that it would make things easier if the pod was just flagged as not ready first: give time to remove it from behind the service, drain requests, and then SIGTERM...

@ababkov I'd be happy to be the one who files the feature request.

My understanding is that any solution requiring some means of extended coordination will be much harder to push through. @thockin and other members of the community have expressed in the past that too much coupling is rather dangerous in a distributed system like Kubernetes, and it's the reason why Kubernetes was designed differently. Personally, I think that does make a lot of sense.

I think we will have time to delve into these and other implementation details on the new ticket. Will drop pointers here once I can find some time to file it (after I made sure no one else had put forward a similar request yet).

Thanks for your input!

@timoreimann based on that, I'd suggest that the request be logged as the simpler alternative: allowing a configurable amount of time during which the container sits in an unready/draining state prior to the remainder of the termination flow taking place (maybe with a reference to this conversation).

That alone would make everything 10 times clearer and more straightforward.

I think at the very least a few notes should be added in the documentation, as it's not clear that a special set of procedures is needed to get actual zero downtime.

@ababkov sounds good to me!

Yep, I did see the post and it was helpful in showing me what was going on behind the scenes, but it's still a bit tough to go from that to knowing we need delays in several places, etc, to get zero downtime deploys. That's why I suggested notes in the documentation.

In any case I'm a newcomer to K8S and I appreciate greatly the work done here. I'd submit a documentation patch if I knew the terminology better.

With regards to the original issue, could it be related to differences in connection strategies between the nginx ingress and kube-proxy? At least the userspace kube-proxy has a retry if dialling an endpoint fails: https://sourcegraph.com/github.com/kubernetes/kubernetes@v1.6.1/-/blob/pkg/proxy/userspace/proxysocket.go#L94:1-95:1
I'm not sure if the nginx ingress or the other kube-proxy modes have a similar strategy.

When gunicorn receives a SIGTERM, does it stop listening on the socket but continue to drain the open requests? In that case it should gracefully drain with the userspace kube-proxy, since kube-proxy will move on to the next endpoint if gunicorn does not accept the connection.

Even if servers like (g)unicorn are doing the wrong thing, there's no guarantee which is faster: a) the endpoint update, or b) SIGTERM on the process. I think the endpoint update should be synchronous, and only after it has finished should we continue with SIGTERM (as discussed above). Any other solution has this timing problem.

Some notes about our testing and workarounds for now:

  • You can work around this by using the sleep hack in the case of services. Magic sleeps are never really "real" patterns, hacks at best.
  • This problem is mitigated by nginx for GETs and such because it retries on 50x responses.
  • There's another issue in nginx where a full upstream flip (e.g. pool [a,b] -> [c,d]) causes proxy_next_upstream to not work. Testable with e.g. a rolling update, replicas 3, maxSurge 25%, maxUnavailable 50%, where it ends up flipping from 4 pods (2 old in pool, 2 new getting ready) to 2 new in pool.
  • I haven't been able to reproduce this on GLBC. Any ideas what it does differently, since simply using a service VIP doesn't fix it? Does it do the same as proxy_next_upstream but better?

There's another issue in nginx where a full upstream flip (e.g. pool [a,b] -> [c,d]) causes proxy_next_upstream to not work. Testable with e.g. a rolling update, replicas 3, maxSurge 25%, maxUnavailable 50%, where it ends up flipping from 4 pods (2 old in pool, 2 new getting ready) to 2 new in pool.

The focus after the next nginx controller release (0.9) will be avoiding nginx reloads for upstream updates.

Just ran into this issue - at the very least this should be better documented. I'm sure the vast majority of people using rolling updates believe that Kubernetes waits until a pod is out of rotation before sending SIGTERM, and this is not the case!

I also think that even if it's not implemented right away, sequencing SIGTERM to happen after a pod has been taken out of rotation should be the eventual goal. I don't think saying "gunicorn is doing the wrong thing" is a useful response; that's just Kubernetes redefining what it thinks these signals mean. gunicorn and other servers behave in a fairly standard way wrt signals, and Kubernetes should aim to be compatible with them rather than offloading the problem onto users.

@thockin you mentioned in an earlier comment:

During that time window (30 seconds by default), the Pod is considered “terminating” and will be removed from any Services by a controller (async). The Pod itself only needs to catch the SIGTERM, and start failing any readiness probes.

Why do you need to start failing the readiness probe?

When a Pod is marked as Terminating, the Endpoint controller observes the change and removes the pod from Endpoints.

Since this is already done as a result of the pod terminating, why do we also need to start failing the readiness probe, which (after it fails a certain number of times) will also result in the pod being removed as an endpoint of the service?

@simonkotwicz you do not need to start failing the readiness probe (these days anymore?). I'm not sure if it used to be like that, but it's certainly not the case these days. Things work fine without that part.