Liveness check doesn't work.

Question

monaka opened this issue 8 years ago

My router hung overnight with TLS timeout errors. The log is below.
This error is caused by a K8s issue, not by a bug in Deis Workflow.
But the liveness check should have restarted the router. In my case, it was not restarted.

2016-10-05 18:01:16.007365 I | INFO: Router configuration has changed in k8s.
2016-10-05 18:01:16.197147 I | INFO: Reloading nginx...
2016-10-05 18:01:16.226493 I | INFO: nginx reloaded.
2016-10-05 18:02:01.031902 I | INFO: Router configuration has changed in k8s.
2016-10-05 18:02:01.038206 I | INFO: Reloading nginx...
2016-10-05 18:02:01.042403 I | INFO: nginx reloaded.
2016-10-05 18:02:08.770870 I | INFO: Router configuration has changed in k8s.
2016-10-05 18:02:08.774971 I | INFO: Reloading nginx...
2016-10-05 18:02:08.778449 I | INFO: nginx reloaded.
[2016-10-05T21:41:31+00:00] -  - 172.16.67.0 - - - 400 - "\x05\x02\x00\x02" - 325 - "-" - "-" - "_" - - - - - - - 5.075
2016-10-05 21:54:44.669198 I | Error building model; not modifying certs or configuration: Get https://172.17.0.1:443/api/v1/namespaces/{censored}/endpoints/{censored}: net/http: TLS handshake timeout.
2016-10-05 21:54:54.681336 I | Error building model; not modifying certs or configuration: Get https://172.17.0.1:443/apis/extensions/v1beta1/namespaces/deis/deployments/deis-router: net/http: TLS handshake timeout.
2016-10-05 21:55:04.918658 I | Error building model; not modifying certs or configuration: Get https://172.17.0.1:443/apis/extensions/v1beta1/namespaces/deis/deployments/deis-router: net/http: TLS handshake timeout.
2016-10-05 21:55:14.947594 I | Error building model; not modifying certs or configuration: Get https://172.17.0.1:443/apis/extensions/v1beta1/namespaces/deis/deployments/deis-router: net/http: TLS handshake timeout.
2016-10-05 21:55:24.959532 I | Error building model; not modifying certs or configuration: Get https://172.17.0.1:443/apis/extensions/v1beta1/namespaces/deis/deployments/deis-router: net/http: TLS handshake timeout.
2016-10-05 21:55:51.639615 I | Error building model; not modifying certs or configuration: Get https://172.17.0.1:443/apis/extensions/v1beta1/namespaces/deis/deployments/deis-router: net/http: TLS handshake timeout.

Kent Rancourt · Answer 1 · Thu Oct 06 2016 12:08:35 GMT+0800 (China Standard Time)

This would not have prevented the router from serving requests, so it actually was still alive.

Masaki Muranaka · Answer 2 · Thu Oct 06 2016 13:04:36 GMT+0800 (China Standard Time)

@krancour I think so too. I'm not sure (I restarted my router without inspecting it enough), but the router could probably still respond to the liveness check. However, new applications couldn't be routed because Nginx failed to reload its configuration.
This is an unexpected, inconsistent state. I think the router should be restarted in that case.

Masaki Muranaka · Answer 3 · Thu Oct 06 2016 13:12:53 GMT+0800 (China Standard Time)

(Just retrying the reload, rather than restarting, might be enough.)

Kent Rancourt · Answer 4 · Thu Oct 06 2016 23:27:26 GMT+0800 (China Standard Time)

I understand the impulse here... you see something alarming in the logs, assume the worst, and wish the router would be restarted in order to rectify it. I'll try to explain why we don't want that behavior.

You see an error every ten seconds and that error occurs while attempting to build the router's configuration model from various k8s resources that the router watches. To be clear, this model is built every ten seconds regardless of whether there are changes to those underlying resources or not. (For the record, this is something that could probably be improved in the future.) Once the model is built, it's deep compared to the previous model, which is stored in memory. If there are differences, then Nginx configuration is reloaded. (Note the model is built before the comparison so that inconsequential changes to the underlying k8s resources [ones not affecting the final model] will not prompt Nginx to reload.)
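As a rough illustration of the cycle described above (build the model, deep-compare it with the previous one, reload only on a difference), here is a minimal Go sketch. It is not the router's actual code: `RouterConfig`, `reconcile`, `buildOK`, and `buildFailing` are hypothetical names, and the comparison is assumed to behave like `reflect.DeepEqual`.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// RouterConfig is a hypothetical stand-in for the router's configuration model.
type RouterConfig struct {
	Apps map[string][]string // app name -> domains
}

// reconcile runs one cycle: build the model, deep-compare it with the
// previous one, and reload only when they differ. It returns the model to
// keep in memory and whether a reload happened.
func reconcile(
	prev *RouterConfig,
	build func() (*RouterConfig, error),
	reload func(*RouterConfig) error,
) (*RouterConfig, bool, error) {
	next, err := build()
	if err != nil {
		// The build failed, so nothing is modified and the previous model stays in use.
		return prev, false, fmt.Errorf("error building model; not modifying configuration: %w", err)
	}
	if reflect.DeepEqual(prev, next) {
		// Inconsequential changes upstream produce an identical model: no reload.
		return prev, false, nil
	}
	if err := reload(next); err != nil {
		return prev, false, err
	}
	return next, true, nil
}

// buildOK simulates a successful build from k8s resources (assumed data).
func buildOK() (*RouterConfig, error) {
	return &RouterConfig{Apps: map[string][]string{"web": {"example.com"}}}, nil
}

// buildFailing simulates the apiserver being unreachable.
func buildFailing() (*RouterConfig, error) {
	return nil, errors.New("net/http: TLS handshake timeout")
}

func main() {
	var current *RouterConfig
	reload := func(c *RouterConfig) error {
		fmt.Println("reloading nginx with", len(c.Apps), "app(s)")
		return nil
	}
	// A long-running process would repeat this on a timer (every ten seconds
	// per the description above); three cycles are run inline here.
	for i, build := range []func() (*RouterConfig, error){buildOK, buildOK, buildFailing} {
		next, reloaded, err := reconcile(current, build, reload)
		current = next
		fmt.Printf("cycle %d: reloaded=%v err=%v\n", i+1, reloaded, err)
	}
}
```

In this sketch only the first cycle reloads. The second builds an identical model, and the third fails to build, so the model already in memory is left untouched.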

You are correct that while this problem persists, new applications cannot be routed (nor would any other configuration changes be applied). What I want to emphasize, however, is that, keeping my explanation above in mind, these errors aren't a solid indication that any config change was actually being missed. That could be the case, but since the underlying k8s resources couldn't be fetched, we have no way of knowing.

So in this scenario, we don't know how profoundly the router's configuration differs from what it is supposed to be. It could be off by a lot... or not at all. What we do know, however, is that the router is still serving requests.

Moving on, consider now the fact that (as you indicated) the problem here is with k8s, not with the router. This means restarting the router might not (more than likely won't) rectify the problem. (I'm not sure if it did in your case or not.) If restarting does not solve the problem, then you're left with a much worse problem: the router will be unable to do the initial build of its model upon restart and then it will not route any traffic at all.
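To make the difference between a running router and a freshly restarted one concrete, here is a small Go sketch under stated assumptions. `Router`, `Model`, `Sync`, `CanRoute`, and `apiServerDown` are hypothetical names invented for illustration and do not come from the router's source.

```go
package main

import (
	"errors"
	"fmt"
)

// Model is a hypothetical stand-in for the router's configuration model.
type Model struct {
	Routes []string
}

// Router keeps the last successfully built model in memory.
type Router struct {
	model *Model
}

// Sync tries to rebuild the model. On failure the in-memory model is left
// untouched, so a router that already has one keeps using it.
func (r *Router) Sync(fetch func() (*Model, error)) error {
	m, err := fetch()
	if err != nil {
		return err
	}
	r.model = m
	return nil
}

// CanRoute reports whether the router has any model to route with.
func (r *Router) CanRoute() bool {
	return r.model != nil
}

// apiServerDown simulates the apiserver being unreachable.
func apiServerDown() (*Model, error) {
	return nil, errors.New("net/http: TLS handshake timeout")
}

func main() {
	// A router that built its model before the outage began.
	running := &Router{model: &Model{Routes: []string{"app-a", "app-b"}}}
	_ = running.Sync(apiServerDown)
	fmt.Println("running router can route:", running.CanRoute()) // true

	// The same router after a restart during the outage: no initial model.
	restarted := &Router{}
	_ = restarted.Sync(apiServerDown)
	fmt.Println("restarted router can route:", restarted.CanRoute()) // false
}
```

The running router keeps serving with a possibly stale model, while the restarted one has nothing to serve until the apiserver becomes reachable again.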

So here's what it boils down to...

Restarting the router automatically in response to a problem of indeterminate severity seems like a dangerous course of action, especially considering that doing so could make the problem much, much worse.

Restarting might be a wise thing to do here, but it shouldn't be done automatically. It's a decision that should be made by an operator after assessing why communication with the apiserver is failing.

Masaki Muranaka · Answer 5 · Fri Oct 07 2016 06:13:49 GMT+0800 (China Standard Time)

Whether restarting is valid depends on whether the TLS errors are temporary. Indeed, restarting may make things worse, so it may be a bad idea.

But "retry to reload" is low-risk and effective, right? I guess the router can detect a reload failure.

Kent Rancourt · Answer 6 · Sat Oct 08 2016 00:11:42 GMT+0800 (China Standard Time)

@monaka can you clarify what you mean by "retry to reload"?

Matthew Fisher · Answer 7 · Tue Nov 01 2016 01:01:35 GMT+0800 (China Standard Time)

Closing as intentional. This isn't something we're going to fix, as this is how we intend it to operate given the circumstances.