acouvreur / traefik-ondemand-service

Traefik ondemand service for the traefik ondemand plugin

Home Page: https://pilot.traefik.io/plugins/605afbdba5f67ab9a1b0e53a/containers-on-demand


Getting 503 first time a workload is woken up

jturpin82 opened this issue · comments

Thank you for that beautiful project. It's very useful to me!

I'm using a managed Kubernetes, GKE 1.21.5, traefik 2.5.6 (installed with helm chart) and using many workloads in the default namespace.

Using traefik-ondemand-plugin 1.2.0-beta.3 along with traefik-ondemand-service 1.7, it's straightforward to scale small workloads (like nginx) up and down (to zero).

But a problem arises when trying to wake up bigger workloads (images around 400-500 MB). I can see the workload (pod) wake up and become ready, but Traefik ends up getting a 503 from the backend service (probably because that is the actual behavior of providers.kubernetesingress.allowEmptyServices=true). If I immediately refresh my page (or use a Traefik plugin to handle the 503, like errorpages), I can access my web page.

But I would really like to avoid the 503 (since I will also wake the workload through API calls, not only "web" calls).

My assumption is the following: when the workload is waking up, an endpoint is generated and should eventually become available to the Kubernetes Service. But in this case, Traefik tries to reach the Kubernetes Service while the endpoint is not fully ready yet (it may become ready only some milliseconds later). This is when the endpoint is "empty" in the following screenshot:


Traefik is getting the 503 when the endpoint is still at the empty stage.

Looking at the Go code here, it seems the service is considered up as soon as the deployment is ready. Could we maybe check the Endpoints status instead? Or the Service? I don't know if it would help or if it's really the root cause here...
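To illustrate the distinction being raised here (with hypothetical, simplified types, not the plugin's actual code): deployment readiness only says the pods are up, while endpoint readiness says the Service actually has addresses to route to. The 503 window is the gap between the two.

```go
package main

import "fmt"

// Hypothetical, simplified models of the Kubernetes objects involved;
// the real service would use client-go types.
type Deployment struct {
	ReadyReplicas int32
	Replicas      int32
}

type Endpoints struct {
	Addresses []string // IPs bound to the Service's Endpoints object
}

// deploymentReady mirrors the current check: all replicas are ready.
func deploymentReady(d Deployment) bool {
	return d.Replicas > 0 && d.ReadyReplicas == d.Replicas
}

// endpointsReady is the stricter check discussed in this issue: the
// Endpoints object must be bound to at least one IP before traffic
// can actually reach a pod.
func endpointsReady(e Endpoints) bool {
	return len(e.Addresses) > 0
}

func main() {
	d := Deployment{ReadyReplicas: 1, Replicas: 1}
	e := Endpoints{} // pods ready, but the endpoint is not yet populated
	fmt.Println(deploymentReady(d)) // true
	fmt.Println(endpointsReady(e))  // false: this is the 503 window
}
```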

Traefik is configured with the following options:

  • --experimental.plugins.traefik-ondemand-plugin.modulename=github.com/acouvreur/traefik-ondemand-plugin
  • --experimental.plugins.traefik-ondemand-plugin.version=v1.2.0-beta.3
  • --providers.kubernetesingress.allowEmptyServices=true

Here is how I annotated ingresses:
traefik.ingress.kubernetes.io/router.middlewares: default-ondemand-kfnqt8n476wgq28@kubernetescrd

Basically, the traefik-ondemand-service is configured as described here: KUBERNETES.md

Tell me if you need me to provide more conf files.

And also, thank you for your help!

--

I had the same behavior on a Docker Swarm setup. So it might be related to how quickly the service is determined available/healthy.

From the documentation https://doc.traefik.io/traefik/getting-started/faq/#502-bad-gateway

502 Bad Gateway

Traefik returns a 502 response code when an error happens while contacting the upstream service.

503 Service Unavailable

Traefik returns a 503 response code when a Router has been matched
but there are no servers ready to handle the request.

This situation is encountered when a service has been explicitly configured without servers,
or when a service has healthcheck enabled and all servers are unhealthy.

I think I got a "Bad Gateway" response in my case.

Maybe the 503 could be a consequence of allowEmptyServices, according to the doc?

Thanks @acouvreur I will test that on Kubernetes!

My change is only related to Swarm. I tried to find some metadata that could help me consider a service healthy for more than 5 seconds, but couldn't.

If you do, please share it with me as I'll fix it right away.

Could the age (now - creationTimestamp) of the Endpoint be considered?

Example below with a simple nginx service:
kubectl create deploy nginx --image nginx
kubectl expose deploy nginx --port 80
kubectl get endpoints nginx -o=jsonpath='{.metadata.creationTimestamp}'

2022-04-22T07:43:32

The creationTimestamp can be considered when there is no healthcheck. But when there is a healthcheck, is the creationTimestamp set at the first healthy check?

Yes, you're right. creationTimestamp cannot be considered reliable information about the service's full availability.

I think the only way to be sure the service is healthy is to check whether the endpoint is bound to at least one IP, like the following:

kubectl get endpoints nginx -o=jsonpath='{.subsets[*].addresses[*].ip}'
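In Go, the same check as that jsonpath query ({.subsets[*].addresses[*].ip}) could look like the sketch below. The struct shapes are a minimal, hypothetical mirror of the Endpoints resource, not the client-go types:

```go
package main

import "fmt"

// Minimal mirror of the parts of the Endpoints resource the jsonpath
// query above walks through.
type Address struct{ IP string }
type Subset struct{ Addresses []Address }
type Endpoints struct{ Subsets []Subset }

// boundIPs flattens the addresses of every subset, which is what the
// suggested health check would inspect: a non-empty result means the
// Service can actually route traffic somewhere.
func boundIPs(e Endpoints) []string {
	var ips []string
	for _, s := range e.Subsets {
		for _, a := range s.Addresses {
			ips = append(ips, a.IP)
		}
	}
	return ips
}

func main() {
	e := Endpoints{Subsets: []Subset{
		{Addresses: []Address{{IP: "10.0.0.12"}}},
	}}
	ips := boundIPs(e)
	fmt.Println(len(ips) > 0) // true: at least one IP is bound
}
```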

What do you think? Is this something that can be checked on your side? Thank you for your hard work on this plugin!

That might be a good solution, I'll look into it

Thank you!

I'm having an issue implementing it: endpoints can be created with a different name than the service/deployment, right?
So if you have any lead on this... Right now I can't find a good solution for it.

The Endpoints object associated with a Service must always have the same name as the Service:
Kubernetes will automatically create an Endpoints object with the same name as the Service.

So maybe, if we want to consider a service fully available, we could check the IPs of the Endpoints object that has the same name as the Service.
Does that make sense?

I'll create this feature as an experimental flag

You can try it out on #29

Feedback welcomed

Sure. Let me test and get back to you!

Could you please add a complementary rule to the RBAC example, as below?:

  - apiGroups:
      - ""
    resources:
      - endpoints
    verbs:
      - get

Otherwise users will get:
level=error msg="endpoints "XXXXXXXXX" is forbidden: User "system:serviceaccount:default:traefik-ondemand-service" cannot get resource "endpoints" in API group "" in the namespace "default""

That makes sense.
Does adding it resolve the issue?

Just deployed it on a test env; I'll wait a few more days to see if this is fully working. Getting back soon...

Unfortunately, I'm still getting some 503s the first time. Let me gather some evidence.

Image used:
ghcr.io/acouvreur/traefik-ondemand-service:fix-wait-for-k8s-endpoint-to-have-one-ip

I'm also getting the same error with ghcr.io/acouvreur/traefik-ondemand-service:fix-wait-for-k8s-endpoint-to-have-one-ip

I'll take a look in a few days