acouvreur / traefik-ondemand-service

Traefik ondemand service for the traefik ondemand plugin

Home Page: https://pilot.traefik.io/plugins/605afbdba5f67ab9a1b0e53a/containers-on-demand


Getting 503 first time a workload is woken up

jturpin82 opened this issue · comments

Thank you for that beautiful project. It's very useful to me!

I'm using a managed Kubernetes, GKE 1.21.5, traefik 2.5.6 (installed with helm chart) and using many workloads in the default namespace.

Using traefik-ondemand-plugin 1.2.0-beta.3 along with traefik-ondemand-service 1.7, it's straightforward to scale small workloads (like nginx) up and down (to zero).

But a problem arises when trying to wake up bigger workloads (images around 400-500 MB). I can see the workload (pod) wake up and become ready, but Traefik ends up getting a 503 from the backend service (probably because that is the actual behavior of providers.kubernetesingress.allowEmptyServices=true). If I immediately refresh my page (or use a Traefik plugin to handle the 503, like errorpages), I can access my web page.

But I would really like to avoid the 503 (since I will also wake the workload through API calls, not only "web" calls).

My assumption is the following: when the workload is waking up, an endpoint is generated and should eventually become available to the Kubernetes Service. But in this case, Traefik tries to reach the Kubernetes Service while the endpoint is not fully ready yet (it may become ready only some milliseconds later). This is when the endpoint is "empty" in the following screenshot:


Traefik is getting the 503 when the endpoint is still at the empty stage.

Looking at the Go code here, it seems the service is considered up as soon as the deployment is ready. Could we maybe check the Endpoints status instead? Or the Service? I don't know if it would help or if it's really the root cause here...
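To illustrate the distinction being raised here (with hypothetical, simplified types, not the plugin's actual code): deployment readiness only says the pods are up, while endpoint readiness says the Service actually has addresses to route to. The 503 window is the gap between the two.

```go
package main

import "fmt"

// Hypothetical, simplified models of the Kubernetes objects involved;
// the real service would use client-go types.
type Deployment struct {
	ReadyReplicas int32
	Replicas      int32
}

type Endpoints struct {
	Addresses []string // IPs bound to the Service's Endpoints object
}

// deploymentReady mirrors the current check: all replicas are ready.
func deploymentReady(d Deployment) bool {
	return d.Replicas > 0 && d.ReadyReplicas == d.Replicas
}

// endpointsReady is the stricter check discussed in this issue: the
// Endpoints object must be bound to at least one IP before traffic
// can actually reach a pod.
func endpointsReady(e Endpoints) bool {
	return len(e.Addresses) > 0
}

func main() {
	d := Deployment{ReadyReplicas: 1, Replicas: 1}
	e := Endpoints{} // pods ready, but the endpoint is not yet populated
	fmt.Println(deploymentReady(d)) // true
	fmt.Println(endpointsReady(e))  // false: this is the 503 window
}
```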

Traefik is configured with the following options:

  • --experimental.plugins.traefik-ondemand-plugin.modulename=github.com/acouvreur/traefik-ondemand-plugin
  • --experimental.plugins.traefik-ondemand-plugin.version=v1.2.0-beta.3
  • --providers.kubernetesingress.allowEmptyServices=true

Here is how I annotated ingresses:
traefik.ingress.kubernetes.io/router.middlewares: default-ondemand-kfnqt8n476wgq28@kubernetescrd

Basically, the traefik-ondemand-service is configured as described here: KUBERNETES.md

Tell me if you need me to provide more conf files.

And also, thank you for your help!

--

I had the same behavior on a Docker Swarm setup. So it might be related to how quickly the service is determined available/healthy.

From the documentation https://doc.traefik.io/traefik/getting-started/faq/#502-bad-gateway

502 Bad Gateway

Traefik returns a 502 response code when an error happens while contacting the upstream service.

503 Service Unavailable

Traefik returns a 503 response code when a Router has been matched
but there are no servers ready to handle the request.

This situation is encountered when a service has been explicitly configured without servers,
or when a service has healthcheck enabled and all servers are unhealthy.

I think I got a "Bad Gateway" response in my case.

Maybe the 503 could be a consequence of allowEmptyServices, according to the doc?

Thanks @acouvreur I will test that on Kubernetes!

My change is only related to Swarm. I tried to find some metadata that could help me consider a service healthy for more than 5 seconds, but couldn't.

If you do, please share it with me as I'll fix it right away.

Could the age (now - creationTimestamp) of the Endpoint be considered?

Example below with a simple nginx service:
kubectl create deploy nginx --image nginx
kubectl expose deploy nginx --port 80
kubectl get endpoints nginx -o=jsonpath='{.metadata.creationTimestamp}'

2022-04-22T07:43:32

The creationTimestamp can be considered when there is no healthcheck. But when there is a healthcheck, is the creationTimestamp set at the first healthy check?

Yes, you're right. creationTimestamp cannot be considered reliable information about the service's full availability.

I think the only way to be sure the service is healthy is to check whether the endpoint is bound to at least one IP, like the following:

kubectl get endpoints nginx -o=jsonpath='{.subsets[*].addresses[*].ip}'
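In Go, the same check as that jsonpath query ({.subsets[*].addresses[*].ip}) could look like the sketch below. The struct shapes are a minimal, hypothetical mirror of the Endpoints resource, not the client-go types:

```go
package main

import "fmt"

// Minimal mirror of the parts of the Endpoints resource the jsonpath
// query above walks through.
type Address struct{ IP string }
type Subset struct{ Addresses []Address }
type Endpoints struct{ Subsets []Subset }

// boundIPs flattens the addresses of every subset, which is what the
// suggested health check would inspect: a non-empty result means the
// Service can actually route traffic somewhere.
func boundIPs(e Endpoints) []string {
	var ips []string
	for _, s := range e.Subsets {
		for _, a := range s.Addresses {
			ips = append(ips, a.IP)
		}
	}
	return ips
}

func main() {
	e := Endpoints{Subsets: []Subset{
		{Addresses: []Address{{IP: "10.0.0.12"}}},
	}}
	ips := boundIPs(e)
	fmt.Println(len(ips) > 0) // true: at least one IP is bound
}
```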

What do you think? Is this something that can be checked on your side? Thank you for your hard work on this plugin!

That might be a good solution, I'll look into it

Thank you!

I'm having an issue implementing it: endpoints can be created with a different name than the service/deployment, right?
So if you have any lead on this... Right now I can't find a good solution for it.

The Endpoints object associated with a Service must always have the same name as the Service:
Kubernetes will automatically create an Endpoints object with the same name as the Service.

So maybe, if we want to consider a service fully available, we could check the IPs of the Endpoints object that has the same name as the Service.
Does that make sense?

I'll create this feature as an experimental flag

You can try it out on #29

Feedback welcomed

Sure. Let me test and get back to you!

Could you please add a complementary rule to the RBAC example, as below?:

  - apiGroups:
      - ""
    resources:
      - endpoints
    verbs:
      - get

Otherwise users will get:
level=error msg="endpoints "XXXXXXXXX" is forbidden: User "system:serviceaccount:default:traefik-ondemand-service" cannot get resource "endpoints" in API group "" in the namespace "default""

That makes sense.
Does adding it resolve the issue?

Just deployed it on a test env; I'll wait a few more days to see if this is fully working. Getting back soon...

Unfortunately, I'm still getting some 503s the first time. Let me gather some evidence.

Image used:
ghcr.io/acouvreur/traefik-ondemand-service:fix-wait-for-k8s-endpoint-to-have-one-ip

I'm also getting the same error with ghcr.io/acouvreur/traefik-ondemand-service:fix-wait-for-k8s-endpoint-to-have-one-ip

I'll take a look in a few days