tnozicka / openshift-acme

ACME Controller for OpenShift and Kubernetes Cluster. (Supports e.g. Let's Encrypt)

Controller gets stuck creating/deleting exposer pods

MohammadKarimi23 opened this issue

What happened:
The certificate is never provisioned: after a new route is added with the kubernetes.io/tls-acme=true annotation, the controller keeps creating and deleting exposer pods.

What you expected to happen:
A fake ACME certificate should be assigned to the route (the staging environment is deployed).
Also, the exposer pod should be deleted after serving the HTTP challenge.

How to reproduce it (as minimally and precisely as possible):
Create a new route with the kubernetes.io/tls-acme=true annotation.

Anything else we need to know?:
Here is part of the controller logs (the original logs can be found here):

I0425 12:31:58.219136       1 route.go:217] Updating Route test-namespace/acme-test RV=2610499->2610945 UID=6a6e98c5-86f0-11ea-b7c6-fa163ef6d455->6a6e98c5-86f0-11ea-b7c6-fa163ef6d455
I0425 12:31:58.219315       1 route.go:487] Started syncing Route "test-namespace/acme-test"
I0425 12:31:58.219374       1 route.go:518] Skipping Route test-namespace/acme-test UID=6a6e98c5-86f0-11ea-b7c6-fa163ef6d455 RV=2610499
I0425 12:31:58.219389       1 route.go:489] Finished syncing Route "test-namespace/acme-test"
I0425 12:31:58.245434       1 route.go:217] Updating Route test-namespace/acme-test RV=2610499->2610945 UID=6a6e98c5-86f0-11ea-b7c6-fa163ef6d455->6a6e98c5-86f0-11ea-b7c6-fa163ef6d455
I0425 12:31:58.245503       1 route.go:487] Started syncing Route "test-namespace/acme-test"
I0425 12:31:58.245603       1 route.go:554] Route "test-namespace/acme-test" needs new certificate: Route is missing CertKey
I0425 12:31:58.246045       1 route.go:598] Using ACME client with DirectoryURL "https://acme-staging-v02.api.letsencrypt.org/directory"
I0425 12:31:59.922644       1 route.go:613] Created Order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482" for Route "test-namespace/acme-test"
I0425 12:31:59.923083       1 route.go:473] Updating status for Route test-namespace/acme-test to (*api.Status){ObservedGeneration:(int64)0 CertificateMeta:(*api.CertificateMeta)<nil> ProvisioningStatus:(api.CertProvisioningStatus){StartedAt:(time.Time)2020-04-25 12:31:59.922669229 +0000 UTC m=+321.007594668 EarliestAttemptAt:(time.Time)0001-01-01 00:00:00 +0000 UTC Failures:(int)0 OrderURI:(string)https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482 OrderStatus:(string)pending OrderError:(*api.OrderError)<nil> AccountHash:(string)} Signature:(string)}
I0425 12:31:59.935093       1 route.go:217] Updating Route test-namespace/acme-test RV=2610945->2610954 UID=6a6e98c5-86f0-11ea-b7c6-fa163ef6d455->6a6e98c5-86f0-11ea-b7c6-fa163ef6d455
I0425 12:31:59.935560       1 route.go:489] Finished syncing Route "test-namespace/acme-test"
I0425 12:31:59.935590       1 route.go:487] Started syncing Route "test-namespace/acme-test"
I0425 12:31:59.935712       1 route.go:554] Route "test-namespace/acme-test" needs new certificate: Route is missing CertKey
I0425 12:31:59.935896       1 route.go:598] Using ACME client with DirectoryURL "https://acme-staging-v02.api.letsencrypt.org/directory"
I0425 12:31:59.936422       1 route.go:217] Updating Route test-namespace/acme-test RV=2610945->2610954 UID=6a6e98c5-86f0-11ea-b7c6-fa163ef6d455->6a6e98c5-86f0-11ea-b7c6-fa163ef6d455
I0425 12:32:01.196036       1 route.go:641] Route "test-namespace/acme-test": Order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482" is in "pending" state
I0425 12:32:01.196092       1 route.go:646] Route "test-namespace/acme-test": Order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482" contains 1 authorization(s)
I0425 12:32:01.466771       1 route.go:654] Route "test-namespace/acme-test": order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482": authz "https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/51544677": is in "pending" state
I0425 12:32:01.466872       1 route.go:681] route "test-namespace/acme-test": order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482": authz "https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/51544677": challenge "pending" is in "pending" state
I0425 12:32:01.467077       1 route.go:737] Exposer route test-namespace/exposer-vlo37cfp6n7lomor75nf1qc5aspv6lrr2lmt696a5f3lo0s2ntn0 not found, creating new one.
I0425 12:32:01.488553       1 route.go:743] Created exposer Route test-namespace/exposer-vlo37cfp6n7lomor75nf1qc5aspv6lrr2lmt696a5f3lo0s2ntn0 for Route test-namespace/acme-test
I0425 12:32:01.488678       1 route.go:793] Exposer secret test-namespace/exposer-vlo37cfp6n7lomor75nf1qc5aspv6lrr2lmt696a5f3lo0s2ntn0 not found, creating new one.
I0425 12:32:01.513587       1 route.go:900] Exposer replica set test-namespace/exposer-vlo37cfp6n7lomor75nf1qc5aspv6lrr2lmt696a5f3lo0s2ntn0 not found, creating new one.
I0425 12:32:01.535227       1 route.go:955] Exposer service test-namespace/exposer-vlo37cfp6n7lomor75nf1qc5aspv6lrr2lmt696a5f3lo0s2ntn0 not found, creating new one.
I0425 12:32:01.590063       1 route.go:977] exposer Route test-namespace/exposer-vlo37cfp6n7lomor75nf1qc5aspv6lrr2lmt696a5f3lo0s2ntn0 isn't admitted yet
I0425 12:32:01.590553       1 route.go:473] Updating status for Route test-namespace/acme-test to (*api.Status){ObservedGeneration:(int64)0 CertificateMeta:(*api.CertificateMeta)<nil> ProvisioningStatus:(api.CertProvisioningStatus){StartedAt:(time.Time)2020-04-25 12:31:59.922669229 +0000 UTC EarliestAttemptAt:(time.Time)2020-04-25 12:31:59.922669229 +0000 UTC Failures:(int)0 OrderURI:(string)https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482 OrderStatus:(string)pending OrderError:(*api.OrderError)<nil> AccountHash:(string)} Signature:(string)}
I0425 12:32:01.619444       1 route.go:217] Updating Route test-namespace/acme-test RV=2610954->2610968 UID=6a6e98c5-86f0-11ea-b7c6-fa163ef6d455->6a6e98c5-86f0-11ea-b7c6-fa163ef6d455
I0425 12:32:01.619846       1 route.go:217] Updating Route test-namespace/acme-test RV=2610954->2610968 UID=6a6e98c5-86f0-11ea-b7c6-fa163ef6d455->6a6e98c5-86f0-11ea-b7c6-fa163ef6d455
I0425 12:32:01.620187       1 route.go:489] Finished syncing Route "test-namespace/acme-test"
I0425 12:32:01.620219       1 route.go:487] Started syncing Route "test-namespace/acme-test"
I0425 12:32:01.620459       1 route.go:554] Route "test-namespace/acme-test" needs new certificate: Route is missing CertKey
I0425 12:32:01.620797       1 route.go:598] Using ACME client with DirectoryURL "https://acme-staging-v02.api.letsencrypt.org/directory"
I0425 12:32:02.559869       1 route.go:641] Route "test-namespace/acme-test": Order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482" is in "pending" state
I0425 12:32:02.559917       1 route.go:646] Route "test-namespace/acme-test": Order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482" contains 1 authorization(s)
I0425 12:32:02.792956       1 route.go:654] Route "test-namespace/acme-test": order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482": authz "https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/51544677": is in "pending" state
I0425 12:32:02.793003       1 route.go:681] route "test-namespace/acme-test": order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482": authz "https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/51544677": challenge "pending" is in "pending" state
I0425 12:32:03.151779       1 route.go:996] Can't self validate exposed token before accepting the challenge: getting "http://acme-test.my-domain.io/.well-known/acme-challenge/vd18FKkh1bQWLISxxrLRh_QrsGrfXXpefNpQIsN8WlU" return status code 404, expected 200: status "404 Not Found": content head: <html>
<head><title>404 Not Found</title></head>
<body bgcolor="white">
<center><h1>404 Not Found</h1></center>
<hr><center>nginx/1.13.12</center>
I0425 12:32:03.152167       1 route.go:489] Finished syncing Route "test-namespace/acme-test"
I0425 12:32:03.152194       1 route.go:487] Started syncing Route "test-namespace/acme-test"
I0425 12:32:03.152449       1 route.go:554] Route "test-namespace/acme-test" needs new certificate: Route is missing CertKey
I0425 12:32:03.152810       1 route.go:598] Using ACME client with DirectoryURL "https://acme-staging-v02.api.letsencrypt.org/directory"
I0425 12:32:04.067586       1 route.go:641] Route "test-namespace/acme-test": Order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482" is in "pending" state
I0425 12:32:04.067626       1 route.go:646] Route "test-namespace/acme-test": Order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482" contains 1 authorization(s)
I0425 12:32:04.307983       1 route.go:654] Route "test-namespace/acme-test": order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482": authz "https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/51544677": is in "pending" state
I0425 12:32:04.308042       1 route.go:681] route "test-namespace/acme-test": order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482": authz "https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/51544677": challenge "pending" is in "pending" state
I0425 12:32:04.308357       1 route.go:988] exposer ReplicaSet test-namespace/exposer-vlo37cfp6n7lomor75nf1qc5aspv6lrr2lmt696a5f3lo0s2ntn0 isn't available yet
I0425 12:32:04.308701       1 route.go:489] Finished syncing Route "test-namespace/acme-test"
I0425 12:32:08.784958       1 route.go:487] Started syncing Route "test-namespace/acme-test"
I0425 12:32:08.785213       1 route.go:554] Route "test-namespace/acme-test" needs new certificate: Route is missing CertKey
I0425 12:32:08.785541       1 route.go:598] Using ACME client with DirectoryURL "https://acme-staging-v02.api.letsencrypt.org/directory"
I0425 12:32:09.555128       1 reflector.go:432] k8s.io/client-go@v0.17.0/tools/cache/reflector.go:108: Watch close - *v1.ConfigMap total 478 items received
I0425 12:32:10.196869       1 route.go:641] Route "test-namespace/acme-test": Order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482" is in "pending" state
I0425 12:32:10.196913       1 route.go:646] Route "test-namespace/acme-test": Order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482" contains 1 authorization(s)
I0425 12:32:10.430511       1 route.go:654] Route "test-namespace/acme-test": order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482": authz "https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/51544677": is in "pending" state
I0425 12:32:10.430561       1 route.go:681] route "test-namespace/acme-test": order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482": authz "https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/51544677": challenge "pending" is in "pending" state
I0425 12:32:10.682628       1 route.go:1006] Accepted challenge for Route test-namespace/acme-test.
I0425 12:32:10.683367       1 route.go:489] Finished syncing Route "test-namespace/acme-test"

I0425 12:32:10.683429       1 route.go:487] Started syncing Route "test-namespace/acme-test"
I0425 12:32:10.683681       1 route.go:554] Route "test-namespace/acme-test" needs new certificate: Route is missing CertKey
I0425 12:32:10.684052       1 route.go:598] Using ACME client with DirectoryURL "https://acme-staging-v02.api.letsencrypt.org/directory"
I0425 12:32:11.598231       1 route.go:641] Route "test-namespace/acme-test": Order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482" is in "invalid" state
I0425 12:32:11.598442       1 route.go:1245] Cleaning up temporary exposer for Route test-namespace/acme-test (UID=6a6e98c5-86f0-11ea-b7c6-fa163ef6d455)
I0425 12:32:11.600056       1 event.go:281] Event(v1.ObjectReference{Kind:"Route", Namespace:"test-namespace", Name:"acme-test", UID:"6a6e98c5-86f0-11ea-b7c6-fa163ef6d455", APIVersion:"route.openshift.io/v1", ResourceVersion:"2610968", FieldPath:""}): type: 'Warning' reason: 'AcmeFailedOrder' Order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482" for domain "acme-test.my-domain.io" failed: <nil>
I0425 12:32:11.618005       1 route.go:473] Updating status for Route test-namespace/acme-test to (*api.Status){ObservedGeneration:(int64)0 CertificateMeta:(*api.CertificateMeta)<nil> ProvisioningStatus:(api.CertProvisioningStatus){StartedAt:(time.Time)2020-04-25 12:31:59.922669229 +0000 UTC EarliestAttemptAt:(time.Time)2020-04-25 12:31:59.922669229 +0000 UTC Failures:(int)1 OrderURI:(string)https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482 OrderStatus:(string)invalid OrderError:(*api.OrderError)<nil> AccountHash:(string)} Signature:(string)}
I0425 12:32:11.635117       1 route.go:217] Updating Route test-namespace/acme-test RV=2610968->2611033 UID=6a6e98c5-86f0-11ea-b7c6-fa163ef6d455->6a6e98c5-86f0-11ea-b7c6-fa163ef6d455
I0425 12:32:11.635417       1 route.go:217] Updating Route test-namespace/acme-test RV=2610968->2611033 UID=6a6e98c5-86f0-11ea-b7c6-fa163ef6d455->6a6e98c5-86f0-11ea-b7c6-fa163ef6d455
I0425 12:32:11.635579       1 route.go:489] Finished syncing Route "test-namespace/acme-test"

The controller keeps creating and deleting the exposer pod/service/route!
Output of oc get routes -n test-namespace -w:

acme-test   acme-test.my-domain.io ... 1 more             hello-world   web       edge/Allow   None
exposer-vlo37cfp6n7lomor75nf1qc5aspv6lrr2lmt696a5f3lo0s2ntn0   acme-test.my-domain.io   /.well-known/acme-challenge/vd18FKkh1bQWLISxxrLRh_QrsGrfXXpefNpQIsN8WlU   exposer-vlo37cfp6n7lomor75nf1qc5aspv6lrr2lmt696a5f3lo0s2ntn0   <all>     edge/Allow   None
exposer-vlo37cfp6n7lomor75nf1qc5aspv6lrr2lmt696a5f3lo0s2ntn0   acme-test.my-domain.io   /.well-known/acme-challenge/vd18FKkh1bQWLISxxrLRh_QrsGrfXXpefNpQIsN8WlU   exposer-vlo37cfp6n7lomor75nf1qc5aspv6lrr2lmt696a5f3lo0s2ntn0   <all>     edge/Allow   None
acme-test   acme-test.my-domain.io ... 1 more             hello-world   web       edge/Allow   None
exposer-vlo37cfp6n7lomor75nf1qc5aspv6lrr2lmt696a5f3lo0s2ntn0   acme-test.my-domain.io ... 1 more   /.well-known/acme-challenge/vd18FKkh1bQWLISxxrLRh_QrsGrfXXpefNpQIsN8WlU   exposer-vlo37cfp6n7lomor75nf1qc5aspv6lrr2lmt696a5f3lo0s2ntn0   <all>     edge/Allow   None
exposer-vlo37cfp6n7lomor75nf1qc5aspv6lrr2lmt696a5f3lo0s2ntn0   acme-test.my-domain.io ... 1 more   /.well-known/acme-challenge/vd18FKkh1bQWLISxxrLRh_QrsGrfXXpefNpQIsN8WlU   exposer-vlo37cfp6n7lomor75nf1qc5aspv6lrr2lmt696a5f3lo0s2ntn0   <all>     edge/Allow   None
acme-test   acme-test.my-domain.io ... 1 more             hello-world   web       edge/Allow   None
acme-test   acme-test.my-domain.io ... 1 more             hello-world   web       edge/Allow   None
acme-test   acme-test.my-domain.io ... 1 more             hello-world   web       edge/Allow   None
exposer-sqth3fngdh0qu7o68cn6ipafl8i2o8n8a772v7if91b3rg3g54b0   acme-test.my-domain.io   /.well-known/acme-challenge/KA0Kn2y-gfWz_2vR7aWgqxYcQewMYqYMGhwaEGGrDUY   exposer-sqth3fngdh0qu7o68cn6ipafl8i2o8n8a772v7if91b3rg3g54b0   <all>     edge/Allow   None
exposer-sqth3fngdh0qu7o68cn6ipafl8i2o8n8a772v7if91b3rg3g54b0   acme-test.my-domain.io   /.well-known/acme-challenge/KA0Kn2y-gfWz_2vR7aWgqxYcQewMYqYMGhwaEGGrDUY   exposer-sqth3fngdh0qu7o68cn6ipafl8i2o8n8a772v7if91b3rg3g54b0   <all>     edge/Allow   None
acme-test   acme-test.my-domain.io ... 1 more             hello-world   web       edge/Allow   None
exposer-sqth3fngdh0qu7o68cn6ipafl8i2o8n8a772v7if91b3rg3g54b0   acme-test.my-domain.io ... 1 more   /.well-known/acme-challenge/KA0Kn2y-gfWz_2vR7aWgqxYcQewMYqYMGhwaEGGrDUY   exposer-sqth3fngdh0qu7o68cn6ipafl8i2o8n8a772v7if91b3rg3g54b0   <all>     edge/Allow   None
exposer-sqth3fngdh0qu7o68cn6ipafl8i2o8n8a772v7if91b3rg3g54b0   acme-test.my-domain.io ... 1 more   /.well-known/acme-challenge/KA0Kn2y-gfWz_2vR7aWgqxYcQewMYqYMGhwaEGGrDUY   exposer-sqth3fngdh0qu7o68cn6ipafl8i2o8n8a772v7if91b3rg3g54b0   <all>     edge/Allow   None

Even so, the temporary route created for the HTTP challenge is responsive and returns the secret:

curl -X GET acme-test.mydomain.io/.well-known/acme-challenge/V7psMLFR30suwrF2QFjFlpLV_QB9_WjeABv2Kv45S2k
V7psMLFR30suwrF2QFjFlpLV_QB9_WjeABv2Kv45S2k.nXQVQqWv0W6RJJEIDl7B5teUDDBW6eni2QwTutC3dnE
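
For anyone who wants to script the same check as the curl above, here is a minimal Python sketch. It assumes the usual HTTP-01 convention in which the response body is the token, a dot, and the account key thumbprint, which matches the shape of the output above. The function names are made up for illustration, and this is not the controller's own self-validation code. Note that a check that passes from a machine inside the network does not show that the URL is reachable from the public internet.

```python
import urllib.request


def is_key_authorization(token: str, body: str) -> bool:
    """Return True if body looks like '<token>.<thumbprint>'."""
    prefix, sep, thumbprint = body.strip().partition(".")
    return prefix == token and sep == "." and thumbprint != ""


def check_challenge_url(url: str, timeout: float = 10.0) -> bool:
    """Fetch an HTTP-01 challenge URL and check the response body.

    The token is taken from the last path segment of the URL.
    urllib raises urllib.error.HTTPError for responses such as 404,
    so callers may want to catch that and treat it as a failure.
    """
    token = url.rstrip("/").rsplit("/", 1)[-1]
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
        return resp.status == 200 and is_key_authorization(token, body)
```

Calling check_challenge_url with the full http:// URL of the temporary route performs the request; is_key_authorization can be used on its own against output that was already captured.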

Environment:

  • OpenShift version : v3.11.0+39132cb-398

@tnozicka

It keeps trying because the validation is failing.

I0425 12:32:11.598231       1 route.go:641] Route "test-namespace/acme-test": Order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482" is in "invalid" state
I0425 12:32:03.151779       1 route.go:996] Can't self validate exposed token before accepting the challenge: getting "http://acme-test.my-domain.io/.well-known/acme-challenge/vd18FKkh1bQWLISxxrLRh_QrsGrfXXpefNpQIsN8WlU" return status code 404, expected 200: status "404 Not Found": content head: <html>
<head><title>404 Not Found</title></head>
<body bgcolor="white">
<center><h1>404 Not Found</h1></center>
<hr><center>nginx/1.13.12</center>

This means that the URL can't be reached.
Is it the actual domain you tried, or a redacted one?

Also, nginx means the request probably gets stuck on your load balancer. The response isn't coming from the openshift-acme exposer, and OCP uses HAProxy.

I included the result of curl on the temporary route in the issue description (I ran it locally on my machine to make sure it's exposed).
The 404 errors probably occur in the window after an exposer pod is deleted and before a new one has started. You can see in the full log that there are different errors while the exposer is running.
Also, the nginx error is not related to the load balancer: the pod that the route points to is running nginx.
And the domain is a redacted one, of course 😄

The reason I asked is that your challenge failed verification by Let's Encrypt:

I0425 12:32:11.598231       1 route.go:641] Route "test-namespace/acme-test": Order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482" is in "invalid" state
I0425 12:32:11.598442       1 route.go:1245] Cleaning up temporary exposer for Route test-namespace/acme-test (UID=6a6e98c5-86f0-11ea-b7c6-fa163ef6d455)
I0425 12:32:11.600056       1 event.go:281] Event(v1.ObjectReference{Kind:"Route", Namespace:"test-namespace", Name:"acme-test", UID:"6a6e98c5-86f0-11ea-b7c6-fa163ef6d455", APIVersion:"route.openshift.io/v1", ResourceVersion:"2610968", FieldPath:""}): type: 'Warning' reason: 'AcmeFailedOrder' Order "https://acme-staging-v02.api.letsencrypt.org/acme/order/13200343/87430482" for domain "acme-test.my-domain.io" failed: <nil>

In 90% of cases this is a domain that either exists only in local DNS or isn't set up to allow public access to the Router.
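
A rough way to check for this is to look at the addresses the hostname resolves to and see whether any of them are private. The Python sketch below uses only the standard library; the function names are hypothetical. One caveat: run from inside the cluster's network, the resolver may answer from local DNS, so the result can differ from what Let's Encrypt sees from the outside.

```python
import ipaddress
import socket


def non_public_addresses(addresses):
    """Return the addresses that are not globally routable."""
    return [a for a in addresses if not ipaddress.ip_address(a).is_global]


def resolve(hostname):
    """Resolve a hostname to a sorted list of unique IP address strings."""
    infos = socket.getaddrinfo(hostname, 80, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})
```

For example, non_public_addresses(resolve("acme-test.my-domain.io")) returning a non-empty list would suggest that the Router is only exposed on private addresses.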

Yeah, you're right! The routers in my OpenShift cluster are exposed on virtual IPs inside a private network, so authorization fails.
This is what I got in the "detail" field of the JSON response when trying to authorize manually:

No valid IP addresses found for acme-test.mydomain.io
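
To pull that field out programmatically, something like the following Python sketch might do. It assumes the authorization JSON follows the RFC 8555 layout, with a "challenges" list whose failed entries carry an "error" object. SAMPLE_AUTHZ is a made-up, trimmed-down document for illustration; only the "detail" string is taken from the real response, and the error "type" shown is an assumption.

```python
import json


def challenge_errors(authorization_json: str):
    """Return (challenge type, error detail) pairs for failed challenges."""
    authz = json.loads(authorization_json)
    errors = []
    for challenge in authz.get("challenges", []):
        error = challenge.get("error")
        if error:
            errors.append((challenge.get("type", ""), error.get("detail", "")))
    return errors


# Hypothetical, trimmed-down authorization object (not the real response).
SAMPLE_AUTHZ = """
{
  "identifier": {"type": "dns", "value": "acme-test.mydomain.io"},
  "status": "invalid",
  "challenges": [
    {
      "type": "http-01",
      "status": "invalid",
      "error": {
        "type": "urn:ietf:params:acme:error:dns",
        "detail": "No valid IP addresses found for acme-test.mydomain.io"
      }
    }
  ]
}
"""

for kind, detail in challenge_errors(SAMPLE_AUTHZ):
    print(f"{kind}: {detail}")
```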

It would be nice, though, if the logs showed the reason and the route were marked as failed, so that the controller cleans up its resources and stops retrying the process for that route.

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.
