jetstack / jetstack-secure-gcm

Contains configuration and user guide for the Jetstack Secure for cert-manager offering on the Google Cloud Marketplace.

Home Page: https://platform.jetstack.io/docs/google-cloud-marketplace

Update to cert-manager 1.4

maelvls opened this issue · comments

Still to be done as of 6 July 2021:

  • Deprecate 1.1 and 1.3 in the Marketplace admin UI.
  • Have a review on #60
  • Have a review on #58
  • Re-submit again and again until the review passes
    • Attempt 1 (20 June 2021)
    • Refusal 1: I submitted 1.3 as the "default" version instead of 1.4 (my fault)
    • Attempt 2 (27 June 2021)
    • Refusal 2: issue with the transition from GoogleCASIssuer v1alpha1 -> v1beta1
    • Attempt 3 (29 June 2021)
    • Refusal 3 (29 June 2021): the testrunner fails with no clear indication of what is failing
    • Message from James Westby about our struggles with the testrunner (29 June 2021)
    • Google engineering team investigating a bug with the backend (6 July 2021)
    • Refusal 4 (7 July 2021): the "info" field was still present
    • Attempt 5 (8 July 2021), image not changed.
    • Refusal 5 (13 July 2021)

cert-manager v1.4.0 was released on 15 June 2021, and we want the jetstack-secure-for-cert-manager app on the Google Cloud Marketplace to be updated within a few days of each cert-manager release.

Following the "Cutting a new release" instructions, we will update the Google Cloud Marketplace app from 1.3.1 to 1.4.0.

⚠️ New roles have to be added to schema.yaml. To see what needs to be added:

# From the cert-manager repo
git diff origin/release-1.3..origin/release-1.4 deploy/charts/cert-manager

Rules to be added to the Roles for both the cainjector and controller service accounts:

  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    resourceNames: ["cert-manager-cainjector-leader-election", "cert-manager-cainjector-leader-election-core"]
    verbs: ["get", "update", "patch"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create"]
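For context, these rules typically live in a namespaced Role covering the leader-election leases (in cert-manager's upstream chart the leases are in kube-system, which is exactly the namespace the Marketplace tooling would not let us target). A minimal sketch, with illustrative names:

```yaml
# Sketch only: a namespaced Role granting the lease rules above.
# The Role name is illustrative; upstream cert-manager places
# leader-election leases in kube-system.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cert-manager-cainjector-leaderelection   # illustrative
  namespace: kube-system
rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    resourceNames: ["cert-manager-cainjector-leader-election", "cert-manager-cainjector-leader-election-core"]
    verbs: ["get", "update", "patch"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create"]
```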

ClusterRole needed:

rules:
  - apiGroups: ["certificates.k8s.io"]
    resources: ["certificatesigningrequests"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: ["certificates.k8s.io"]
    resources: ["certificatesigningrequests/status"]
    verbs: ["update"]
  - apiGroups: ["certificates.k8s.io"]
    resources: ["signers"]
    resourceNames: ["issuers.cert-manager.io/*", "clusterissuers.cert-manager.io/*"]
    verbs: ["sign"]
  - apiGroups: ["authorization.k8s.io"]
    resources: ["subjectaccessreviews"]
    verbs: ["create"]
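For these rules to take effect, the ClusterRole also needs a ClusterRoleBinding to the controller's ServiceAccount. A sketch, with illustrative names (not the exact chart output):

```yaml
# Sketch only: binding the new ClusterRole (containing the rules above)
# to the controller ServiceAccount. All names are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cert-manager-controller-certificatesigningrequests   # illustrative
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cert-manager-controller-certificatesigningrequests   # illustrative
subjects:
  - kind: ServiceAccount
    name: cert-manager
    namespace: cert-manager
```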

Estimation: 1 hour

Note: I opened GoogleCloudPlatform/marketplace-k8s-app-tools#564 to raise the issue of not being able to create a Role that targets the kube-system namespace.

I submitted 1.4.0-gcm.0 for review; it should be published by tomorrow.

The issues I encountered:

  1. I did not pay attention to the updates made to google-cas-issuer, although the change log is very clear. Notably, I failed to properly update from v1alpha1 to v1beta1.

  2. I struggled a lot with the now required leases resource, and I ended up using a ClusterRole with resourceNames instead of a Role, and opened an issue on mpdev: GoogleCloudPlatform/marketplace-k8s-app-tools#564.

  3. As usual, the thing that made me waste the most time was the fact that mpdev only shows status codes, not stdout or stderr:

    >>> Running /smoke-test.yaml
     >   0: kubectl smoke test
     PASSED
     >   1: Create test issuer and self signed cert
     PASSED
     >   2: Try to get new cert
     PASSED
     >   3: Try to get cert secret
     PASSED
     >   4: Delete issuer and self signed cert
     PASSED
     >   5: Create a GoogleCASIssuer and a certificate
     FAILED: Bash test failed > Unexpected exit status code > Should have equaled 0, but was 1
     >   6: Delete google CAS issuer and certificate
     FAILED: Bash test failed > Unexpected exit status code > Should have equaled 0, but was 1
     >> Summary: 2 FAILED, 5 PASSED

    No way to know what went wrong. It really feels like "unfinished" tooling 😥
    I also opened an issue about that: GoogleCloudPlatform/marketplace-k8s-app-tools#565.

  4. Finally, Google's upgrade of the Application CRD in app-crd.yaml from v1beta1 to v1 broke the application in 1.1 and 1.3 (#59). More specifically, we had:

    apiVersion: app.k8s.io/v1beta1
    kind: Application
    spec:
      descriptor:
        ...
        info: []

    It should have been:

    apiVersion: app.k8s.io/v1beta1
    kind: Application
    spec:
      descriptor:
        ...
      info: []

    This surfaced when Google upgraded the Application CRD from v1beta1 to v1 (that is, the version of the CRD object itself, not the version of the Application resource). After this change, the above Application manifest could no longer be applied. The error looked like this:

      error: error validating "/data/resources.yaml": error validating data:
      ValidationError(Application.spec.descriptor): unknown field "info" in
      io.k8s.app.v1beta1.Application.spec.descriptor; if you choose to ignore
      these errors, turn validation off with --validate=false

    My guess is that before this change, the faulty "info" field was not being validated, and the new v1 CRD version started validating it. I raised this pain point on their issue tracker: GoogleCloudPlatform/marketplace-k8s-app-tools#566

Update 29 June: (internal email)

The API version issue was resolved, and we noticed that the tester pod is failing at our verification service with the following error in the logs:

I0625 18:03:30.965105       1 main.go:86] >>> Running /smoke-test.yaml
I0625 18:03:30.966237       1 main.go:136]  >   0: kubectl smoke test
I0625 18:03:31.145790       1 main.go:141]  PASSED
I0625 18:03:31.145824       1 main.go:136]  >   1: Create test issuer and self signed cert
I0625 18:03:32.482440       1 main.go:141]  PASSED
I0625 18:03:32.482507       1 main.go:136]  >   2: Try to get new cert
E0625 18:03:32.884330       1 main.go:143]  FAILED: Bash test failed > Unexpected exit status code > Should have equaled 0, but was 1
I0625 18:03:32.884363       1 main.go:136]  >   3: Try to get cert secret
E0625 18:03:33.130651       1 main.go:143]  FAILED: Bash test failed > Unexpected exit status code > Should have equaled 0, but was 1
I0625 18:03:33.130706       1 main.go:136]  >   4: Delete issuer and self signed cert
E0625 18:03:33.648541       1 main.go:143]  FAILED: Bash test failed > Unexpected exit status code > Should have equaled 0, but was 1
I0625 18:03:33.648579       1 main.go:136]  >   5: Create a GoogleCASIssuer and a certificate
E0625 18:03:34.999642       1 main.go:143]  FAILED: Bash test failed > Unexpected exit status code > Should have equaled 0, but was 1
I0625 18:03:34.999676       1 main.go:136]  >   6: Delete google CAS issuer and certificate
E0625 18:03:36.026567       1 main.go:143]  FAILED: Bash test failed > Unexpected exit status code > Should have equaled 0, but was 1
E0625 18:03:36.026692       1 main.go:119]  >> Summary: 5 FAILED, 2 PASSED
I0625 18:03:36.026732       1 main.go:123]  >   0: kubectl smoke test: PASSED
I0625 18:03:36.026778       1 main.go:123]  >   1: Create test issuer and self signed cert: PASSED
E0625 18:03:36.026795       1 main.go:125]  >   2: Try to get new cert: FAILED
E0625 18:03:36.026802       1 main.go:125]  >   3: Try to get cert secret: FAILED
E0625 18:03:36.026807       1 main.go:125]  >   4: Delete issuer and self signed cert: FAILED
E0625 18:03:36.026812       1 main.go:125]  >   5: Create a GoogleCASIssuer and a certificate: FAILED
E0625 18:03:36.026818       1 main.go:125]  >   6: Delete google CAS issuer and certificate: FAILED
E0625 18:03:36.026824       1 main.go:95] >>> SUMMARY: 5 failed
ERROR SMOKE_TEST Tester 'Pod/smoke-test-pod' failed.

Can you make sure your application passes mpdev verify? Instructions: https://github.com/GoogleCloudPlatform/marketplace-k8s-app-tools/blob/master/docs/mpdev-references.md#smoke-test-an-application.

Please ensure that the tester pod completes with a zero exit status and resubmit the draft for a review. Let me know if you have any questions. Thank you.

Regards,
Dinesh

Note that the above-mentioned test cases are defined in smoke-test.yaml.
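For reference, each numbered test case above corresponds to an action in the marketplace tester config format, roughly shaped like the sketch below (the script body is an illustrative example, not the real test):

```yaml
# Rough shape of a smoke-test.yaml entry (illustrative script body).
actions:
  - name: Try to get cert secret
    bashTest:
      script: kubectl get secret test-cert-tls --namespace "$NAMESPACE"
      expect:
        exitCode:
          equals: 0
```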

Update 6 July: (internal email) our release of 1.4.0-gcm.0 is now waiting on Google. On 1 July 2021, Dinesh mentioned he was in contact with the engineering team.

Apologies for the delay here. I'm following up internally with Eng to see what's going wrong here -- I'll let you know once I get an answer. Thank you.

Today (13 July), Dinesh reported that the tests are still failing, this time giving us the sha256 of each failing image:

Apologies for the delay here. Your listing has 3 different versions on the Marketplace; the following two deployer images are failing due to the "info" field, which is not present in the CRD:

  • gcr.io/jetstack-public/jetstack-secure-for-cert-manager/deployer@sha256:732f49aac58fa25f73a5dd3a7a422f5e0520802b372676d8605a67d3a383480e
  • gcr.io/jetstack-public/jetstack-secure-for-cert-manager/deployer@sha256:d5e11520513313f08da87a58d44469aa0a0c4799ee798e4418dda321195bfe22

And the latest deployer image (gcr.io/jetstack-public/jetstack-secure-for-cert-manager/deployer@sha256:4fb179cf2a784dddb48ea86cf9e437c921b790ae060f84e16be373cc3ef108e4) is failing with the following error message:

CustomResourceDefinition.apiextensions.k8s.io "googlecasissuers.cas-issuer.jetstack.io" is invalid: status.storedVersions[0]: Invalid value: "v1alpha1": must appear in spec.versions

Please fix the above errors, validate the versions again using mpdev and resubmit the draft for approval. Thank you.

We now know that the testing infrastructure at Google is running mpdev verify sequentially on all existing versions (1.1, 1.3, 1.4). Previously, I thought the tests were only run for the latest version that we submitted.
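Given that, it is worth mirroring Google's sequential review locally before resubmitting. A hypothetical sketch (tags are illustrative; uncomment the mpdev line on a machine where mpdev is installed and configured):

```shell
#!/usr/bin/env sh
# Sketch: verify every published version locally, mirroring Google's
# sequential review. The registry path and tags are illustrative.
REGISTRY=gcr.io/jetstack-public/jetstack-secure-for-cert-manager
for TAG in 1.1 1.3 1.4; do
  echo "verifying $REGISTRY/deployer:$TAG"
  # mpdev verify --deployer="$REGISTRY/deployer:$TAG"
done
```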

I have now re-built and re-submitted all three images with the info field fix, and re-created a GitHub release draft.

But the fact that they run these three versions sequentially means that the v1alpha1 -> v1beta1 CRD upgrade of the Google CAS issuer breaks things, as reported in the above error (see this email for more details).

I'm not sure how to go about that. I'll ask @jakexks now.

I just tried mpdev verify and found out that it only removes namespaced resources, leaving all the cluster-scoped resources behind (as per set_ownership.py). This is because ownerReferences cannot point from a cluster-scoped resource to a namespaced owner such as the Application.
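To illustrate with hypothetical names: the deployer attaches ownerReferences like the one below so that deleting the Application garbage-collects its dependents, but Kubernetes only allows a cluster-scoped object to be owned by another cluster-scoped object, so ClusterRoles and similar resources cannot carry this reference to the namespaced Application:

```yaml
# Hypothetical example: a namespaced Deployment owned by the (namespaced)
# Application. Cluster-scoped resources such as ClusterRoles cannot carry
# this ownerReference, so mpdev's cleanup never reaches them.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cert-manager            # illustrative
  namespace: cert-manager
  ownerReferences:
    - apiVersion: app.k8s.io/v1beta1
      kind: Application
      name: jetstack-secure     # illustrative
      uid: 00000000-0000-0000-0000-000000000000   # filled in by the deployer
spec: {}   # trimmed for brevity
```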

I still have no idea how to work around this issue 😞

1.1, 1.3 and 1.4 were accepted last night!!