hashicorp / vault-secrets-operator

The Vault Secrets Operator (VSO) allows Pods to consume Vault secrets natively from Kubernetes Secrets.

Secrets (still) not being renewed occasionally

andrejvanderzee opened this issue · comments

Already reported here #364. Now on VSO version 0.4.3.

We keep having issues with secrets not being renewed by VSO. As a result, Vault revokes the secret and the consuming Pods fail to authenticate. Note that it seems nondeterministic: in some edge cases VSO just does not renew, although in many cases it works as expected.

For example this VDS:

apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultDynamicSecret
metadata:
  name: grafana-rds
  namespace: somenamespace
spec:
  destination:
    create: true
    name: grafana-vso-rds
  mount: shared/rds
  path: creds/eks-service-grafana-owner
  renewalPercent: 67
  revoke: true
  rolloutRestartTargets:
  - kind: Deployment
    name: grafana
  vaultAuthRef: grafana
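
As a rough sanity check (this is not necessarily VSO's exact renewal formula), with renewalPercent: 67 and the 336h (1,209,600s) default_ttl configured on the Vault role below, we would expect a renewal attempt roughly two thirds of the way into each lease:

$ echo $(( 1209600 * 67 / 100 ))   # 67% of a 336h lease, in seconds
810432

That is about 225h, i.e. roughly 9.4 days into each 14-day lease, well before expiry.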

Here the Vault role:

$ vault read shared/rds/roles/eks-service-grafana-owner
Key                      Value
---                      -----
creation_statements      [CREATE ROLE "{{name}}" WITH LOGIN INHERIT ENCRYPTED PASSWORD '{{password}}' ALTER ROLE "{{name}}" SET search_path = koperit, public GRANT grafana_owner TO "{{name}}"]
credential_type          password
db_name                  postgres
default_ttl              336h
max_ttl                  2191h30m
renew_statements         []
revocation_statements    [REVOKE grafana_owner FROM "{{name}}" DROP ROLE "{{name}}"]
rollback_statements      []

We have already set the max_ttl very high to prevent this issue from occurring too often.
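
To rule out a problem on the Vault side, a live lease can also be renewed manually outside of VSO. A minimal check (the lease ID placeholder is illustrative; the requested increment is only a hint and Vault caps it at the role's max_ttl):

$ vault lease renew -increment=336h shared/rds/creds/eks-service-grafana-owner/<lease-id>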

Once Pods start failing to authenticate, we have to manually re-deploy the VDS, until eventually it breaks again.

Hi @andrejvanderzee, sorry to hear that you are experiencing issues with VSO. Would you mind providing us with an example of the running VDS custom resource?

kubectl get -o yaml vaultdynamicsecrets.secrets.hashicorp.com -n your-ns vds-instance

Also, including any related logs would be great.

For VaultAuth, does the backend have a max_ttl configured or is it configured to issue periodic tokens?

Thanks,

Ben

@benashz I missed your message, but let me reply now.

We are using the Kubernetes auth method which uses periodic tokens:

$ vault read auth/shared/eks/config
Key                       Value
---                       -----
disable_iss_validation    true
disable_local_ca_jwt      true
issuer                    n/a
kubernetes_ca_cert        xxxxx
kubernetes_host           https://xxxxx.gr7.eu-central-1.eks.amazonaws.com
pem_keys                  []
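
For reference, the login VSO performs against this mount boils down to something like the following (a sketch only; the role name and JWT path are illustrative, and not exactly how VSO obtains the JWT):

$ vault write auth/shared/eks/login role=perfana-grafana jwt=@/var/run/secrets/kubernetes.io/serviceaccount/token

The token returned by this call is the one whose renewal (or failure to renew) is in question here.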

Here is the output of a running VDS:

$ kubectl get -o yaml vaultdynamicsecrets.secrets.hashicorp.com -n companyx grafana-rds
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultDynamicSecret
metadata:
  creationTimestamp: "2024-02-16T16:39:40Z"
  finalizers:
  - vaultdynamicsecret.secrets.hashicorp.com/finalizer
  generation: 1
  name: grafana-rds
  namespace: companyx
  resourceVersion: "333726605"
  uid: 556d245a-9a51-457a-a112-0c90717f9353
spec:
  destination:
    create: true
    name: grafana-vso-rds
    overwrite: false
  mount: shared/rds
  path: creds/eks-service-grafana-owner
  renewalPercent: 67
  revoke: true
  rolloutRestartTargets:
  - kind: Deployment
    name: grafana
  vaultAuthRef: grafana
status:
  lastGeneration: 1
  lastRenewalTime: 1712043025
  lastRuntimePodUID: 8befb254-5ff1-49a9-8fe6-8ff615b1e582
  secretLease:
    duration: 1209600
    id: shared/rds/creds/eks-service-grafana-owner/RtJlTNf9snR3BtCdZ3fDxbXG
    renewable: true
    requestID: 97669934-5e75-a79e-1907-e5cbc89abb42
  staticCredsMetaData:
    lastVaultRotation: 0
    rotationPeriod: 0
    rotationSchedule: ""
    ttl: 0
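
Decoding the status fields above as a sanity check (not VSO's internal logic): lastRenewalTime is a Unix timestamp and duration is the lease TTL in seconds, so the next renewal attempt should land at roughly renewalPercent of the duration after that timestamp:

$ date -u -d @1712043025          # lastRenewalTime
Tue Apr  2 07:30:25 UTC 2024
$ echo $(( 1209600 * 67 / 100 ))  # expected offset of the next renewal attempt
810432

That would put the next renewal somewhere around April 11th.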

It is really driving us to the point of increasing the TTLs on the leases and adding a CronJob to delete expiring secrets, which is obviously a terrible approach! It would be nice to find out what is going on here.

For now I have no logs, but there were no ERROR messages at all.

And some additional info about the Kubernetes auth method:

$ vault read /sys/mounts/auth/shared/eks/tune
Key                  Value
---                  -----
default_lease_ttl    768h
description          n/a
force_no_cache       false
max_lease_ttl        768h
token_type           default-service
$ vault read auth/shared/eks/role/perfana-grafana
Key                                 Value
---                                 -----
alias_name_source                   serviceaccount_uid
bound_service_account_names         [perfana-grafana]
bound_service_account_namespaces    [companyx]
token_bound_cidrs                   []
token_explicit_max_ttl              0s
token_max_ttl                       0s
token_no_default_policy             false
token_num_uses                      0
token_period                        0s
token_policies                      [shared-eks-service-perfana-grafana]
token_ttl                           24h
token_type                          default
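
For completeness, this is how I would inspect the actual client token VSO obtains through this role (assuming a shell where that token is exported as VAULT_TOKEN; the fields to look at are renewable, ttl, explicit_max_ttl and period):

$ vault token lookup
$ vault token renew    # should succeed and reset the TTL while the token is still valid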

Hi @andrejvanderzee - thanks for the extra context here. We currently have an open issue where a failed renewal of the periodic token does not trigger the associated dynamic secret resources to be synced. In the case where the auth token is expired/revoked, all leases associated with it will also be revoked. My guess would be that this is the issue you are facing. The issue is most apparent when you have long secret leases; having shorter lease TTLs on the secrets would result in a new Vault token being created. We are hoping to have that fix out soon.
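
One way to confirm that this is what is happening would be to look up the lease from the VDS status after the Secret stops being refreshed (assuming a token with sufficient permissions on sys/leases/lookup); if the auth token was revoked, the lease lookup should fail as well:

$ vault lease lookup shared/rds/creds/eks-service-grafana-owner/RtJlTNf9snR3BtCdZ3fDxbXG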

@benashz Thanks for your update.

I understand the scenario, but isn't the question why the lifetimeWatcher fails to renew the periodic Vault token when token_explicit_max_ttl is not set on the auth method's role? Shouldn't VSO just retry in the next reconciliation loop in case of a temporary hiccup?

The pull request reads to me as a fallback for the scenario in which the Vault token is not periodic or token_explicit_max_ttl is set, and the token with its associated secret leases is eventually revoked. But this is not the scenario that I am facing (although the pull request would fix my issue).