Secrets (still) not being renewed occasionally
andrejvanderzee opened this issue
This was already reported in #364. We are now on VSO version 0.4.3.
We keep having issues with secrets not being renewed by VSO. As a result, Vault revokes the secret and the consuming Pods fail to authenticate. Note that this seems nondeterministic: in some edge cases VSO simply does not renew, although in many cases it works as expected.
For example, this VDS:
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultDynamicSecret
metadata:
  name: grafana-rds
  namespace: somenamespace
spec:
  destination:
    create: true
    name: grafana-vso-rds
  mount: shared/rds
  path: creds/eks-service-grafana-owner
  renewalPercent: 67
  revoke: true
  rolloutRestartTargets:
  - kind: Deployment
    name: grafana
  vaultAuthRef: grafana
Here is the Vault role:
$ vault read shared/rds/roles/eks-service-grafana-owner
Key                      Value
---                      -----
creation_statements      [CREATE ROLE "{{name}}" WITH LOGIN INHERIT ENCRYPTED PASSWORD '{{password}}' ALTER ROLE "{{name}}" SET search_path = koperit, public GRANT grafana_owner TO "{{name}}"]
credential_type          password
db_name                  postgres
default_ttl              336h
max_ttl                  2191h30m
renew_statements         []
revocation_statements    [REVOKE grafana_owner FROM "{{name}}" DROP ROLE "{{name}}"]
rollback_statements      []
We already set max_ttl very high to prevent this issue from occurring too often.
Once Pods start failing to authenticate, we have to manually re-deploy the VDS, and eventually it breaks again.
Hi @andrejvanderzee, sorry to hear that you are experiencing issues with VSO. Would you mind providing us with an example of the running VDS custom resource? For instance:
kubectl get -o yaml vaultdynamicsecrets.secrets.hashicorp.com -n your-ns vds-instance
Also, including any related logs would be great.
For VaultAuth, does the backend have a max_ttl configured, or is it configured to issue periodic tokens?
Thanks,
Ben
@benashz I missed your message, but let me reply now.
We are using the Kubernetes auth method which uses periodic tokens:
$ vault read auth/shared/eks/config
Key                       Value
---                       -----
disable_iss_validation    true
disable_local_ca_jwt      true
issuer                    n/a
kubernetes_ca_cert        xxxxx
kubernetes_host           https://xxxxx.gr7.eu-central-1.eks.amazonaws.com
pem_keys                  []
Here is the output of a running VDS:
$ kubectl get -o yaml vaultdynamicsecrets.secrets.hashicorp.com -n companyx grafana-rds
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultDynamicSecret
metadata:
  creationTimestamp: "2024-02-16T16:39:40Z"
  finalizers:
  - vaultdynamicsecret.secrets.hashicorp.com/finalizer
  generation: 1
  name: grafana-rds
  namespace: companyx
  resourceVersion: "333726605"
  uid: 556d245a-9a51-457a-a112-0c90717f9353
spec:
  destination:
    create: true
    name: grafana-vso-rds
    overwrite: false
  mount: shared/rds
  path: creds/eks-service-grafana-owner
  renewalPercent: 67
  revoke: true
  rolloutRestartTargets:
  - kind: Deployment
    name: grafana
  vaultAuthRef: grafana
status:
  lastGeneration: 1
  lastRenewalTime: 1712043025
  lastRuntimePodUID: 8befb254-5ff1-49a9-8fe6-8ff615b1e582
  secretLease:
    duration: 1209600
    id: shared/rds/creds/eks-service-grafana-owner/RtJlTNf9snR3BtCdZ3fDxbXG
    renewable: true
    requestID: 97669934-5e75-a79e-1907-e5cbc89abb42
  staticCredsMetaData:
    lastVaultRotation: 0
    rotationPeriod: 0
    rotationSchedule: ""
    ttl: 0
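As a rough cross-check of the status above, the following Python sketch estimates when the next renewal would be due. It assumes the renewal point is simply renewalPercent of the lease duration, counted from lastRenewalTime. That formula is an assumption for illustration only: VSO may add jitter or compute the horizon differently, and the function name renewal_window is hypothetical. The lease duration of 1209600 seconds matches the role's default_ttl of 336h.

```python
from datetime import datetime, timezone


def renewal_window(last_renewal_time: int, lease_duration: int, renewal_percent: int):
    """Return (renewal_due, lease_expiry) as Unix timestamps.

    Assumption: renewal is due once renewal_percent of the lease duration
    has elapsed since the last renewal. Any jitter is ignored.
    """
    renewal_due = last_renewal_time + lease_duration * renewal_percent // 100
    lease_expiry = last_renewal_time + lease_duration
    return renewal_due, lease_expiry


# Values taken from the VaultDynamicSecret status in this comment.
due, expiry = renewal_window(1712043025, 1209600, 67)
for label, ts in (("renewal due", due), ("lease expiry", expiry)):
    print(label, datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())
```

If the lease is still listed in Vault after the "renewal due" time without lastRenewalTime having advanced, that would point at a missed renewal rather than a revocation from elsewhere.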
This is really driving us to the point of increasing the TTLs on the leases and running a CronJob to delete expiring secrets, which is obviously a terrible approach! It would be nice to find out what is going on here.
For now I have no logs, but there were no ERROR messages at all.
And some additional info about the Kubernetes auth method:
$ vault read /sys/mounts/auth/shared/eks/tune
Key                  Value
---                  -----
default_lease_ttl    768h
description          n/a
force_no_cache       false
max_lease_ttl        768h
token_type           default-service
$ vault read auth/shared/eks/role/perfana-grafana
Key                                 Value
---                                 -----
alias_name_source                   serviceaccount_uid
bound_service_account_names         [perfana-grafana]
bound_service_account_namespaces    [companyx]
token_bound_cidrs                   []
token_explicit_max_ttl              0s
token_max_ttl                       0s
token_no_default_policy             false
token_num_uses                      0
token_period                        0s
token_policies                      [shared-eks-service-perfana-grafana]
token_ttl                           24h
token_type                          default
Hi @andrejvanderzee - thanks for the extra context here. We currently have an open issue where a failed renewal of the periodic token does not trigger the associated dynamic secret resources to be synced. When the auth token expires or is revoked, all leases associated with it are revoked as well. My guess is that this is the issue you are facing. It is most apparent when you have long secret leases; with shorter lease TTLs on the secrets, a new Vault token would be created. We are hoping to have that fix out soon.
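The interaction described here can be modeled in a few lines. The Python sketch below assumes that a lease ends at whichever comes first, its own expiry or the expiry of the token that created it, and that a token whose renewals keep failing lapses one TTL after its last successful renewal. The function and parameter names are hypothetical, and the timeline reuses the 24h token_ttl and 336h lease TTL from the outputs above purely for illustration.

```python
HOUR = 3600


def effective_lease_end(lease_start: int, lease_ttl: int, token_expiry: int) -> int:
    """Time at which the lease stops being valid.

    Assumption: Vault revokes a lease when the token that created it
    expires, so the lease ends at the earlier of the two expiries.
    """
    return min(lease_start + lease_ttl, token_expiry)


def token_expiry_after_failed_renewal(last_successful_renewal: int, token_ttl: int) -> int:
    """If every later renewal fails, the token lapses one TTL after the last success."""
    return last_successful_renewal + token_ttl


# Illustrative timeline in seconds: the lease is issued at t=0, the token is
# last renewed successfully at t=48h, and every renewal after that fails.
token_end = token_expiry_after_failed_renewal(48 * HOUR, 24 * HOUR)
lease_end = effective_lease_end(0, 336 * HOUR, token_end)
print(f"lease ends after {lease_end // HOUR}h instead of 336h")
```

With a long lease TTL, the secret lease itself is nowhere near its renewal point when the token lapses, which is why nothing prompts a re-sync before the credentials stop working.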
@benashz Thanks for your update.
I understand the scenario, but isn't the question why the lifetimeWatcher fails to renew the periodic Vault token when token_explicit_max_ttl is not set on the auth method's role? Shouldn't VSO just retry in the next reconciliation loop in case of a temporary hiccup?
The pull request reads to me as a fallback for the scenario in which the Vault token is not periodic, or token_explicit_max_ttl is set, and the token with its associated secret leases is eventually revoked. But this is not the scenario that I am facing (although the pull request would fix my issue).
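The retry behavior asked about above could look roughly like the following Python sketch. It is not VSO's implementation: renew_with_retry, the renew callable, and the backoff parameters are all hypothetical names chosen for illustration. The idea is to retry a failed token renewal with capped exponential backoff while the token is still valid, and to report failure only once the token's expiry has passed, at which point the caller would have to log in again and re-sync the dependent secrets.

```python
import time
from typing import Callable


def renew_with_retry(
    renew: Callable[[], None],
    deadline: float,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    now: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call renew() until it succeeds or the deadline passes.

    Returns True on success and False if the deadline (the token's expiry,
    measured on the same clock as now()) is reached first.
    """
    delay = base_delay
    while now() < deadline:
        try:
            renew()
            return True
        except Exception:
            # Treat every failure as transient; wait, but never past the deadline.
            remaining = deadline - now()
            if remaining <= 0:
                break
            sleep(min(delay, remaining))
            delay = min(delay * 2, max_delay)
    return False
```

The clock and sleep functions are injectable so the loop can be exercised without real waiting; a False return is the signal that a fallback such as the one in the pull request is needed.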