vmware-tanzu / cluster-api-provider-bringyourownhost

Kubernetes Cluster API Provider BYOH for already-provisioned hosts running Linux.

Issues trying to backup and restore BYOH

natitomattis opened this issue

What steps did you take and what happened:
I'm trying to define a procedure to back up and restore my Cluster API management cluster. I'm using Velero to back up all the cluster resources and restore them in the new cluster.

The procedure I followed was:

  1. Create a cluster with 3 byoh hosts
  2. Pause the cluster reconciliation
  3. Generate a backup with velero with all the files
  4. Destroy the management cluster and recreate using the same set of certificates
  5. Restore the backup.
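
Steps 2 and 3 above can be sketched as the two objects they involve. This is only a minimal sketch, not the reporter's actual commands: it assumes the cluster is paused through the Cluster API `spec.paused` field and that the backup is expressed as a Velero `Backup` resource. The cluster name `my-cluster`, the backup name, the namespace list, and both helper functions are hypothetical.

```python
import json


def pause_patch(paused: bool = True) -> dict:
    """JSON merge patch that sets spec.paused on a Cluster API Cluster object."""
    return {"spec": {"paused": paused}}


def velero_backup(name: str, namespaces: list) -> dict:
    """Velero Backup manifest covering the given namespaces (all names hypothetical)."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Backup",
        "metadata": {"name": name, "namespace": "velero"},
        "spec": {"includedNamespaces": namespaces},
    }


if __name__ == "__main__":
    # Step 2: the patch could be applied with something like
    #   kubectl patch cluster my-cluster --type merge -p '<patch>'
    print(json.dumps(pause_patch()))
    # Step 3: the manifest could be applied with kubectl apply -f <file>
    print(json.dumps(velero_backup("mgmt-backup", ["default", "byoh-system"]), indent=2))
```

Setting `spec.paused` back to `false` after the restore would resume reconciliation.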

The result is that Velero was not able to create the ByoHost resources:

time="2022-12-06T10:19:13Z" level=error msg="error restoring XXXX: admission webhook \"vbyohost.kb.io\" denied the request: system:serviceaccount:velero:velero cannot create/update resource XXXX"

  • Is there any step that I'm doing wrong? Is there any procedure to back up and recover ByoHosts without reprovisioning the nodes?

What did you expect to happen:
I expected all the resources to be restored successfully.

Environment:

  • Cluster-api-provider-bringyourownhost version: 0.3.0
  • Kubernetes version: (use kubectl version --short): v1.23.5
  • OS (e.g. from /etc/os-release): Ubuntu 20.04.4 LTS

Hi @natitomattis, thanks for trying out BYOH. For ByoHost CRs we have a validating webhook [byohost_webhook.go] that permits only the manager service account or the respective Host-Agent user to perform create/update operations. Other systems such as Velero creating BYOH CRs is indeed something we did not consider during implementation.

We will analyze this for potential fixes.

@dharmjit Thanks! I tried with the manager service account but it doesn't allow me to create a new byoh; only the Host-Agent user is allowed to create one, and this complicates the implementation of the restore process.

Please let me know if I can help in an implementation for a fix 👍🏼

I tried with the manager service account but it doesn't allow me to create a new byoh

Yes, only updates are allowed for the manager SA. This is primarily for security reasons: we intended that only agents could register/create the CRs.
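
The rule described so far (the manager SA may update, a host agent may create or update its own CR, everyone else is denied) can be modelled as a small decision function. This is a toy model for discussion, not the code in byohost_webhook.go: the manager service account name, the `byoh:host:` agent prefix, and the rule that an agent's username must match the host name are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed identities, for illustration only; the real values live in byohost_webhook.go.
MANAGER_SA = "system:serviceaccount:byoh-system:byoh-controller-manager"
AGENT_PREFIX = "byoh:host:"


@dataclass
class Decision:
    allowed: bool
    reason: str = ""


def validate_byohost(operation: str, username: str, host_name: str) -> Decision:
    """Toy model of the create/update rule described in this thread."""
    # The manager service account may update ByoHost CRs but not create them.
    if username == MANAGER_SA and operation == "UPDATE":
        return Decision(True)
    # A host agent may create or update only the CR named after its own host.
    if username.startswith(AGENT_PREFIX) and username[len(AGENT_PREFIX):] == host_name:
        return Decision(True)
    return Decision(False, f"{username} cannot create/update resource {host_name}")


if __name__ == "__main__":
    # Velero restores objects under its own service account, so it falls
    # through both allow rules and is denied.
    print(validate_byohost("CREATE", "system:serviceaccount:velero:velero", "host1"))
    print(validate_byohost("CREATE", MANAGER_SA, "host1"))
    print(validate_byohost("UPDATE", MANAGER_SA, "host1"))
```

Under this model, a restore run by any identity other than the host agent itself fails on create, which matches the error in the Velero log.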

Please let me know if I can help in an implementation for a fix 👍🏼

Sure, that would be great. You could assign this issue to yourself, and I would suggest discussing the implementation in this issue before raising a PR.

@natitomattis This is a great use case, thank you for reporting!

Perhaps we should allow the admin to set trusted users / serviceaccounts to be able to create / update BYOHost CRs. The byohost_webhook needs to be flexible to do that. Alternatively, we could delegate byohost admission control to something more powerful like Kyverno, where you can write policies that restrict which users / serviceaccounts may perform certain operations under certain conditions.

Example - https://kyverno.io/docs/writing-policies/validate/#block-changes-to-a-custom-resource

This will not only reduce code complexity in the repo, but also provide flexibility to the user.

@natitomattis @dharmjit Let me know what you think and if you need help in doing a feasibility check.
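
The first option in this comment (an admin-configured list of trusted users / serviceaccounts) could look roughly like the sketch below. It is only a sketch under assumptions: the comma-separated setting, the function names, and the `byoh:host:` prefix used by the stand-in check are hypothetical and do not exist in the provider today.

```python
from typing import Callable


def parse_trusted_users(raw: str) -> frozenset:
    """Parse a comma-separated list of usernames (hypothetical admin setting)."""
    return frozenset(part.strip() for part in raw.split(",") if part.strip())


def admit(username: str, trusted: frozenset, builtin_check: Callable[[str], bool]) -> bool:
    """Admit trusted users outright; everyone else goes through the existing rule."""
    return username in trusted or builtin_check(username)


if __name__ == "__main__":
    trusted = parse_trusted_users("system:serviceaccount:velero:velero")

    # Stand-in for the webhook's current rule (assumed): only host agents pass.
    def agents_only(user: str) -> bool:
        return user.startswith("byoh:host:")

    print(admit("system:serviceaccount:velero:velero", trusted, agents_only))
    print(admit("system:serviceaccount:default:other", trusted, agents_only))
```

Keeping the trusted list empty by default would preserve the current behaviour, so only admins who opt in would widen who can create ByoHost CRs.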

@anusha94 I like it, but I think using kyverno will add an extra requirement for running the byoh provider, I don't know how this should be handled.

Can't we use RBAC rules to restrict the creation/update of byoh CRs to the hosts and the manager, and remove the policies from the admission webhook ?

Perhaps we should allow the admin to set trusted users / serviceaccounts to be able to create / update BYOHost CRs. The byohost_webhook needs to be flexible to do that

Thanks @anusha94 for these suggestions. I am thinking along these lines as well, but at the same time I would prefer it if there were something in the velero itself to figure out dependencies and order the various operation (for example, applying the admission webhooks later), or some other way to handle this.

@anusha94 I like it, but I think using kyverno will add an extra requirement for running the byoh provider, I don't know how this should be handled.

Agreed, IMO it's better if we could solve it without other prerequisites.

@natitomattis Yup, you would have to install it once the management cluster is created, as it is outside the scope of Cluster API. And I agree that just for this use case it's a bit of a stretch, unless you need more policy configuration (though it is likely that in a prod environment you would need an external tool anyway).

something in the velero itself to figure out dependencies and order the various operation

There seems to be a restore order, but CRD restore really has to be first (which means the webhook also gets restored as part of this) and eventually blocks CR creation.

Can't we use RBAC rules to restrict the creation/update of byoh CRs to the hosts and the manager, and remove the policies from the admission webhook ?

We did consider adding User (host) to byohost_editor_clusterrolebinding so that we can control this through RBAC. Now I can't remember why we preferred the webhook approach. @dharmjit ??

Another naive way I can think of is to add labels / annotations to the byohost CR and have the webhook make an exception for those. But we need to consider what happens when a host is released back into the capacity pool (labels & annotations get cleaned up) and reclaimed again by a byomachine.

We did consider adding User (host) to byohost_editor_clusterrolebinding so that we can control this through RBAC. Now I can't remember why we preferred the webhook approach. @dharmjit ??

IIRC, we did not want to maintain a large number of Roles/RoleBindings, one per host, and if we only created a RoleBinding to the existing ClusterRole (byohost-editor-role), it would have cluster-wide permissions anyway. So we decided to handle that in the admission webhook.
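
To make the maintenance cost concrete, here is a sketch of what per-host RBAC objects might look like if they were generated for every host. The API group `infrastructure.cluster.x-k8s.io`, the `byohosts` resource name, the `byoh:host:<name>` user format, and the object naming are assumptions for illustration. Note that Kubernetes RBAC cannot restrict `create` requests by resource name, so the sketch only covers `get`, `update`, and `patch`.

```python
RBAC_GROUP = "rbac.authorization.k8s.io"


def host_rbac(host: str, namespace: str = "default") -> list:
    """Build a Role and RoleBinding scoped to a single ByoHost (names are assumed)."""
    role_name = f"byohost-{host}"
    role = {
        "apiVersion": f"{RBAC_GROUP}/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": ["infrastructure.cluster.x-k8s.io"],
                "resources": ["byohosts"],
                # resourceNames limits the rule to this host's CR; it cannot
                # be used to restrict the create verb.
                "resourceNames": [host],
                "verbs": ["get", "update", "patch"],
            }
        ],
    }
    binding = {
        "apiVersion": f"{RBAC_GROUP}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": role_name, "namespace": namespace},
        "subjects": [{"kind": "User", "name": f"byoh:host:{host}", "apiGroup": RBAC_GROUP}],
        "roleRef": {"apiGroup": RBAC_GROUP, "kind": "Role", "name": role_name},
    }
    return [role, binding]


if __name__ == "__main__":
    hosts = ["host1", "host2", "host3"]
    objects = [obj for h in hosts for obj in host_rbac(h)]
    # Two RBAC objects per host: the count grows linearly with the pool size.
    print(len(objects))
```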

There seems to be a restore order, but CRD restore really has to be first (which means the webhook also gets restored as part of this) and eventually blocks CR creation.

I guess we could try keeping the ValidatingWebhookConfiguration towards the end of the restore order. cc: @natitomattis
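
The ordering idea can be illustrated with a small model: some resource types are restored first, some are pushed to the end, and everything else keeps its relative order in between. This is not Velero's implementation, only a sketch of the intended effect. The Velero server does expose a `--restore-resource-priorities` setting, but whether it can place webhook configurations last in the version used here is an assumption that would need to be verified.

```python
def restore_order(resources: list, first: list, last: list) -> list:
    """Order resource types: `first` entries, then the rest, then `last` entries."""

    def rank(resource: str) -> tuple:
        if resource in first:
            return (0, first.index(resource))
        if resource in last:
            return (2, last.index(resource))
        # Unlisted resources share one rank; sorted() is stable, so they
        # keep their original relative order.
        return (1, 0)

    return sorted(resources, key=rank)


if __name__ == "__main__":
    backup_contents = [
        "validatingwebhookconfigurations",
        "byohosts",
        "customresourcedefinitions",
        "clusters",
    ]
    ordered = restore_order(
        backup_contents,
        first=["customresourcedefinitions"],
        last=["validatingwebhookconfigurations"],
    )
    print(ordered)
```

With this ordering, the ByoHost CRs would be created after their CRD exists but before the validating webhook is registered, so the webhook would not be in place to reject them.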