harvester / harvester

Open source hyperconverged infrastructure (HCI) software

Home Page: https://harvesterhci.io/

KubeVirt Certificates Expired

richevanscybermyte opened this issue · comments

Describe the bug
Cannot create, update, or change VMs or images.

To Reproduce
Steps to reproduce the behavior:

  1. Run a Harvester cluster on 1.1.2 for more than a year.
  2. Create a new VM or try to configure a new image.

Expected behavior
The new VM is created successfully and can be booted.
The qcow image is downloaded and ready for use.

Support bundle
Will be provided upon request.

Environment

  • Harvester ISO version: 1.1.2
  • Underlying Infrastructure: bare-metal Supermicro 7-node cluster (3 controller nodes and 4 worker nodes)

Additional context
When adding a new VM I noticed that the VM was not being created. I decided to try deleting and then downloading the backing image again; I had previously downloaded the image a few weeks ago. After deleting it and trying to add the image back, it also got stuck trying to download.
Troubleshooting brought me to look into the secrets. I found that all of the KubeVirt certificates had expired and were not automatically reprovisioned. The KubeVirt CA is also expired, which could prevent the other certs from rolling.
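For anyone wanting to confirm the same thing, here is a minimal sketch of how the expiry dates can be read straight out of those secrets. It assumes the certificates are stored under the conventional tls.crt key and uses the label selector that comes up later in this thread:

# List the KubeVirt cert secrets and print each certificate's notAfter date.
# Assumption: each secret stores the certificate under the tls.crt key.
for s in $(kubectl get secrets -n harvester-system \
    -l app.kubernetes.io/component=kubevirt -o name); do
  echo "== $s =="
  kubectl get "$s" -n harvester-system -o jsonpath='{.data.tls\.crt}' \
    | base64 -d | openssl x509 -noout -enddate
done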

Attached is a screenshot of Rancher showing all the certs expired for KubeVirt.
Screenshot from 2024-05-13 20-14-16

Please advise on how to roll the KubeVirt CA and sub certs.

Thank You,

Rich

Wow - same issue actually - checking for a solution, but hopefully somebody can come up with one quickly...

Well, I am sure there will be more as lots of people started installing 1.1.2 right around this time last year.

I'm on 1.2.1 - and same issue - so I guess upgrading Harvester doesn't fix this either - so I would expect a few people being affected. In theory, deleting the secrets and deleting/restarting the virt-operator pod should recreate the certificates (at least from my understanding of a typical KubeVirt installation) - but I'll wait for someone from Harvester to confirm...

Thanks @richevanscybermyte for bringing this up.

In theory, deleting the secrets and deleting/restarting the virt-operator pod should recreate the certificates (at least from my understanding of a typical KubeVirt installation) - but I'll wait for someone from Harvester to confirm...

Thanks for the comment @RegisHubelia 😃

We have two options to solve the expired certs.

  1. Run kubectl delete secrets -n harvester-system -l app.kubernetes.io/component=kubevirt. The KubeVirt secrets will be automatically recreated within several seconds.
  2. Alternatively, you can run kubectl edit kubevirt -n harvester-system and modify the .spec.certificateRotateStrategy to enable automatic certificate rotation. e.g.
spec:
  certificateRotateStrategy:
    selfSigned:
      ca:
        duration: 336h
        renewBefore: 24h
      server:
        duration: 168h
        renewBefore: 12h
  • duration: The requested 'duration' (i.e. lifetime) of the Certificate.
  • renewBefore: The amount of time before the currently issued certificate's "notAfter" time that we will begin to attempt to renew the certificate.

[Update]: Option 2 is not strictly required. If certificateRotateStrategy is left empty (i.e. certificateRotateStrategy: {}), KubeVirt applies default values for duration (168 hr) and renewBefore (168 hr * 0.2), so certificate auto-rotation should be enabled from the beginning unless this field was modified.
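To see which strategy a cluster is actually using, the field can be read directly from the KubeVirt CR referenced in option 2. A small sketch (an empty result or {} means the KubeVirt defaults apply):

# Print the configured certificate rotation strategy from the KubeVirt CR.
kubectl get kubevirt -n harvester-system \
  -o jsonpath='{.items[0].spec.certificateRotateStrategy}'; echo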

Hi @brandboat,

Thank you very much for your reply.

Just to be clear - based on your update section above, we shouldn't have this issue to begin with as there is already a default value? I've checked my current config and there are currently no values set, and I'm 100% positive nobody played around with this...
screenshot_2024-05-14-072456

I went with option 2 first - just for the sake of testing it - and the certificates were actually renewed within 2 minutes, which I deemed the cleaner solution IMO. So it does seem to fix the immediate "issue", though if there is already a default value and it wasn't working, there might be something up at the time of renewing/rotating the certs. I left it with your values, so the CA will expire on the 28th of May and the server certs on the 21st. I will monitor this and see if they get renewed automatically or if manual intervention is needed, and report back.

With all that said, looking at @richevanscybermyte's screenshot and my own certs - none of the certificates are actually expired (I think?) - it basically says the server certs will expire at 4:26pm today and the CA cert on the 18th of May. Quite the coincidence that we literally had the exact same values, down to the second, for all our certs (server and CA). But based on what you are saying @brandboat, they should have been renewed 33.6h before expiry (168 hr * 0.2) - and this did not happen, as they were still showing today at 4:26:43 when I checked at 7:30am this morning. Maybe they would have renewed at some point... This is why I will monitor with the current values and report back when we get there, because if it is indeed an issue, it could cause quite a bit of trouble in scenarios where, for example, we upgrade Harvester (I would think).

@richevanscybermyte - was your local time > 4:26pm? Just asking, as my renewed certs still show red even though they aren't expiring for the next 7 days - which could indeed cause confusion (I fell for it; they should be yellow as a warning when not yet expired and red for expired certs, IMO).

screenshot_2024-05-14-075903

Kind Regards

@RegisHubelia, you are correct, my certs are showing red even though they have not expired. It is interesting though, because the certs are still set to expire this afternoon and we are within the 12h default window for expiration.

So if it isn't the certs, then there is another issue behind the following errors I get when trying to create new objects in the Harvester cluster.

(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "5b9e5f0dc859155fff7714134718c932a91a0c1141876f4f6ef516c13ee8b2be": plugin type="multus" name="multus-cni-network" failed (add): Multus: [cattle-logging-system/rancher-logging-root-fluentd-0/9b43803c-a97d-4da8-b992-9427b42a5988]: error getting pod: Unauthorized

and another:

(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "1f24e84341abf445c83f7ad80e426df1b25cdadc82a0c09f9e57ff00b3a62820": plugin type="multus" name="multus-cni-network" failed (add): Multus: [longhorn-system/backing-image-manager-d7ad-cde2/78c7e7c0-9453-4abd-ad6f-c0e3f60f3cf0]: error getting pod: Unauthorized

I found another article that talked about a token for canal expiring. I followed the steps to delete the pods so they would be recreated: kubectl delete pod -l k8s-app=canal. That did not fix my issue. The Unauthorized errors I am getting make me think the token issue is real, but I don't know which token is expired or how to roll it.

So now I might have 2 issues:

  1. I am within my 12-hour window and my KubeVirt certs are not rolling (I will use the suggestions above to roll the certs).
  2. I have an expired token somewhere that isn't being rolled automagically.
    • I could do a redeploy of the Multus/Canal pods and see what that gets me.
      My biggest issue is I do not want to cause downtime. So I am checking here to make sure I am on the correct path.

Thank you for the help!

Rich
(Update 1.) Looks like the rotation is working fine; the certs are just always shown in red, like @RegisHubelia had mentioned. I backed up the certs before I deleted them with kubectl get secrets -n harvester-system -l app.kubernetes.io/component=kubevirt -oyaml > ./path/to/backup.yaml, then I deleted them and they came back immediately.
So now for a rollout of canal instead of deleting the pods? I thought it did the same thing, but I will give it a try unless anyone else has a suggestion on the errors above.
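For reference, the backup-and-delete sequence from the update above, collected in one place. This is a sketch based on the commands already quoted in this thread; the backup filename is arbitrary:

# Back up the KubeVirt cert secrets, delete them, and watch them get recreated.
kubectl get secrets -n harvester-system \
  -l app.kubernetes.io/component=kubevirt -o yaml > ./kubevirt-secrets-backup.yaml
kubectl delete secrets -n harvester-system -l app.kubernetes.io/component=kubevirt
kubectl get secrets -n harvester-system -l app.kubernetes.io/component=kubevirt -w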

Hi @richevanscybermyte ,

To be fair - I had an issue on my side where some VMs were having problems when creating backups - and after I rotated the certificates it seems like the backups are working again - but I'm not sure it is related. So I'd suggest you go ahead and use option 2 to renew your certificates and see if it actually changes anything... It might be that the time isn't right somewhere and the certs are considered expired if time is not synced correctly..?

The other thing you could check is the logs in the rke2-multus-XXXXX and rke2-canal-XXXX pods and see if there are any pointers or errors - this might point you down the right path... Also check the virt-handler and virt-controller pods for logs/errors.
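A sketch of how those logs could be pulled. The namespaces, workload names, and labels below are assumptions based on a typical RKE2/Harvester layout (check kubectl get pods -A for the exact names):

# CNI side: multus and canal usually run in kube-system on RKE2.
kubectl -n kube-system logs daemonset/rke2-multus --tail=100
kubectl -n kube-system logs -l k8s-app=canal --all-containers --tail=100
# KubeVirt side: virt-handler and virt-controller run in harvester-system.
kubectl -n harvester-system logs -l kubevirt.io=virt-handler --tail=100
kubectl -n harvester-system logs -l kubevirt.io=virt-controller --tail=100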

Rich: (Update 1.) Looks like the rotation is working fine; the certs are just always shown in red, like @RegisHubelia had mentioned. I backed up the certs before I deleted them with kubectl get secrets -n harvester-system -l app.kubernetes.io/component=kubevirt -oyaml > ./path/to/backup.yaml, then I deleted them and they came back immediately. So now for a rollout of canal instead of deleting the pods? I thought it did the same thing, but I will give it a try unless anyone else has a suggestion on the errors above.

I wouldn't do that, actually... Check the logs to see where the issue comes from - as I would suspect that rolling out the canal/multus pods could create network issues and potential downtime - I might be wrong though...

@RegisHubelia Too late, already done. I rolled out multus (sketched below) and it resolved the Unauthorized issue, but now the error I get back is:

FailedCreatePodSandBox Pod backing-image-ds-image-d7wz9 Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded

This gives even less to go on as an error message. It looks like Longhorn now, but I'm unsure. I will take a look at the logs from Longhorn and update the thread.

Update
Additional log:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to reserve sandbox name "backing-image-ds-image-d7wz9_longhorn-system_7d10689c-efba-443a-af69-1f24d4480309_6": name "backing-image-ds-image-d7wz9_longhorn-system_7d10689c-efba-443a-af69-1f24d4480309_6" is reserved for "edb7935485085c99f616d824c9a33ec6842e957835d5ab7d53aa3bb92e9d04c2"
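For reference, the multus rollout mentioned above can be done with a daemonset restart. A sketch, assuming the daemonset is named rke2-multus and lives in kube-system (as on a typical RKE2 install):

# Restart the multus daemonset and wait until all pods are back.
kubectl -n kube-system rollout restart daemonset/rke2-multus
kubectl -n kube-system rollout status daemonset/rke2-multus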

Can you check if you have a pod named "backing-image-ds-image-d7wz9"?

kubectl get pods -n longhorn-system | grep backing-image-ds-image-d7wz9

Seems like the name is actually taken already - so it cannot use the same name. It might be a stale pod...

I deleted the image because it wasn't downloading. I will recreate another so I can get the error messages again.

It does it when I create VMs too:

FailedAttachVolume	Pod virt-launcher-ham-sandwich-sr8qh	AttachVolume.Attach failed for volume "pvc-d602c005-9b3b-4d19-9b05-1b23bef17ad1" : rpc error: code = DeadlineExceeded desc = volume pvc-d602c005-9b3b-4d19-9b05-1b23bef17ad1 failed to attach to node virtwrkr03	virt-launcher-ham-sandwich-sr8qh.17cf905e331099fd

I am not sure what is happening. When I log into Longhorn everything is green, but it looks like Harvester cannot talk to storage. Yet current VMs (I have about 50) are not affected: they power up and can talk to storage. It is only new VMs or new images.
This is a problem because I need to make new VMs and spin up new RKE2 clusters (from Rancher).
Like I said, I thought this was a certificate issue, but it is not.

Do you see the newly created volume in Longhorn? Are all the nodes in Longhorn schedulable?

Everything in Longhorn looks good. No errors or any other issues.
Screenshot from 2024-05-15 11-56-57

And do you see the new volume in the volume tab? Can you also screenshot the Node tab?

Yes, the PVC shows up in the Volumes tab. Attached is a screenshot of the Nodes tab.
Screenshot from 2024-05-15 12-29-43

Can you show me the current status of the volume - in the Volume tab, click the volume? Is it currently attached? If not, does it work if you try to manually attach it?

So, I figured it out but now I have to fix some old Storage Classes.

The StorageClass I was using, let's call it "worker-nvme", was the default storage class. It has worked for over a year. I had to replace a node recently, and I gave the rebuilt node the same labels for host and drives under storage. For some reason that is no longer working. I would just delete the SC and rebuild it, but there are a ton of other machines using that SC. Any suggestions? I am going to dig into the CRDs and see if there is any YAML I can check that could have the wrong node UUIDs or something.

@RegisHubelia thank you for sticking around for me. I will follow up here when I figure out how to adjust the SC so it works again.

-Rich

Hmm... I remember having issues when changing the default storage class in the past, and reverted to the Harvester default. So I kept the default, and I manually specify the storage class I want to use when creating guest clusters or other VMs. I'd suggest you do the same.

As for your issue, can you show the storage class configuration (of course, remove sensitive information)? Are there specific node/disk selectors on your SC?

There are selectors in the StorageClasses. Here are the two:
Old - No longer works
Screenshot from 2024-05-15 13-50-40

And the New SC that does work:
Screenshot from 2024-05-15 13-49-07

They literally have the exact same parameters. I can give the old (worker-node-nvme) a go and see if something just shook loose. Never know.

Can you provide the YAML? That would show everything - there might be something different that isn't in the screenshot.

allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    field.cattle.io/description: Storage hosted within the worker nodes
    storageclass.beta.kubernetes.io/is-default-class: 'false'
    storageclass.kubernetes.io/is-default-class: 'false'
  creationTimestamp: '2023-05-08T12:31:34Z'
  managedFields:
    - apiVersion: storage.k8s.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:allowVolumeExpansion: {}
        f:metadata:
          f:annotations:
            .: {}
            f:field.cattle.io/description: {}
        f:parameters:
          .: {}
          f:diskSelector: {}
          f:migratable: {}
          f:nodeSelector: {}
          f:numberOfReplicas: {}
          f:staleReplicaTimeout: {}
        f:provisioner: {}
        f:reclaimPolicy: {}
        f:volumeBindingMode: {}
      manager: harvester
      operation: Update
      time: '2023-05-08T12:31:34Z'
    - apiVersion: storage.k8s.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            f:storageclass.beta.kubernetes.io/is-default-class: {}
            f:storageclass.kubernetes.io/is-default-class: {}
      manager: Mozilla
      operation: Update
      time: '2023-05-08T12:31:53Z'
  name: worker-node-nvme
  resourceVersion: '790933475'
  uid: 653eb517-d8ec-46ec-b4ed-e1dc004b29c8
parameters:
  diskSelector: nvme
  migratable: 'true'
  nodeSelector: tenant
  numberOfReplicas: '3'
  staleReplicaTimeout: '30'
provisioner: driver.longhorn.io
reclaimPolicy: Delete
volumeBindingMode: Immediate

New - Working

allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: 'true'
    storageclass.kubernetes.io/is-default-class: 'true'
  creationTimestamp: '2024-05-15T16:51:01Z'
  managedFields:
    - apiVersion: storage.k8s.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:allowVolumeExpansion: {}
        f:parameters:
          .: {}
          f:diskSelector: {}
          f:migratable: {}
          f:nodeSelector: {}
          f:numberOfReplicas: {}
          f:staleReplicaTimeout: {}
        f:provisioner: {}
        f:reclaimPolicy: {}
        f:volumeBindingMode: {}
      manager: harvester
      operation: Update
      time: '2024-05-15T16:51:01Z'
    - apiVersion: storage.k8s.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:storageclass.beta.kubernetes.io/is-default-class: {}
            f:storageclass.kubernetes.io/is-default-class: {}
      manager: Mozilla
      operation: Update
      time: '2024-05-15T16:51:14Z'
  name: worker-nvme-local
  resourceVersion: '790933476'
  uid: 4a677fa4-a47f-467a-8953-7ba900e92f22
parameters:
  diskSelector: nvme
  migratable: 'true'
  nodeSelector: tenant
  numberOfReplicas: '3'
  staleReplicaTimeout: '30'
provisioner: driver.longhorn.io
reclaimPolicy: Delete
volumeBindingMode: Immediate

I saw this error earlier, but now that we are narrowing down on the SC it seems more relevant - like there is a selector missing or not matching anything (see the tag-check sketch after the event below).

Pod virt-launcher-spam-sandwich-lhsgv
0/7 nodes are available: 7 pod has unbound immediate PersistentVolumeClaims. preemption: 0/7 nodes are available: 7 Preemption is not helpful for scheduling.
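Since both StorageClasses select on diskSelector: nvme and nodeSelector: tenant, one thing worth cross-checking is whether the Longhorn node and disk tags on the rebuilt node still carry those values. A sketch using Longhorn's node CRD (field paths may differ slightly between Longhorn versions):

# Node-level tags (must include "tenant" for the nodeSelector to match).
kubectl -n longhorn-system get nodes.longhorn.io \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.tags}{"\n"}{end}'
# Disk-level tags (must include "nvme" for the diskSelector to match).
kubectl -n longhorn-system get nodes.longhorn.io \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.disks}{"\n"}{end}'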

I have solved my issue. I deleted the original StorageClass and recreated it with the same name and same values as before. The machines that were not coming online because of migration or creation I restarted, and now they are up.
Deleted the worker-node-nvme StorageClass and immediately recreated it exactly as the original.

Edit
Thank you @RegisHubelia for all of your help. If you had not kept me digging I don't know that I would have found the fix.

TL;DR: Somehow the StorageClass was not selecting any storage. I deleted and then recreated the StorageClass exactly like the original and everything came up.
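For anyone needing to do the same delete-and-recreate, a sketch of the sequence. StorageClass parameters are immutable in Kubernetes, so recreating under the same name is the usual approach, and existing bound volumes keep working since they only reference the class by name; the backup filename is arbitrary:

# 1. Save a copy of the existing StorageClass.
kubectl get sc worker-node-nvme -o yaml > worker-node-nvme-sc.yaml
# 2. Edit the copy and strip server-side fields (uid, resourceVersion,
#    creationTimestamp, managedFields) before reapplying.
# 3. Delete and recreate it under the same name.
kubectl delete sc worker-node-nvme
kubectl apply -f worker-node-nvme-sc.yaml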

Glad you found a workaround and that I could somehow help. Still seems a bit strange that it stopped working all of a sudden...

Maybe too late to ask, but I was wondering, is it possible to upload the support bundles from your cluster? @RegisHubelia @richevanscybermyte It might be helpful for understanding the root cause of the errors you mentioned in this issue. Many thanks 😃

Happy to provide it - I created one. How can I get it to you securely? I can upload it to an encrypted disk and provide the link - but I'd rather not do that publicly...

Hi @RegisHubelia, could you email the link to cooper.tseng@suse.com? Many thanks!