kubernetes / kubernetes

Production-Grade Container Scheduling and Management

Home Page:https://kubernetes.io

Volumeattachment deletion in a detach operation should carry the resourceVersion

pandaamanda opened this issue

What happened?

A case in our flow test creates a pod that uses a PVC, waits about 2 minutes, and then deletes the pod. We later discovered that the PV referenced by the pod was still attached to the node and had never been detached.

Combining the CSI plugin logs with the Kubernetes component logs, we found that the CSI plugin took a long time to attach the volume; the attach succeeded very late, and the plugin then patched finalizers onto the VolumeAttachment (VA) resource. At the same time, deleting the pod triggered the Kubernetes detach operation, which deletes the VolumeAttachment resource.

Because there are multiple apiserver instances, the apiserver that receives the VolumeAttachment delete request is not aware that finalizers have been patched onto the VA, so the delete succeeds and the VA is removed.

csi plugin log:

opdisk_sts_attacher.log:I0319 16:58:00.344601       1 round_trippers.go:454] PATCH https://123.123.0.1:443/apis/storage.k8s.io/v1/volumeattachments/csi-a9400fe6be648868b80a0e012f8aa726c23455a934926e22f1e70371fdee2cea 200 OK in 23 milliseconds

opdisk_sts_attacher.log:I0319 17:00:08.945137       1 csi_handler.go:275] Attached "csi-a9400fe6be648868b80a0e012f8aa726c23455a934926e22f1e70371fdee2cea"

opdisk_sts_attacher.log:I0319 17:00:08.945141       1 util.go:37] Marking as attached "csi-a9400fe6be648868b80a0e012f8aa726c23455a934926e22f1e70371fdee2cea"

opdisk_sts_attacher.log:I0319 17:00:08.949561       1 round_trippers.go:454] PATCH https://123.123.0.1:443/apis/storage.k8s.io/v1/volumeattachments/csi-a9400fe6be648868b80a0e012f8aa726c23455a934926e22f1e70371fdee2cea/status 404 Not Found in 4 milliseconds

opdisk_sts_attacher.log:I0319 17:00:08.949646       1 csi_handler.go:236] Error processing "csi-a9400fe6be648868b80a0e012f8aa726c23455a934926e22f1e70371fdee2cea": failed to mark as attached: volumeattachments.storage.k8s.io "csi-a9400fe6be648868b80a0e012f8aa726c23455a934926e22f1e70371fdee2cea" not found

opdisk_sts_attacher.log:I0319 17:01:25.093522       1 controller.go:198] Started VA processing "csi-a9400fe6be648868b80a0e012f8aa726c23455a934926e22f1e70371fdee2cea"

opdisk_sts_attacher.log:I0319 17:01:25.093538       1 controller.go:205] VA "csi-a9400fe6be648868b80a0e012f8aa726c23455a934926e22f1e70371fdee2cea" deleted, ignoring

kube-controller-manager log:

kube-controller-manager.klog:I0319 16:58:00.333212      11 operation_generator.go:1665] Verified volume is safe to detach for volume "pvc-03cb8ebe-d2b5-410c-9db5-0a6b131d8f03" (UniqueName: "kubernetes.io/csi/opdisk.csi.openpalette.org^9df25125-5cb6-4965-9f88-bbceb277224a") on node "minion-0-0"

kube-controller-manager.klog:I0319 16:58:00.885712      11 operation_generator.go:526] DetachVolume.Detach succeeded for volume "pvc-03cb8ebe-d2b5-410c-9db5-0a6b131d8f03" (UniqueName: "kubernetes.io/csi/opdisk.csi.openpalette.org^9df25125-5cb6-4965-9f88-bbceb277224a") on node "minion-0-0"

What did you expect to happen?

The concurrent finalizer patch and VA deletion should be handled so that the VA is only deleted when it is safe to do so.

In pkg/volume/csi/csi_attacher.go, we can use c.plugin.volumeAttachmentLister.Get(attachID) to get the VA's resourceVersion and pass it to the Delete call below as a precondition:

c.k8s.StorageV1().VolumeAttachments().Delete(context.TODO(), attachID, metav1.DeleteOptions{Preconditions: &metav1.Preconditions{ResourceVersion: &resourceVersion}})

How can we reproduce it (as minimally and precisely as possible)?

The issue reproduces only with low probability.

Anything else we need to know?

No response

Kubernetes version

v1.28.3

Cloud provider

no

/sig apps

/sig storage

@pandaamanda Which version of external-attacher are you using?

I'm not sure of the exact version, but I looked at the latest external-attacher code, and its handling of the VA deletion event does not account for this concurrency.

/assign @jsafrane
please take a look, thank you.

/triage accepted

/unassign
@pandaamanda since you already wrote a code snippet in the issue description, can you please file a PR?

sure.