Issue: problem with reclaimPolicy Delete
ccaillet1974 opened this issue · comments
Hi all,
As mentioned in my previous issue, I created a StorageClass with reclaimPolicy "Delete", but some PVs are not being deleted and remain in "Released" status when running kubectl get pv:
[CORE-LYO0][totof@lyo0-k8s-admin00:~]$ kubectl get pv | grep -v Bound
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-06244827-4d95-4c14-a5b9-7dba47702563 100Gi RWO Delete Released cdn-bigdata/elasticsearch-data-sophie-mon-es-es-1 oc6810-hdd-iscsi 2d21h
pvc-0e708b36-3af1-4981-8d77-a9b46aafee41 2000Gi RWO Delete Terminating default/dbench0-pv-claim sc-bench 3d1h
pvc-7a1004ce-185e-4602-985c-b855702c1488 100Gi RWO Delete Terminating cdn-bigdata/elasticsearch-data-sophie-mon-es-es-0 oc6810-hdd-iscsi 2d21h
pvc-d67c3a49-2231-4542-9b87-c0b560400493 500Gi RWO Delete Released default/dbench-pv-claim sc-bench 3d17h
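For context, the StorageClass in question looks roughly like the following sketch (the provisioner name is an assumption on my part, and the driver-specific parameters are omitted; neither is taken from this thread):

```yaml
# Minimal sketch of a StorageClass with reclaimPolicy Delete.
# provisioner is an assumption; driver-specific parameters
# (backend, pool, volumeType, ...) are omitted.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: oc6810-hdd-iscsi
provisioner: csi.huawei.com
reclaimPolicy: Delete
```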
The ones in "Terminating" status are due to an action on my part: I ran the command kubectl delete pv <pv_name>, but it had no effect, as you can see.
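For what it's worth, when kubectl delete pv leaves a PV in Terminating, the object is usually held by a finalizer that is only cleared once the CSI driver detaches the volume successfully. One way to inspect it (the jsonpath form is what you would run on the cluster; the JSON is simulated here so the extraction step can be shown standalone):

```shell
# On the cluster:
#   kubectl get pv pvc-0e708b36-3af1-4981-8d77-a9b46aafee41 -o jsonpath='{.metadata.finalizers}'
# Simulated, trimmed PV object so the extraction can run anywhere:
pv_json='{"metadata":{"name":"pvc-0e708b36-3af1-4981-8d77-a9b46aafee41","finalizers":["kubernetes.io/pv-protection"]}}'
printf '%s' "$pv_json" | python3 -c 'import json,sys; print(" ".join(json.load(sys.stdin)["metadata"]["finalizers"]))'
# -> kubernetes.io/pv-protection
```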
I get the following log entries when the CSI driver tries to unstage the volume:
2023-06-12 08:24:03.962158 824169 [INFO]: [requestID:3854352599] ISCSI Start to disconnect volume ==> volume wwn is: 6a8ffba1005d5c410241c40a0000000a
2023-06-12 08:24:03.962221 824169 [INFO]: [requestID:3854352599] WaitGetLock start to get lock
2023-06-12 08:24:03.962376 824169 [INFO]: [requestID:3854352599] WaitGetLock finish to get lock
2023-06-12 08:24:03.962458 824169 [INFO]: [requestID:3854352599] Before acquire, available permits is 4
2023-06-12 08:24:03.962530 824169 [INFO]: [requestID:3854352599] After acquire, available permits is 3
2023-06-12 08:24:03.962598 824169 [INFO]: [requestID:3854352599] It took 386.759µs to acquire disConnect lock for 6a8ffba1005d5c410241c40a0000000a.
2023-06-12 08:24:03.962712 824169 [INFO]: [requestID:3854352599] Gonna run shell cmd "ls -l /dev/disk/by-id/ | grep 6a8ffba1005d5c410241c40a0000000a".
2023-06-12 08:24:03.970912 824169 [INFO]: [requestID:3854352599] Shell cmd "ls -l /dev/disk/by-id/ | grep 6a8ffba1005d5c410241c40a0000000a" result:
lrwxrwxrwx 1 root root 10 Jun 9 12:03 dm-name-36a8ffba1005d5c410241c40a0000000a -> ../../dm-0
lrwxrwxrwx 1 root root 10 Jun 9 12:03 dm-uuid-mpath-36a8ffba1005d5c410241c40a0000000a -> ../../dm-0
lrwxrwxrwx 1 root root 10 Jun 9 14:37 scsi-36a8ffba1005d5c410241c40a0000000a -> ../../dm-0
lrwxrwxrwx 1 root root 10 Jun 9 14:37 wwn-0x6a8ffba1005d5c410241c40a0000000a -> ../../dm-0
2023-06-12 08:24:03.971113 824169 [INFO]: [requestID:3854352599] Gonna run shell cmd "ls -l /dev/mapper/ | grep -w dm-0".
2023-06-12 08:24:03.978442 824169 [INFO]: [requestID:3854352599] Shell cmd "ls -l /dev/mapper/ | grep -w dm-0" result:
lrwxrwxrwx 1 root root 7 Jun 9 12:03 36a8ffba1005d5c410241c40a0000000a -> ../dm-0
2023-06-12 08:24:03.978688 824169 [ERROR]: [requestID:3854352599] Can not get DMDevice by alias: dm-0
2023-06-12 08:24:03.978763 824169 [ERROR]: [requestID:3854352599] Get DMDevice by alias:dm-0 failed. error: Can not get DMDevice by alias: dm-0
2023-06-12 08:24:03.978828 824169 [ERROR]: [requestID:3854352599] check device: dm-0 is a partition device failed. error: Get DMDevice by alias:dm-0 failed. error: Can not get DMDevice by alias: dm-0
2023-06-12 08:24:03.978894 824169 [ERROR]: [requestID:3854352599] Get device of WWN 6a8ffba1005d5c410241c40a0000000a error: check device: dm-0 is a partition device failed. error: Get DMDevice by alias:dm-0 failed. error: Can not get DMDevice by alias: dm-0
2023-06-12 08:24:03.978985 824169 [INFO]: [requestID:3854352599] Before release, available permits is 3
2023-06-12 08:24:03.979044 824169 [INFO]: [requestID:3854352599] After release, available permits is 4
2023-06-12 08:24:03.979098 824169 [INFO]: [requestID:3854352599] DeleteLockFile start to get lock
2023-06-12 08:24:03.979152 824169 [INFO]: [requestID:3854352599] DeleteLockFile finish to get lock
2023-06-12 08:24:03.979281 824169 [INFO]: [requestID:3854352599] It took 295.885µs to release disConnect lock for 6a8ffba1005d5c410241c40a0000000a.
2023-06-12 08:24:03.979349 824169 [ERROR]: [requestID:3854352599] disconnect volume failed while unstage volume, wwn: 6a8ffba1005d5c410241c40a0000000a, error: check device: dm-0 is a partition device failed. error: Get DMDevice by alias:dm-0 failed. error: Can not get DMDevice by alias: dm-0
2023-06-12 08:24:03.979424 824169 [ERROR]: [requestID:3854352599] UnStage volume pvc-06244827-4d95-4c14-a5b9-7dba47702563 error: check device: dm-0 is a partition device failed. error: Get DMDevice by alias:dm-0 failed. error: Can not get DMDevice by alias: dm-0
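The failing step can be traced in the log output above: the /dev/mapper entry is named by WWID (3<wwn>) because user_friendly_names is off, and the error suggests the driver cannot map that name back to a DM device (this reading of the check is an assumption). Extracting the name the driver saw from the listed symlink:

```shell
# /dev/mapper line copied from the log above; the symlink name is the
# field just before the '->' arrow.
mapper_line='lrwxrwxrwx 1 root root 7 Jun  9 12:03 36a8ffba1005d5c410241c40a0000000a -> ../dm-0'
dm_name=$(printf '%s\n' "$mapper_line" | awk '{print $(NF-2)}')
echo "$dm_name"
# -> 36a8ffba1005d5c410241c40a0000000a
```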
Thanks in advance for your replies.
Regards,
Christophe
Can I see your configuration file? The path is as follows:
[root@node ~]# vi /etc/multipath.conf
No configuration is defined; I'm using the default multipathd configuration. Let me know which multipath command to run if you want more information from a worker node.
My workers are installed as follows:
- Debian 11
- Kernel 5.10.23
- k8s : 1.25.6
- CSI version 4.0.0
I also attach the multipath configuration obtained with the command multipathd list config.
Regards
multipathd-config.txt
For details, see section 3.6 in the user guide:
3.6 Checking the Host Multipathing Configuration
Modify the configuration, restart the UltraPath software, and create a pod again.
First, I DON'T USE the UltraPath software; I use the Debian multipathd package: multipath-tools 0.8.5-2+deb11u1 amd64.
Multipathing is correctly configured on the 6810 (load-balanced mode), and the worker nodes use the native multipathd software, according to the output of the command multipath -ll (results as follows):
[CORE-LYO0][totof@lyo0-k8s-ppw01:~]$ sudo multipath -ll
[sudo] password for totof:
Jun 12 11:52:59 | sdc: prio = const (setting: emergency fallback - alua failed)
Jun 12 11:52:59 | sdd: prio = const (setting: emergency fallback - alua failed)
Jun 12 11:52:59 | sde: prio = const (setting: emergency fallback - alua failed)
Jun 12 11:52:59 | sdf: prio = const (setting: emergency fallback - alua failed)
36a8ffba1005d5c410241c40a0000000a dm-0 HUAWEI,XSG1
size=100G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
|- 7:0:0:1 sdc 8:32 active ready running
|- 10:0:0:1 sdd 8:48 active ready running
|- 9:0:0:1 sde 8:64 active ready running
`- 8:0:0:1 sdf 8:80 active ready running
Jun 12 11:52:59 | sdg: prio = const (setting: emergency fallback - alua failed)
Jun 12 11:52:59 | sdi: prio = const (setting: emergency fallback - alua failed)
Jun 12 11:52:59 | sdj: prio = const (setting: emergency fallback - alua failed)
Jun 12 11:52:59 | sdh: prio = const (setting: emergency fallback - alua failed)
36a8ffba1005d5c41024401da0000000e dm-1 HUAWEI,XSG1
size=100G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
|- 8:0:0:2 sdg 8:96 active ready running
|- 10:0:0:2 sdi 8:128 active ready running
|- 9:0:0:2 sdj 8:144 active ready running
`- 7:0:0:2 sdh 8:112 active ready running
If you use the native multipathing software provided by the OS, check whether
the /etc/multipath.conf file contains the following configuration item.
defaults {
    user_friendly_names yes
    find_multipaths no
}
If the configuration item does not exist, add it to the beginning of the /etc/multipath.conf file.
You can try this change; the root cause is that the dm name does not pass the driver's name verification.
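After modifying the file and restarting multipathd, the active configuration can be verified with multipathd show config; here the grep is run against a sample of the expected defaults section so it works standalone:

```shell
# On a node:  multipathd show config | grep -E 'user_friendly_names|find_multipaths'
# Sample of the expected defaults section:
config='defaults {
    user_friendly_names yes
    find_multipaths no
}'
printf '%s\n' "$config" | grep -E 'user_friendly_names|find_multipaths'
```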
So I copied the multipathd-config.txt file attached earlier to /etc/multipath/multipath.conf, added the directive user_friendly_names yes, changed find_multipaths from "strict" to "no", and restarted the multipathd daemon.
And now the volumes have been destroyed on my worker nodes.
After that, I deleted the PV entry in Kubernetes with kubectl delete pv pvc-0e708b36-3af1-4981-8d77-a9b46aafee41.
Now I need to:
1- Propagate the configuration file to all worker nodes
2- Check the LUNs so that all remaining volumes are destroyed correctly
3- Clean up all remaining volumes on k8s that need to be destroyed
4- Test that the problem doesn't occur any more
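Step 1- can be scripted; a dry-run sketch (the second node name is hypothetical; remove the echo prefixes to actually execute):

```shell
# Dry run: echo prints the commands instead of running them.
# lyo0-k8s-ppw01 appears earlier in this thread; lyo0-k8s-ppw02 is a
# hypothetical second worker.
for node in lyo0-k8s-ppw01 lyo0-k8s-ppw02; do
  echo scp /etc/multipath/multipath.conf "$node":/etc/multipath/multipath.conf
  echo ssh "$node" sudo systemctl restart multipathd
done
```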
I'll keep you informed.
Regards
Okay, let me know if the problem is solved.
Deletion seems to be working now... all PVs listed in my first post have disappeared from Kubernetes; now I have only the following PV list on my cluster:
[CORE-LYO0][totof@lyo0-k8s-admin00:~]$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-161512fb-f566-4844-bba0-1f8a0f45f9e5 200Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-int-es-data-cold-0 lyo0-oc9k-nfs 142d
pvc-1dc3811c-120e-4d38-a720-b5d5022c3cbb 4Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-int-es-transforms-1 lyo0-oc9k-nfs 95d
pvc-20cce747-b4e9-4332-8e1c-883d5dbbe518 200Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-int-es-ingest-data-hot-2 oc5k-fs 130d
pvc-27815c93-6ac8-4959-9834-944fd1b35bb3 100Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-mon-es-data-1 oc6810-hdd-iscsi 3d
pvc-2a08bb21-d443-4b16-8858-2be181ed7963 200Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-int-es-ingest-data-hot-5 oc5k-fs 130d
pvc-2f4426ae-ccd8-461b-a433-0b5967fd6cec 300Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-int-es-data-warm-0 oc5k-fs 142d
pvc-3867ad01-c9e4-46ef-9a58-d128fb33f998 200Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-int-es-ingest-data-hot-1 oc5k-fs 130d
pvc-45926fe3-79de-48c4-ab42-7d3622b4b2b1 200Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-int-es-ingest-data-hot-0 oc5k-fs 130d
pvc-49043782-1d7a-4de1-b75f-ffcdb1351607 8Gi RWO Delete Bound kube-dbs/data-lyo0-sh-psql-nfs-postgresql-primary-0 lyo0-oc5k-nfs 446d
pvc-4b307039-6fe7-47ae-ac10-2b0ac722e125 300Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-int-es-data-warm-1 oc5k-fs 142d
pvc-5820d5a6-3987-4e28-b5fa-838f9ae78555 300Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-int-es-data-warm-3 oc5k-fs 91d
pvc-5a2644ab-285f-4608-b2c8-6307b88922d2 8Gi RWO Delete Bound kube-dbs/data-lyo0-sh-psql-nfs-postgresql-read-0 lyo0-oc5k-nfs 446d
pvc-5d3e2536-df0b-4570-8588-0ff47f970f42 4Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-int-es-master-1 lyo0-oc9k-nfs 142d
pvc-649c2d62-83f8-488c-b341-4278143229e6 8Gi RWO Delete Bound kube-dbs/redis-data-lyo0-sh-redis-nfs-node-1 lyo0-oc5k-nfs 446d
pvc-6c3d82b0-b994-4212-8833-cf4f5c88ee8b 100Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-mon-es-data-0 oc6810-hdd-iscsi 3d
pvc-6d9b0f9c-afce-4377-a7e7-c3882e6377aa 10Gi RWO Delete Bound kube-dbs/data-lyo0-redis-shared-redis-ha-server-0 lyo0-oc5k-nfs 445d
pvc-70952380-f9fe-42e2-b272-65ed270e8005 100Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-mon-es-data-2 oc6810-hdd-iscsi 3d
pvc-7741ede6-bf31-420b-83b3-d37f3fe0f3e5 4Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-int-es-master-2 lyo0-oc9k-nfs 142d
pvc-84aa2be8-8ee1-4982-86c2-322eeb6398d4 10Gi RWO Delete Bound kube-monitoring/lyo0-prom-int-grafana lyo0-oc5k-nfs 102d
pvc-87b87724-1fe4-4585-8c69-d8bb4d763847 200Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-int-es-ingest-data-hot-4 oc5k-fs 130d
pvc-87fa68e4-184a-48a2-b433-0699e0ad12b8 10Gi RWO Delete Bound kube-dbs/data-lyo0-redis-shared-redis-ha-server-2 lyo0-oc5k-nfs 445d
pvc-8f098f5f-3230-4c36-b74e-3e52ae0a8716 200Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-int-es-data-cold-3 lyo0-oc9k-nfs 142d
pvc-91a98c16-33dc-48e4-b331-48a00c7a4c1d 8Gi RWO Delete Bound kube-dbs/redis-data-lyo0-sh-redis-nfs-node-2 lyo0-oc5k-nfs 446d
pvc-b8fac19d-be9d-4eca-873e-25b5eb2a39f3 8Gi RWO Delete Bound kube-dbs/redis-data-lyo0-sh-redis-nfs-node-0 lyo0-oc5k-nfs 446d
pvc-d6f65be7-0697-4aef-8b42-89a6256e1ee7 200Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-int-es-ingest-data-hot-3 oc5k-fs 130d
pvc-d9814d86-c963-4fd9-87bc-0ef8efe0caf0 10Gi RWO Delete Bound kube-dbs/data-lyo0-redis-shared-redis-ha-server-1 lyo0-oc5k-nfs 445d
pvc-d9f1499d-566a-4c29-8e25-ac42bb58d38e 4Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-int-es-master-0 lyo0-oc9k-nfs 142d
pvc-dd42e771-a5f0-4708-8a71-d94297908d07 200Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-int-es-data-cold-2 lyo0-oc9k-nfs 142d
pvc-df52d229-069e-44c3-89a1-ebd9e01c6024 200Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-int-es-data-cold-1 lyo0-oc9k-nfs 142d
pvc-dfd863d7-8c83-4c25-8b38-83344053cc3a 4Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-int-es-transforms-0 lyo0-oc9k-nfs 95d
pvc-e42c5c75-9f93-4ae1-84ee-be5794453f71 300Gi RWO Delete Bound cdn-bigdata/elasticsearch-data-sophie-int-es-data-warm-2 oc5k-fs 142d
pvc-f364a229-5a8f-4d3c-af09-1f287ac800c7 1Gi RWX Delete Bound cdn-tools/lyo0-netbox-media lyo0-oc5k-nfs 445d
I'm checking that deletion keeps working well... I also need to check whether the problem described in issue #133 is solved with this configuration. If it is, maybe you should document the multipath configuration parameters needed by Huawei CSI when the OS multipathd is used?
I'll keep you in touch
EDIT 1: Is find_multipaths "no" mandatory, or will the default Debian value ("strict") work?
Regards
find_multipaths "no" is mandatory. For details, see section 3.6 in the CSI user guide:
https://github.com/Huawei/eSDK_K8S_Plugin/blob/V4.0/docs/eSDK%20Huawei%20Storage%20Kubernetes%20CSI%20Plugins%20V4.0.0%20User%20Guide%2001.pdf
Thanks for your reply
Regards.
You are welcome.