AttachVolume Failed for pod after pod recreated and stuck in init state
Rammurthy5 opened this issue
Bug Report
A stateful workload pod was deleted and was recreated as expected, but it has been stuck in the Init state for a long time. Describing the pod shows that attaching the volume failed.
Description
A Postgres DB workload is running on the worker nodes, and the pod was deleted to check whether it would be recreated. It was recreated, but attaching the PVC got stuck.
Logs
Warning FailedAttachVolume 25s (x2 over 26s) attachdetach-controller AttachVolume.Attach failed for volume "pvc-556db03a-f0d2-46ed-b75c-b33d386c53d5" : rpc error: code = Internal desc = Operation failed: GenericOperation(500, "error in response: status code '500 Internal Server Error', content: 'RestJsonError { details: \"create_nexus::status: Internal, message: \\\"Failed to acquire write exclusive reservation on child nvmf://10.0.0.40:8420/nqn.2019-05.io.openebs:6d273a89-387a-4918-ada5-8bd21d6fea30?uuid=6d273a89-387a-4918-ada5-8bd21d6fea30 of nexus 556db03a-f0d2-46ed-b75c-b33d386c53d5: Failed to register key for child: NVMe IO Passthru command dh failed: NVMe IO Passthru command dh failed\\\", details: [], metadata: MetadataMap { headers: {\\\"content-type\\\": \\\"application/grpc\\\", \\\"date\\\": \\\"Tue, 18 Jun 2024 15:12:56 GMT\\\", \\\"content-length\\\": \\\"0\\\"} }\", message: \"SvcError::GrpcRequestError\", kind: Internal }'")
Warning FailedAttachVolume 16s (x3 over 23s) attachdetach-controller AttachVolume.Attach failed for volume "pvc-556db03a-f0d2-46ed-b75c-b33d386c53d5" : rpc error: code = Internal desc = Operation failed: GenericOperation(500, "error in response: status code '500 Internal Server Error', content: 'RestJsonError { details: \"create_nexus::status: Internal, message: \\\"Failed to acquire write exclusive reservation on child nvmf://10.0.0.40:8420/nqn.2019-05.io.openebs:6d273a89-387a-4918-ada5-8bd21d6fea30?uuid=6d273a89-387a-4918-ada5-8bd21d6fea30 of nexus 556db03a-f0d2-46ed-b75c-b33d386c53d5: Failed to register key for child: NVMe IO Passthru command dh failed: NVMe IO Passthru command dh failed\\\", details: [], metadata: MetadataMap { headers: {\\\"content-type\\\": \\\"application/grpc\\\", \\\"date\\\": \\\"Tue, 18 Jun 2024 15:12:58 GMT\\\", \\\"content-length\\\": \\\"0\\\"} }\", message: \"SvcError::GrpcRequestError\", kind: Internal }'")
Warning FailedMount 11s (x6 over 26s) kubelet MountVolume.SetUp failed for volume "shm" : mount failed: signal: segmentation fault
Mounting command: mount
Mounting arguments: -t tmpfs -o size=1073741824 tmpfs /var/lib/kubelet/pods/52bee99a-a48a-4b58-b996-e062b960a69e/volumes/kubernetes.io~empty-dir/shm
Output:
Warning FailedMount 11s (x6 over 26s) kubelet MountVolume.SetUp failed for volume "app-secret" : mount failed: signal: segmentation fault
Mounting command: mount
Mounting arguments: -t tmpfs -o size=1073741824 tmpfs /var/lib/kubelet/pods/52bee99a-a48a-4b58-b996-e062b960a69e/volumes/kubernetes.io~secret/app-secret
Output:
Warning FailedMount 11s (x6 over 26s) kubelet MountVolume.SetUp failed for volume "scratch-data" : assign quota FAILED createProjectID /var/lib/kubelet/pods/52bee99a-a48a-4b58-b996-e062b960a69e/volumes/kubernetes.io~empty-dir/scratch-data 0 failed unable to run xfs_quota: signal: segmentation fault
Warning FailedMount 11s (x6 over 26s) kubelet MountVolume.SetUp failed for volume "kube-api-access-bqn5m" : mount failed: signal: segmentation fault
Mounting command: mount
Mounting arguments: -t tmpfs -o size=1073741824 tmpfs /var/lib/kubelet/pods/52bee99a-a48a-4b58-b996-e062b960a69e/volumes/kubernetes.io~projected/kube-api-access-bqn5m
Output:
Warning FailedAttachVolume 8s attachdetach-controller AttachVolume.Attach failed for volume "pvc-556db03a-f0d2-46ed-b75c-b33d386c53d5" : rpc error: code = Internal desc = Operation failed: GenericOperation(500, "error in response: status code '500 Internal Server Error', content: 'RestJsonError { details: \"create_nexus::status: Internal, message: \\\"Failed to acquire write exclusive reservation on child nvmf://10.0.0.40:8420/nqn.2019-05.io.openebs:6d273a89-387a-4918-ada5-8bd21d6fea30?uuid=6d273a89-387a-4918-ada5-8bd21d6fea30 of nexus 556db03a-f0d2-46ed-b75c-b33d386c53d5: Failed to register key for child: NVMe IO Passthru command dh failed: NVMe IO Passthru command dh failed\\\", details: [], metadata: MetadataMap { headers: {\\\"content-type\\\": \\\"application/grpc\\\", \\\"date\\\": \\\"Tue, 18 Jun 2024 15:13:07 GMT\\\", \\\"content-length\\\": \\\"0\\\"} }\", message: \"SvcError::GrpcRequestError\", kind: Internal }'")
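The FailedAttachVolume events above wrap the storage-layer message in several layers of escaping, which makes them hard to read. As a minimal sketch (not part of the original report), the Python below pulls the volume name, child URI, nexus UUID, and innermost message out of one event. The function name `summarize_attach_error` and the regular expressions are illustrative assumptions based only on the shape of the lines shown here, not a documented log format.

```python
import re

# Assumed patterns, derived only from the event lines in this report.
# `\\*` tolerates any number of backslashes before a quote, because the
# message is re-escaped once per layer it passed through.
VOLUME_RE = re.compile(r'for volume "(?P<volume>[^"]+)"')
CHILD_RE = re.compile(
    r"on child (?P<child>nvmf://\S+) of nexus (?P<nexus>[0-9a-f-]{36})"
)
CAUSE_RE = re.compile(r'message: \\*"(?P<cause>.*?)\\*", details:')


def summarize_attach_error(event: str) -> dict:
    """Return the fields found in one event line; missing fields are omitted."""
    summary = {}
    for pattern in (VOLUME_RE, CHILD_RE, CAUSE_RE):
        match = pattern.search(event)
        if match:
            summary.update(match.groupdict())
    return summary


# One event from the log above, truncated after the inner `details: []`.
SAMPLE = (
    r'''AttachVolume.Attach failed for volume "pvc-556db03a-f0d2-46ed-b75c-b33d386c53d5" : rpc error: code = Internal '''
    r'''desc = Operation failed: GenericOperation(500, "error in response: status code '500 Internal Server Error', '''
    r'''content: 'RestJsonError { details: \"create_nexus::status: Internal, message: \\\"Failed to acquire write '''
    r'''exclusive reservation on child nvmf://10.0.0.40:8420/nqn.2019-05.io.openebs:6d273a89-387a-4918-ada5-8bd21d6fea30'''
    r'''?uuid=6d273a89-387a-4918-ada5-8bd21d6fea30 of nexus 556db03a-f0d2-46ed-b75c-b33d386c53d5: Failed to register '''
    r'''key for child: NVMe IO Passthru command dh failed: NVMe IO Passthru command dh failed\\\", details: []'''
)

if __name__ == "__main__":
    for key, value in summarize_attach_error(SAMPLE).items():
        print(f"{key}: {value}")
```

Run against the three attach events, this gives the same volume, child, and nexus each time, and the extracted nexus UUID matches the suffix of the PVC name. That suggests a single failure (registering the reservation key on the child) being retried, rather than several distinct errors.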
Environment
- Talos version: 1.7.2
- Kubernetes version: v1.28.3
- Platform: AWS EC2
There are many errors here; some of them might be related to Talos and some might not. It would be nice to have a small reproducer test case (something minimal you could kubectl apply) so that we can start looking into it.