topolvm / topolvm

Capacity-aware CSI plugin for Kubernetes

Startup taint support

ryanfaircloth opened this issue

There is a race condition where pods may be scheduled before node initialization completes. Please implement a feature similar to the one in the EBS and EFS CSI components:

kubernetes-sigs/aws-ebs-csi-driver#1232

Hi @ryanfaircloth

My first impression is that using the taint to control the startup order will be complicated if many services adopt the same strategy.

Anyway, I will discuss this topic with other maintainers next week, so please wait for a while.

@ryanfaircloth

I talked with other maintainers about this issue.
We think that the Pod Scheduling Readiness is suitable for this purpose.

For example:

  1. Use a mutating webhook to add a scheduling gate to pods that use a TopoLVM PVC whose PV is already bound.
  2. When topolvm-node becomes ready on the node, delete the gate (a rough sketch follows).
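
For example, step 2 could look roughly like this with client-go. This is only a sketch, not TopoLVM code; the gate name topolvm.io/pvc-bound, the package, and the function name are assumptions for illustration.

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Gate name added by the webhook in step 1; hypothetical, for illustration only.
const gateName = "topolvm.io/pvc-bound"

// removeSchedulingGate drops the gate from a pending pod so the scheduler
// starts considering it. Scheduling gates may only be removed, never added,
// after a pod has been created.
func removeSchedulingGate(ctx context.Context, cs kubernetes.Interface, pod *corev1.Pod) error {
	kept := make([]corev1.PodSchedulingGate, 0, len(pod.Spec.SchedulingGates))
	for _, g := range pod.Spec.SchedulingGates {
		if g.Name != gateName {
			kept = append(kept, g)
		}
	}
	pod.Spec.SchedulingGates = kept
	_, err := cs.CoreV1().Pods(pod.Namespace).Update(ctx, pod, metav1.UpdateOptions{})
	return err
}
```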

If you create a PR, we can review it.

I think this is not the same use case as pod scheduling readiness. What I am looking to do is more like the startupTaints use case in the Cilium, EBS, and EFS CNI/CSI components: when a DS must be ready before a pod is suitable for scheduling, the platform is configured to add a taint at node startup, and the DS pod removes it when it becomes ready. See the pattern and solution for EBS.
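
To make that pattern concrete, here is a rough sketch of the taint-removal half: the node DS pod clears the startup taint from its own node once the plugin is ready. The taint key topolvm.io/agent-not-ready and the function name are my illustrative assumptions, not the actual keys used by EBS, EFS, or Cilium, nor what my PR proposes verbatim.

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Taint key the platform would add at node startup; hypothetical, for illustration.
const startupTaintKey = "topolvm.io/agent-not-ready"

// clearStartupTaint removes the startup taint from the plugin's own node once
// the node plugin is ready, unblocking normally scheduled pods.
func clearStartupTaint(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	kept := node.Spec.Taints[:0]
	for _, t := range node.Spec.Taints {
		if t.Key != startupTaintKey {
			kept = append(kept, t)
		}
	}
	node.Spec.Taints = kept
	_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```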

I read EBS's PR, and I still have a concern about what I said in the first comment.

My first impression is that using the taint to control the startup order will be complicated if many services adopt the same strategy.

If topolvm and cilium reside in the same cluster, I guess cilium also needs to tolerate topolvm's taint. If many services use their own taint for this use case, we need to manage which components tolerate which taints. That can be bothersome.

Also, I don't understand why you think the pod scheduling readiness is not suitable for this use case. The pod scheduling readiness is used to control when a pod should be ready for scheduling, so I believe this is exactly the kind of use case for which pod scheduling readiness is intended.

For the node DS, defaulting to tolerate all taints is a safe option for most deployments. Readiness gates, as a feature and in every use of them I can find, are for the pod to confirm it is ready so that pod disruption does not impact service availability. Take the ALB use of this gate, where ALB is a hyperscale external load balancer: after a pod is "ready" from the perspective of the control plane, it may not yet be "ready" from the perspective of an outside HTTP client because the ALB target may not have been updated, so the control plane should not take down another pod until the ALB is actually sending traffic. As I understand it, that is the purpose of readiness gates for pods. Here we are actually talking about node readiness: the CNI and CSI need to be ready.
The Amazon EFS/EBS and Azure CSI node-level DS components are deployed with a simple Exists toleration (sketched below), because no matter what taint is added at runtime we would never want to deschedule the DS, and CNI/CSI pods can start up out of order as long as they do not need the CSI for network access.
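
For concreteness, this is roughly what that catch-all toleration looks like when built with client-go types. It is a sketch under my assumptions, not the actual EBS/EFS/Azure manifests.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// tolerateAllTaints returns the catch-all toleration commonly given to
// node-level CSI/CNI DaemonSets: an empty key with Operator Exists matches
// every taint regardless of key, value, or effect.
func tolerateAllTaints() []corev1.Toleration {
	return []corev1.Toleration{{Operator: corev1.TolerationOpExists}}
}
```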

For the node DS, defaulting to tolerate all taints is a safe option for most deployments.

OK. I understand it.

Readiness gates, as a feature and in every use of them I can find, are for the pod to confirm it is ready so that pod disruption does not impact service availability

I'm not sure if that is correct. I think the pod scheduling readiness is just to control when a pod should be considered ready for scheduling. Please see the assumed user stories in the KEP.

Here we are actually talking about node readiness: the CNI and CSI need to be ready.

IMO, the condition is too strong if you consider that the node is not ready until the CSI node plugin pod is ready. If topolvm-node fails to start, pods that are not dependent on topolvm will also not start.

Also, as noted in the KEP for pod scheduling readiness, continuing to try to schedule pods that depend on topolvm-node before it is ready is wasteful and may delay the actual scheduling.

For these reasons, I still think that the pod scheduling readiness is better for this case.

@peng225 I still can't find anyone applying pod readiness gates the way you describe; the only use cases I can track down are more like the ALB use case and less like the EBS, EFS, and Cilium use cases. They may not be incompatible in the end, and I am not saying you are wrong, but I am solving a node readiness problem, and the pod readiness gate seems to solve a problem with pods that would otherwise appear ready without actually being ready. The PR I am proposing, #824, would not make the use of a startup taint mandatory, but would support it when it solves the problem I have, where a pod is started slightly before the topolvm provisioner. To be specific about why this is likely to happen to me more than others: I use Karpenter in AWS to select and provision "just what I need while I need it". The PVCs created by TopoLVM for me are cache partitions, essentially ephemeral cached copies of objects stored in S3. Whereas in most other use cases I can imagine the nodes exist or are created well before use, mine are always created as needed.

@ryanfaircloth

The PR I am proposing, #824, would not make the use of a startup taint mandatory, but would support it when it solves the problem I have, where a pod is started slightly before the topolvm provisioner.

I understand it.

Now I have come to think that pod scheduling readiness is not perfectly suitable for this use case because there might be some race condition between adding and removing the scheduling gate.

I don't think that startup taint is totally unacceptable, but I still want to find a better way. Let me ask some additional questions.

  1. Even if the pod that depends on the CSI node driver starts before the CSI node pod, it can eventually work fine by retrying to start until the CSI node pod becomes ready. I think that is how things should work on Kubernetes. What makes you think that it is not acceptable for other pods to wait for the CSI plugin?
  2. This problem occurs for any kind of CSI/CNI plugin, so a more general solution might be preferable. For example, if you add a similar function to the node driver registrar sidecar container, most CSI plugins can solve the problem. You don't need to send a PR to the CSI driver project every time you introduce a new CSI plugin. Did you think about these general solutions?

One example is when using Karpenter scheduling in AWS. Several pods may launch at once, causing several node claims with instances of unequal sizes. If we allow a pod to be scheduled before TopoLVM is ready, we may not have enough available disk; waiting for the CSI to be ready seems to eliminate this potential condition. At most we are saving 2s in the best case by not using a startup taint, but I think we more often than not lose time due to retry backoffs.

If we allow a pod to be scheduled before TopoLVM is ready, we may not have enough available disk,

Sorry, but I don't understand why you may not have enough available disk in this case. Do you mean newly-created pods are not distributed evenly from the disk capacity point of view on the new nodes? If that is correct, how do you think the problem is resolved by the startup taint?

How about my second question?

This problem occurs for any kind of CSI/CNI plugin, so a more general solution might be preferable. For example, if you add a similar function to the node driver registrar sidecar container, most CSI plugins can solve the problem. You don't need to send a PR to the CSI driver project every time you introduce a new CSI plugin. Did you think about these general solutions?

If we allow a pod to be scheduled before TopoLVM is ready, we may not have enough available disk,

Sorry, but I don't understand why you may not have enough available disk in this case. Do you mean newly-created pods are not distributed evenly from the disk capacity point of view on the new nodes? If that is correct, how do you think the problem is resolved by the startup taint?

Your assumption is that available disk on nodes is the same, or that pod requests are equal. That may not be true in my use case: instances of applications are sized for their workload, local (expensive) NVMe is used as a cache tier to avoid network bottlenecks, and some instances are larger than others.

How about my second question?

This problem occurs for any kind of CSI/CNI plugin, so a more general solution might be preferable. For example, if you add a similar function to the node driver registrar sidecar container, most CSI plugins can solve the problem. You don't need to send a PR to the CSI driver project every time you introduce a new CSI plugin. Did you think about these general solutions?

A more general solution could work. Often we start with a good-enough solution until the CSI plugin teams and the upstream CSI projects can consolidate on the right way forward. As other impacted projects have already used this solution, I still think this is the best outcome, for now at least.

Your assumption is that available disk on nodes is the same, or that pod requests are equal. That may not be true in my use case: instances of applications are sized for their workload

I understand that the provisioned instances may differ in disk size, but I still don't understand why that leads to a lack of available disk.

I also don't understand if the problem persists forever until it is resolved manually, or if it is automatically resolved eventually by some Kubernetes mechanism.

I'm not sure if the problem you are explaining is really resolved by the startup taint.

We maintainers think that we want to keep the TopoLVM code base simple. I want a convincing explanation to merge your PR. I would be glad if you could elaborate more.

Thank you for your explanation.
Though I'm not fully convinced of the benefit of the startup taint, I now understand that Karpenter can recognize startup taints and handle them properly. Thus, it is beneficial for TopoLVM to support startup taints, because it enables TopoLVM to work collaboratively with Karpenter.

I'll start my review of your PR.

No problem, happy to discuss perspectives. I do agree this is likely not the best solution for the ecosystem; it is just the one we have today. Just as we started with in-tree drivers and then moved to plugins, I expect node readiness will eventually have an integration and this will be improved. I do think the community of C*I maintainers should be the ones to work with upstream toward that better day.

This issue has been automatically marked as stale because it has not had any activity for 30 days. It will be closed in a week if no further activity occurs. Thank you for your contributions.

This issue has been automatically closed due to inactivity. Please feel free to reopen this issue (or open a new one) if this still requires investigation. Thank you for your contribution.