topolvm / topolvm

Capacity-aware CSI plugin for Kubernetes

Startup taint support

ryanfaircloth opened this issue

There is a race condition where pods may be scheduled before node initialization completes. Please implement a feature similar to the one in the EBS and EFS CSI components:

kubernetes-sigs/aws-ebs-csi-driver#1232

Hi @ryanfaircloth

My first impression is that using the taint to control the startup order will be complicated if many services adopt the same strategy.

Anyway, I will discuss this topic with other maintainers next week, so please wait for a while.

@ryanfaircloth

I talked with other maintainers about this issue.
We think that the Pod Scheduling Readiness is suitable for this purpose.

For example:

  1. Use a mutating webhook to add a scheduling gate to pods that use a TopoLVM PVC whose PV is already bound.
  2. When topolvm-node becomes ready on the node, delete the gate (a rough sketch follows).
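
For example, step 2 could look roughly like this with client-go. This is only a sketch, not TopoLVM code; the gate name topolvm.io/pvc-bound, the package, and the function name are assumptions for illustration.

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Gate name added by the webhook in step 1; hypothetical, for illustration only.
const gateName = "topolvm.io/pvc-bound"

// removeSchedulingGate drops the gate from a pending pod so the scheduler
// starts considering it. Scheduling gates may only be removed, never added,
// after a pod has been created.
func removeSchedulingGate(ctx context.Context, cs kubernetes.Interface, pod *corev1.Pod) error {
	kept := make([]corev1.PodSchedulingGate, 0, len(pod.Spec.SchedulingGates))
	for _, g := range pod.Spec.SchedulingGates {
		if g.Name != gateName {
			kept = append(kept, g)
		}
	}
	pod.Spec.SchedulingGates = kept
	_, err := cs.CoreV1().Pods(pod.Namespace).Update(ctx, pod, metav1.UpdateOptions{})
	return err
}
```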

If you create a PR, we can review it.

I think this is not the same use case as pod scheduling readiness. What I am looking to do is more like the startupTaints use case in the Cilium, EBS, and EFS CNI/CSI components: when a DS must be ready before a pod is suitable for scheduling, the platform is configured to add a taint at node startup, and the DS pod removes it when it becomes ready. See the pattern and solution for EBS.
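
To make that pattern concrete, here is a rough sketch of the taint-removal half: the node DS pod clears the startup taint from its own node once the plugin is ready. The taint key topolvm.io/agent-not-ready and the function name are my illustrative assumptions, not the actual keys used by EBS, EFS, or Cilium, nor what my PR proposes verbatim.

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Taint key the platform would add at node startup; hypothetical, for illustration.
const startupTaintKey = "topolvm.io/agent-not-ready"

// clearStartupTaint removes the startup taint from the plugin's own node once
// the node plugin is ready, unblocking normally scheduled pods.
func clearStartupTaint(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	kept := node.Spec.Taints[:0]
	for _, t := range node.Spec.Taints {
		if t.Key != startupTaintKey {
			kept = append(kept, t)
		}
	}
	node.Spec.Taints = kept
	_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```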

I read EBS's PR, and I still have a concern about what I said in the first comment.

My first impression is that using the taint to control the startup order will be complicated if many services adopt the same strategy.

If topolvm and cilium reside in the same cluster, I guess cilium also needs to tolerate topolvm's taint. If many services use their own taint for this use case, we need to manage which components tolerate which taints. That can be bothersome.

Also, I don't understand why you think the pod scheduling readiness is not suitable for this use case. The pod scheduling readiness is used to control when a pod should be ready for scheduling, so I believe this is exactly the kind of use case for which pod scheduling readiness is intended.

For the node DS, defaulting to tolerate all taints is a safe option for most deployments. Readiness gates, as a feature and in every use of them I can find, are for the pod to confirm it is ready so that pod disruption does not impact service availability. Take the ALB use of this gate, where ALB is a hyperscale external load balancer: after a pod is "ready" from the perspective of the control plane, it may not yet be "ready" from the perspective of an outside HTTP client because the ALB target may not have been updated, so the control plane should not take down another pod until the ALB is actually sending traffic. As I understand it, that is the purpose of readiness gates for pods. Here we are actually talking about node readiness: the CNI and CSI need to be ready.
The Amazon EFS/EBS and Azure CSI node-level DS components are deployed with a simple Exists toleration (sketched below), because no matter what taint is added at runtime we would never want to deschedule the DS, and CNI/CSI pods can start up out of order as long as they do not need the CSI for network access.
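
For concreteness, this is roughly what that catch-all toleration looks like when built with client-go types. It is a sketch under my assumptions, not the actual EBS/EFS/Azure manifests.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// tolerateAllTaints returns the catch-all toleration commonly given to
// node-level CSI/CNI DaemonSets: an empty key with Operator Exists matches
// every taint regardless of key, value, or effect.
func tolerateAllTaints() []corev1.Toleration {
	return []corev1.Toleration{{Operator: corev1.TolerationOpExists}}
}
```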

For the node DS, defaulting to tolerate all taints is a safe option for most deployments.

OK. I understand it.

Readiness gates, as a feature and in every use of them I can find, are for the pod to confirm it is ready so that pod disruption does not impact service availability

I'm not sure if that is correct. I think the pod scheduling readiness is just to control when a pod should be considered ready for scheduling. Please see the assumed user stories in the KEP.

Here we are actually talking about node readiness: the CNI and CSI need to be ready.

IMO, the condition is too strong if you consider that the node is not ready until the CSI node plugin pod is ready. If topolvm-node fails to start, pods that are not dependent on topolvm will also not start.

Also, as noted in the KEP for pod scheduling readiness, continuing to try to schedule pods that depend on topolvm-node before it is ready is wasteful and may delay the actual scheduling.

For these reasons, I still think that the pod scheduling readiness is better for this case.

@peng225 I still can't find anyone applying pod readiness gates the way you describe; the only use cases I can track down are more like the ALB use case and less like the EBS, EFS, and Cilium use cases. They may not be incompatible in the end, and I am not saying you are wrong, but I am solving a node readiness problem, and the pod readiness gate seems to solve a problem with pods that would otherwise appear ready without actually being ready. The PR I am proposing, #824, would not make the use of a startup taint mandatory, but would support it when it solves the problem I have, where a pod is started slightly before the topolvm provisioner. To be specific about why this is likely to happen to me more than others: I use Karpenter in AWS to select and provision "just what I need while I need it". The PVCs created by TopoLVM for me are cache partitions, essentially ephemeral cached copies of objects stored in S3. Whereas in most other use cases I can imagine the nodes exist or are created well before use, mine are always created as needed.

@ryanfaircloth

The PR I am proposing, #824, would not make the use of a startup taint mandatory, but would support it when it solves the problem I have, where a pod is started slightly before the topolvm provisioner.

I understand it.

Now I have come to think that pod scheduling readiness is not perfectly suitable for this use case because there might be some race condition between adding and removing the scheduling gate.

I don't think that startup taint is totally unacceptable, but I still want to find a better way. Let me ask some additional questions.

  1. Even if the pod that depends on the CSI node driver starts before the CSI node pod, it can eventually work fine by retrying to start until the CSI node pod becomes ready. I think that is how things should work on Kubernetes. What makes you think that it is not acceptable for other pods to wait for the CSI plugin?
  2. This problem occurs for any kind of CSI/CNI plugin, so a more general solution might be preferable. For example, if you add a similar function to the node driver registrar sidecar container, most CSI plugins can solve the problem. You don't need to send a PR to the CSI driver project every time you introduce a new CSI plugin. Did you think about these general solutions?

One example is when using Karpenter scheduling in AWS. Several pods may launch at once, causing several node claims with instances of unequal sizes. If we allow a pod to be scheduled before TopoLVM is ready, we may not have enough available disk; waiting for the CSI to be ready seems to eliminate this potential condition. At most we are saving 2s in the best case by not using a startup taint, but I think we more often than not lose time due to retry backoffs.

If we allow a pod to be scheduled before TopoLVM is ready, we may not have enough available disk,

Sorry, but I don't understand why you may not have enough available disk in this case. Do you mean newly-created pods are not distributed evenly from the disk capacity point of view on the new nodes? If that is correct, how do you think the problem is resolved by the startup taint?

How about my second question?

This problem occurs for any kind of CSI/CNI plugin, so a more general solution might be preferable. For example, if you add a similar function to the node driver registrar sidecar container, most CSI plugins can solve the problem. You don't need to send a PR to the CSI driver project every time you introduce a new CSI plugin. Did you think about these general solutions?

If we allow a pod to be scheduled before TopoLVM is ready, we may not have enough available disk,

Sorry, but I don't understand why you may not have enough available disk in this case. Do you mean newly-created pods are not distributed evenly from the disk capacity point of view on the new nodes? If that is correct, how do you think the problem is resolved by the startup taint?

Your assumption is that available disk on nodes is the same, or that pod requests are equal. That may not be true in my use case: instances of applications are sized for their workload, local (expensive) NVMe is used as a cache tier to avoid network bottlenecks, and some instances are larger than others.

How about my second question?

This problem occurs for any kind of CSI/CNI plugin, so a more general solution might be preferable. For example, if you add a similar function to the node driver registrar sidecar container, most CSI plugins can solve the problem. You don't need to send a PR to the CSI driver project every time you introduce a new CSI plugin. Did you think about these general solutions?

A more general solution could work. Often we start with a good-enough solution until the CSI plugin teams and the upstream CSI projects can consolidate on the right way forward. As other impacted projects have already used this solution, I still think this is the best outcome, for now at least.

Your assumption is that available disk on nodes is the same, or that pod requests are equal. That may not be true in my use case: instances of applications are sized for their workload

I understand that the provisioned instances may differ in disk size, but I still don't understand why that leads to a lack of available disk.

I also don't understand if the problem persists forever until it is resolved manually, or if it is automatically resolved eventually by some Kubernetes mechanism.

I'm not sure if the problem you are explaining is really resolved by the startup taint.

We maintainers think that we want to keep the TopoLVM code base simple. I want a convincing explanation to merge your PR. I would be glad if you could elaborate more.

Thank you for your explanation.
Though I'm not fully convinced of the benefit of the startup taint, I now understand that Karpenter can recognize startup taints and handle them properly. Thus, it is beneficial for TopoLVM to support startup taints, because it enables TopoLVM to work collaboratively with Karpenter.

I'll start my review of your PR.

No problem, happy to discuss perspectives. I do agree this is likely not the best solution for the ecosystem; it is just the one we have today. Just as we started with in-tree drivers and then moved to plugins, I expect node readiness will eventually have an integration and this will be improved. I do think the community of C*I maintainers should be the ones to work with upstream toward that better day.

This issue has been automatically marked as stale because it has not had any activity for 30 days. It will be closed in a week if no further activity occurs. Thank you for your contributions.

This issue has been automatically closed due to inactivity. Please feel free to reopen this issue (or open a new one) if this still requires investigation. Thank you for your contribution.