kubernetes-sigs / kueue

Kubernetes-native Job Queueing

Home Page: https://kueue.sigs.k8s.io


Support Argo/Tekton workflows

ahg-g opened this issue

This is lower priority than #65, but it would be good to have an integration with a workflow framework.

Argo supports the suspend flag; the tricky part is that suspend applies to the whole workflow, meaning a QueuedWorkload would need to represent the resources of the entire workflow all at once.

Ideally, Argo would create a Job per sequential step, so that resource reservation happens one step at a time.

FYI @terrytangyuan

Also, extracted from a comment in https://bit.ly/kueue-apis (can't find the person's github)

A compromise might be a way of submitting a job in a "paused" state, so that the workflow manager can unpause it after its dependencies have been met; the job can still wait in line in the queue so it doesn't add a lot of wall-clock time. The scheduler would ignore any paused jobs until they are unpaused?

The idea is to allow for a dependent job to jump to the head of the queue when the dependencies are met.

Yes, but it essentially only jumps to the head of the line if it already was at the head of the line.
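For context, Kueue's Job integration already works roughly this way: the Job is created suspended with a queue label, and Kueue unsuspends it once it is admitted; the "paused" idea above would add a second, workflow-controlled gate on top of that. A minimal sketch (the queue name is illustrative):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: step-a                               # illustrative name
      labels:
        kueue.x-k8s.io/queue-name: team-queue    # illustrative LocalQueue name
    spec:
      suspend: true        # the Job waits in the queue; Kueue flips this to false on admission
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: main
            image: busybox
            command: ["sh", "-c", "echo running step a"]
            resources:
              requests:
                cpu: "1"
                memory: 200Mi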

I guess I'll have to read through the design doc for queue APIs in order to understand the use case better here. Any thoughts on what the integration looks like and how the two interoperate with each other?

Consider there to be two components: a queue and a scheduler.
The queue is where jobs wait in line. A scheduler picks entries to work on from the head of the line.

Sometimes in the real world, it's a family waiting in line. One member goes off to use the bathroom. If they are not back by the time it's their turn, they usually say, "let the next folks go, we're not ready yet". The scheduler in this case just ignores that entry and goes to the next entry in the queue. The option to allow jobs to be "not ready yet, don't schedule me, but still queue me" could be interesting to various workflow managers.

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

/remove-lifecycle stale

Would an integration similar to the one between Argo and Volcano work in this case?

https://github.com/volcano-sh/volcano/blob/master/example/integrations/argo/20-job-DAG.yaml

Not really. That seems to be creating a different job for each step of the workflow. Then, each job enters the queue only after the previous step has finished. This can already be accomplished with Kueue and batch/v1.Job.
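To illustrate "this can already be accomplished with Kueue and batch/v1.Job", a rough sketch (mine, not from the linked example; names are illustrative) of an Argo Workflows resource template that creates a Kueue-managed Job for a single step:

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: kueue-dag-
    spec:
      entrypoint: step-a
      templates:
      - name: step-a
        resource:                      # same pattern as the Volcano example, but with a batch/v1 Job
          action: create
          successCondition: status.succeeded > 0
          failureCondition: status.failed > 0
          manifest: |
            apiVersion: batch/v1
            kind: Job
            metadata:
              generateName: step-a-
              labels:
                kueue.x-k8s.io/queue-name: team-queue   # illustrative LocalQueue name
            spec:
              suspend: true            # Kueue unsuspends the Job once quota is available
              template:
                spec:
                  restartPolicy: Never
                  containers:
                  - name: main
                    image: busybox
                    command: ["sh", "-c", "echo step a"]
                    resources:
                      requests:
                        cpu: "1"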

We would like to enhance the experience roughly as described here: #74 (comment)

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

/remove-lifecycle stale

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

/remove-lifecycle stale

Hi, I am trying to figure out if I could use Kueue for queueing Tekton PipelineRuns (more info on tekton at tekton.dev/docs). From reading bit.ly/kueue-apis, it seems like Kueue is going to have separate controllers that create Workload objects for different types of workloads (although I'm not sure if that's the case yet).

Would it be reasonable to write a separate controller that creates Workload objects for pending PipelineRuns, and starts the PipelineRuns when the workload is admitted by the queue? I'm not sure if this is possible because it seems like kueue somehow mutates the workloads' node affinity directly, and the relationship between PipelineRuns and pod specs doesn't work in quite the same way as between Jobs and pod specs.

I'm also curious if it's possible to create a queue that is just based on count of running objects rather than their compute resource requirements.

More details on what I'm trying to do: https://github.com/tektoncd/community/blob/main/teps/0132-queueing-concurrent-runs.md

it seems like Kueue is going to have separate controllers that create Workload objects for different types of workloads (although I'm not sure if that's the case yet).

These controllers can live in the Kueue repo, the Tekton repo, or a new repo altogether.
We currently have a controller for Kubeflow MPIJob in the Kueue repo. If the Tekton community is open to having this integration, we can discuss where the best place to put it would be.

Would it be reasonable to write a separate controller that creates Workload objects for pending PipelineRuns, and starts the PipelineRuns when the workload is admitted by the queue?

Depends on what you want. When talking about workflows, there are two possibilities: (a) queue the entire workflow or (b) queue the steps.

I'm not sure if this is possible because it seems like kueue somehow mutates the workloads' node affinity directly, and the relationship between PipelineRuns and pod specs doesn't work in quite the same way as between Jobs and pod specs.

Injecting node affinities is the mechanism to support fungibility (for example: this job can run on ARM or x86; let Kueue decide where to run it based on where there is still quota). If this is not something that matters to you, you can simply not create any flavors.
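To make the fungibility point concrete, a minimal sketch using the kueue.x-k8s.io/v1beta1 API (which may postdate parts of this thread): two flavors, and Kueue admits the workload under whichever flavor still has quota, injecting the matching node affinity.

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ResourceFlavor
    metadata:
      name: arm64
    spec:
      nodeLabels:
        kubernetes.io/arch: arm64
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ResourceFlavor
    metadata:
      name: x86
    spec:
      nodeLabels:
        kubernetes.io/arch: amd64
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: mixed-arch
    spec:
      namespaceSelector: {}            # match all namespaces
      resourceGroups:
      - coveredResources: ["cpu", "memory"]
        flavors:                       # flavors are considered in the listed order
        - name: arm64
          resources:
          - name: cpu
            nominalQuota: 100
          - name: memory
            nominalQuota: 400Gi
        - name: x86
          resources:
          - name: cpu
            nominalQuota: 100
          - name: memory
            nominalQuota: 400Gi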

I'm also curious if it's possible to create a queue that is just based on count of running objects rather than their compute resource requirements.

Kueue is a quota-based system. Currently it uses pod resource requests, and we plan to add support for counting the number of pods (#485).
What kind of object would make sense to count in Tekton? I would expect that there should be resource requests somewhere.

I'll comment more when I finish reading the doc above. Thanks for sharing :)

cc @kerthcet

Thanks for your response!

These controllers can live in the Kueue repo, the Tekton repo, or a new repo altogether. We currently have a controller for Kubeflow MPIJob in the Kueue repo. If the Tekton community is open to having this integration, we can discuss where the best place to put it would be.

Still in the early exploration phase, but looking forward to discussing more what would work!

Kueue is a quota-based system. Currently it uses pod resource requests, and we plan to add support for counting the number of pods (#485). What kind of object would make sense to count in Tekton? I would expect that there should be resource requests somewhere.

Tekton uses PipelineRuns, which are DAGs of TaskRuns, and each TaskRun corresponds to a pod. One of our use cases is basically just to avoid overwhelming a kube cluster, in which case queueing based on resource requirements would be useful. However, there are some wrinkles with how we handle resource requirements, since we have containers running sequentially in a pod rather than in parallel, so the default k8s assumption that pod resource requirements are the sum of container resource requirements doesn't apply. For this reason, queueing based on TaskRun or PipelineRun count may be simpler for us. Since TaskRuns correspond to pods, queueing based on pod count would solve the TaskRun use case at least.

We also have some use cases that would probably need to be met in Tekton with a wrapper API (e.g. "I want to have only 5 PipelineRuns at a time of X Pipeline that communicates with a rate-limited service"; "I want to have only one deployment PipelineRun running at a time", etc). If we could use Kueue to create a queue of at most X TaskRuns, we'd be in good shape to design something in Tekton meeting these needs.

Since TaskRuns correspond to pods, queueing based on pod count would solve the TaskRun use case at least.

Yes, the pod count would help. But I would encourage users to also add pod requests. This is particularly important for HPC workflows. You might want dedicated CPUs and accelerators.

I agree that it wouldn't make sense to queue at a lower level than TaskRuns.

You are welcome to add a topic to our WG Batch meetings if you want to show your design proposals for queuing workflows.

https://docs.google.com/document/d/1XOeUN-K0aKmJJNq7H07r74n-mGgSFyiEDQ3ecwsGhec/edit
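To make the "at most X TaskRuns" idea above concrete, a sketch assuming the pod-count quota discussed in #485 is available as a "pods" covered resource (as in recent Kueue releases); quota values and names are illustrative:

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: tekton-queue
    spec:
      namespaceSelector: {}
      resourceGroups:
      - coveredResources: ["cpu", "pods"]   # "pods" caps the number of concurrently admitted pods
        flavors:
        - name: default-flavor
          resources:
          - name: cpu
            nominalQuota: 64
          - name: pods
            nominalQuota: 50                # at most 50 TaskRun pods admitted at a time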

One piece of feedback: we use Tekton + ArgoCD for our CI/CD pipelines. For cost effectiveness, we deploy Tekton together with other (non-production) application services, so we run into insufficient resources when there are a lot of CI runs and have to isolate them. I think queueing is important for Tekton as well.

We have waitForPodsReady, which waits until the previous job has enough pods running. I think we could expand this to something like pendingForTargetQuantity: for a Job it would still count pods, but for Tekton it would wait for a target number of PipelineRuns/TaskRuns. We would also need to implement suspend in PipelineRun/TaskRun.

I think resource management would be great for Tekton, but if not, we could also get by with watching the PipelineRun/TaskRun count. That would require refactoring Kueue, though, since resources are currently required. Just brainstorming.

Another concern is preemption; I think it could be dangerous for Tekton in some cases, such as deploying applications.
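For reference, the waitForPodsReady knob mentioned above already exists in the Kueue Configuration; roughly as below (the pendingForTargetQuantity field is only the idea floated here and does not exist):

    # Excerpt from the Kueue Configuration (kueue-manager-config)
    waitForPodsReady:
      enable: true
      timeout: 5m                     # how long to wait for the admitted workload's pods to be ready
    # pendingForTargetQuantity: ...   # hypothetical extension discussed above, not implemented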

@alculquicondor @ahg-g I added argoproj/argo-workflows#12363 to track this, which will hopefully attract more contributors to work on it.

@terrytangyuan FYI: we're working on kubernetes/kubernetes#121681 for workflow support.

It is possible to use pod-level integration via the Plain Pods approach.

We use this config snippet (from kueue-manager-config) to integrate Argo Workflows into Kueue:

          integrations:
            frameworks:
            - "pod"
            podOptions:
              # You can change namespaceSelector to define in which
              # namespaces kueue will manage the pods.
              namespaceSelector:
                matchExpressions:
                - key: kubernetes.io/metadata.name
                  operator: NotIn
                  values: [ kube-system, kueue-system ]
              # Kueue uses podSelector to manage pods with particular
              # labels. The default podSelector will match all the pods.
              podSelector:
                matchExpressions:
                - key: workflows.argoproj.io/completed
                  operator: In
                  values: [ "false", "False", "no" ]

This configuration adds a scheduling gate to each Argo Workflows pod and will only release it once there is quota available.

Thanks for putting an example here :)

Yes, that's right. The plain pod integration could potentially support Argo Workflows.
However, the plain pod integration doesn't support all Kueue features, such as partial admission, so native Argo Workflows support would be worth it.

Regarding the features not supported in the plain pod integration, see the following for more details: https://github.com/kubernetes-sigs/kueue/tree/main/keps/976-plain-pods#non-goals

Oh that's cool. How do you set up the queue-name in the Pods?

I'm not familiar with Argo. Does it have support for pods working in parallel or pods that all need to start together?

Another thing to note is that the behavior you are getting is that Pods are created only when their dependencies complete. This means that, in a busy cluster, a workflow might spend too much time waiting in the queue at each step. Is this acceptable?

It's probably acceptable for some users. Would you be willing to write a tutorial for the kueue website?

Oh that's cool. How do you set up the queue-name in the Pods?

You can use either spec.templates[].metadata or spec.podMetadata to define a queue.
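For example, a minimal sketch (the queue name is illustrative) that uses spec.podMetadata to label every pod of a Workflow so the pod integration picks them up:

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: labeled-wf-
    spec:
      entrypoint: main
      podMetadata:
        labels:
          kueue.x-k8s.io/queue-name: team-queue   # illustrative LocalQueue name
      templates:
      - name: main
        container:
          image: busybox
          command: ["sh", "-c", "echo hello"]
          resources:
            requests:
              cpu: "1"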

I'm not familiar with Argo. Does it have support for pods working in parallel or pods that all need to start together?

Argo supports parallel execution of pods, and those pods are only created when each "node" of the workflow is ready to run.
This type of integration simply prevents each pod from executing until it passes Kueue's admission checks.

Another thing to note is that the behavior you are getting is that Pods are created only when their dependencies complete. This means that, in a busy cluster, a workflow might spend too much time waiting in the queue at each step. Is this acceptable?

I'm still waiting to see how well it works. I don't expect the wait time between nodes to be a problem, but a backlog of partially complete workflows may become problematic.

Most of the use cases revolve around ETL nodes followed by process nodes and vice-versa. Depending on how the queues are configured, I could end up with too many partially complete workflows that take up ephemeral resources.

It's probably acceptable for some users. Would you be willing to write a tutorial for the kueue website?

Sure.

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

/remove-lifecycle stale

Is there any progress on supporting Argo/Tekton workflows?

I don't think anyone has followed through with it. Would you like to propose something?
I think we might require changes in both projects, but at least the Argo community is in favor of doing something: argoproj/argo-workflows#12363

@alculquicondor I'm confused. Isn't it possible to support argo-workflows indirectly through pod integration?

It is indeed possible. But a tighter integration, with atomic admission, would be beneficial.

If the user wants to run a step that contains multiple pods only when all of those pods can run, we need some way to know which pods should be in the same workload, so the pod integration alone may not be enough.
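For reference, Kueue's plain-pod integration (in recent releases) has a pod-group mechanism for exactly this: pods that share a group label and declare the group size are admitted as one Workload. A rough sketch (names and count are illustrative):

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: step-a-
      labels:
        kueue.x-k8s.io/queue-name: team-queue           # illustrative LocalQueue name
        kueue.x-k8s.io/pod-group-name: workflow-step-a  # all pods of the step share this
      annotations:
        kueue.x-k8s.io/pod-group-total-count: "3"       # admitted only when all 3 pods are accounted for
    spec:
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "echo part of step a"]
        resources:
          requests:
            cpu: "1"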

cc @Zhuzhenghao Discussion about integrating Kueue with tekton.