kubernetes-sigs / kueue

Kubernetes-native Job Queueing

Home Page: https://kueue.sigs.k8s.io


Support Argo/Tekton workflows

ahg-g opened this issue

This is lower priority than #65, but it would be good to have an integration with a workflow framework.

Argo supports the suspend flag; the tricky part is that suspend applies to the whole workflow, meaning a QueuedWorkload would need to represent the resources of the entire workflow all at once.

Ideally, Argo would create a Job per sequential step, so that resource reservation happens one step at a time.

FYI @terrytangyuan

Also, extracted from a comment in https://bit.ly/kueue-apis (can't find the person's github)

A compromise might be a way of submitting a job in a "paused" state, so that the workflow manager can unpause it after its dependencies have been met; the job can still wait in line in the queue so it doesn't add a lot of wall-clock time. The scheduler would ignore any paused jobs until they are unpaused?

The idea is to allow for a dependent job to jump to the head of the queue when the dependencies are met.

Yes, but it essentially only jumps to the head of the line if it already was at the head of the line.
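For context, Kueue's Job integration already works roughly this way: the Job is created suspended with a queue label, and Kueue unsuspends it once it is admitted; the "paused" idea above would add a second, workflow-controlled gate on top of that. A minimal sketch (the queue name is illustrative):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: step-a                               # illustrative name
      labels:
        kueue.x-k8s.io/queue-name: team-queue    # illustrative LocalQueue name
    spec:
      suspend: true        # the Job waits in the queue; Kueue flips this to false on admission
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: main
            image: busybox
            command: ["sh", "-c", "echo running step a"]
            resources:
              requests:
                cpu: "1"
                memory: 200Mi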

I guess I'll have to read through the design doc for queue APIs in order to understand the use case better here. Any thoughts on what the integration looks like and how the two interoperate with each other?

Consider there to be two components: a queue and a scheduler.
The queue is where jobs wait in line. A scheduler picks entries to work on from the head of the line.

Sometimes in the real world, it's a family waiting in line. One member goes off to use the bathroom. If they are not back by the time it's their turn, they usually say, "let the next folks go, we're not ready yet". The scheduler in this case just ignores that entry and goes to the next entry in the queue. The option to allow jobs to be "not ready yet, don't schedule me, but still queue me" could be interesting to various workflow managers.

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

/remove-lifecycle stale

Would an integration similar to the one between Argo and Volcano work in this case?

https://github.com/volcano-sh/volcano/blob/master/example/integrations/argo/20-job-DAG.yaml

Not really. That seems to be creating a different job for each step of the workflow. Then, each job enters the queue only after the previous step has finished. This can already be accomplished with Kueue and batch/v1.Job.
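To illustrate "this can already be accomplished with Kueue and batch/v1.Job", a rough sketch (mine, not from the linked example; names are illustrative) of an Argo Workflows resource template that creates a Kueue-managed Job for a single step:

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: kueue-dag-
    spec:
      entrypoint: step-a
      templates:
      - name: step-a
        resource:                      # same pattern as the Volcano example, but with a batch/v1 Job
          action: create
          successCondition: status.succeeded > 0
          failureCondition: status.failed > 0
          manifest: |
            apiVersion: batch/v1
            kind: Job
            metadata:
              generateName: step-a-
              labels:
                kueue.x-k8s.io/queue-name: team-queue   # illustrative LocalQueue name
            spec:
              suspend: true            # Kueue unsuspends the Job once quota is available
              template:
                spec:
                  restartPolicy: Never
                  containers:
                  - name: main
                    image: busybox
                    command: ["sh", "-c", "echo step a"]
                    resources:
                      requests:
                        cpu: "1"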

We would like to enhance the experience roughly as described here: #74 (comment)

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

/remove-lifecycle stale

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

/remove-lifecycle stale

Hi, I am trying to figure out if I could use Kueue for queueing Tekton PipelineRuns (more info on tekton at tekton.dev/docs). From reading bit.ly/kueue-apis, it seems like Kueue is going to have separate controllers that create Workload objects for different types of workloads (although I'm not sure if that's the case yet).

Would it be reasonable to write a separate controller that creates Workload objects for pending PipelineRuns, and starts the PipelineRuns when the workload is admitted by the queue? I'm not sure if this is possible because it seems like kueue somehow mutates the workloads' node affinity directly, and the relationship between PipelineRuns and pod specs doesn't work in quite the same way as between Jobs and pod specs.

I'm also curious if it's possible to create a queue that is just based on count of running objects rather than their compute resource requirements.

More details on what I'm trying to do: https://github.com/tektoncd/community/blob/main/teps/0132-queueing-concurrent-runs.md

it seems like Kueue is going to have separate controllers that create Workload objects for different types of workloads (although I'm not sure if that's the case yet).

These controllers can live in the Kueue repo, the Tekton repo, or a new repo altogether.
We currently have a controller for Kubeflow MPIJob in the Kueue repo. If the Tekton community is open to having this integration, we can discuss where the best place to put it would be.

Would it be reasonable to write a separate controller that creates Workload objects for pending PipelineRuns, and starts the PipelineRuns when the workload is admitted by the queue?

Depends on what you want. When talking about workflows, there are two possibilities: (a) queue the entire workflow or (b) queue the steps.

I'm not sure if this is possible because it seems like kueue somehow mutates the workloads' node affinity directly, and the relationship between PipelineRuns and pod specs doesn't work in quite the same way as between Jobs and pod specs.

Injecting node affinities is the mechanism to support fungibility (for example: this job can run on ARM or x86; let Kueue decide where to run it based on where there is still quota). If this is not something that matters to you, you can simply not create any flavors.
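To make the fungibility point concrete, a minimal sketch using the kueue.x-k8s.io/v1beta1 API (which may postdate parts of this thread): two flavors, and Kueue admits the workload under whichever flavor still has quota, injecting the matching node affinity.

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ResourceFlavor
    metadata:
      name: arm64
    spec:
      nodeLabels:
        kubernetes.io/arch: arm64
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ResourceFlavor
    metadata:
      name: x86
    spec:
      nodeLabels:
        kubernetes.io/arch: amd64
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: mixed-arch
    spec:
      namespaceSelector: {}            # match all namespaces
      resourceGroups:
      - coveredResources: ["cpu", "memory"]
        flavors:                       # flavors are considered in the listed order
        - name: arm64
          resources:
          - name: cpu
            nominalQuota: 100
          - name: memory
            nominalQuota: 400Gi
        - name: x86
          resources:
          - name: cpu
            nominalQuota: 100
          - name: memory
            nominalQuota: 400Gi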

I'm also curious if it's possible to create a queue that is just based on count of running objects rather than their compute resource requirements.

Kueue is a quota-based system. Currently it uses pod resource requests, and we plan to add support for counting the number of pods (#485).
What kind of object would make sense to count in Tekton? I would expect that there should be resource requests somewhere.

I'll comment more when I finish reading the doc above. Thanks for sharing :)

cc @kerthcet

Thanks for your response!

These controllers can live in the Kueue repo, the Tekton repo, or a new repo altogether. We currently have a controller for Kubeflow MPIJob in the Kueue repo. If the Tekton community is open to having this integration, we can discuss where the best place to put it would be.

Still in the early exploration phase, but looking forward to discussing more what would work!

Kueue is a quota-based system. Currently it uses pod resource requests, and we plan to add support for counting the number of pods (#485). What kind of object would make sense to count in Tekton? I would expect that there should be resource requests somewhere.

Tekton uses PipelineRuns, which are DAGs of TaskRuns, and each TaskRun corresponds to a pod. One of our use cases is basically just to avoid overwhelming a kube cluster, in which case queueing based on resource requirements would be useful. However, there are some wrinkles with how we handle resource requirements, since we have containers running sequentially in a pod rather than in parallel, so the default k8s assumption that pod resource requirements are the sum of container resource requirements doesn't apply. For this reason, queueing based on TaskRun or PipelineRun count may be simpler for us. Since TaskRuns correspond to pods, queueing based on pod count would solve the TaskRun use case at least.

We also have some use cases that would probably need to be met in Tekton with a wrapper API (e.g. "I want to have only 5 PipelineRuns at a time of X Pipeline that communicates with a rate-limited service"; "I want to have only one deployment PipelineRun running at a time", etc). If we could use Kueue to create a queue of at most X TaskRuns, we'd be in good shape to design something in Tekton meeting these needs.

Since TaskRuns correspond to pods, queueing based on pod count would solve the TaskRun use case at least.

Yes, the pod count would help. But I would encourage users to also add pod requests. This is particularly important for HPC workflows. You might want dedicated CPUs and accelerators.

I agree that it wouldn't make sense to queue at a lower level than TaskRuns.

You are welcome to add a topic to our WG Batch meetings if you want to show your design proposals for queuing workflows.

https://docs.google.com/document/d/1XOeUN-K0aKmJJNq7H07r74n-mGgSFyiEDQ3ecwsGhec/edit
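To make the "at most X TaskRuns" idea above concrete, a sketch assuming the pod-count quota discussed in #485 is available as a "pods" covered resource (as in recent Kueue releases); quota values and names are illustrative:

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: tekton-queue
    spec:
      namespaceSelector: {}
      resourceGroups:
      - coveredResources: ["cpu", "pods"]   # "pods" caps the number of concurrently admitted pods
        flavors:
        - name: default-flavor
          resources:
          - name: cpu
            nominalQuota: 64
          - name: pods
            nominalQuota: 50                # at most 50 TaskRun pods admitted at a time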

One piece of feedback: we use Tekton + ArgoCD for our CI/CD pipelines. For cost effectiveness, we deploy Tekton together with other (non-production) application services, so we run into insufficient resources when there are a lot of CI runs and have to isolate them. I think queueing is important for Tekton as well.

We have waitForPodsReady, which waits until the previous job has enough pods running. I think we could expand this to something like pendingForTargetQuantity: for a Job it would still count pods, but for Tekton it would wait for a target number of PipelineRuns/TaskRuns. We would also need to implement suspend in PipelineRun/TaskRun.

I think resource management would be great for Tekton, but if not, we could also get by with watching the PipelineRun/TaskRun count. That would require refactoring Kueue, though, since resources are currently required. Just brainstorming.

Another concern is preemption; I think it could be dangerous for Tekton in some cases, such as deploying applications.
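For reference, the waitForPodsReady knob mentioned above already exists in the Kueue Configuration; roughly as below (the pendingForTargetQuantity field is only the idea floated here and does not exist):

    # Excerpt from the Kueue Configuration (kueue-manager-config)
    waitForPodsReady:
      enable: true
      timeout: 5m                     # how long to wait for the admitted workload's pods to be ready
    # pendingForTargetQuantity: ...   # hypothetical extension discussed above, not implemented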

@alculquicondor @ahg-g I added argoproj/argo-workflows#12363 to track this, which will hopefully attract more contributors to work on it.

@terrytangyuan FYI: we're working on kubernetes/kubernetes#121681 for workflow support.

It is possible to use pod-level integration via the Plain Pods approach.

We use this config snippet (from kueue-manager-config) to integrate Argo Workflows into Kueue:

          integrations:
            frameworks:
            - "pod"
            podOptions:
              # You can change namespaceSelector to define in which
              # namespaces kueue will manage the pods.
              namespaceSelector:
                matchExpressions:
                - key: kubernetes.io/metadata.name
                  operator: NotIn
                  values: [ kube-system, kueue-system ]
              # Kueue uses podSelector to manage pods with particular
              # labels. The default podSelector will match all the pods.
              podSelector:
                matchExpressions:
                - key: workflows.argoproj.io/completed
                  operator: In
                  values: [ "false", "False", "no" ]

This configuration adds a scheduling gate to each Argo Workflows pod and will only release it once there is quota available.

Thanks for putting an example here :)

Yes, that's right. The plain pod integration could potentially support Argo Workflows.
However, the plain pod integration doesn't support all Kueue features, such as partial admission, so native Argo Workflows support would be worth it.

Regarding the features not supported in the plain pod integration, see the following for more details: https://github.com/kubernetes-sigs/kueue/tree/main/keps/976-plain-pods#non-goals

Oh that's cool. How do you set up the queue-name in the Pods?

I'm not familiar with Argo. Does it have support for pods working in parallel or pods that all need to start together?

Another thing to note is that the behavior you are getting is that Pods are created only when their dependencies complete. This means that, in a busy cluster, a workflow might spend too much time waiting in the queue at each step. Is this acceptable?

It's probably acceptable for some users. Would you be willing to write a tutorial for the kueue website?

Oh that's cool. How do you set up the queue-name in the Pods?

You can use either spec.templates[].metadata or spec.podMetadata to define a queue.
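For example, a minimal sketch (the queue name is illustrative) that uses spec.podMetadata to label every pod of a Workflow so the pod integration picks them up:

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: labeled-wf-
    spec:
      entrypoint: main
      podMetadata:
        labels:
          kueue.x-k8s.io/queue-name: team-queue   # illustrative LocalQueue name
      templates:
      - name: main
        container:
          image: busybox
          command: ["sh", "-c", "echo hello"]
          resources:
            requests:
              cpu: "1"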

I'm not familiar with Argo. Does it have support for pods working in parallel or pods that all need to start together?

Argo supports parallel execution of pods, and those pods are only created when each "node" of the workflow is ready to run.
This type of integration simply prevents each pod from executing until it passes Kueue's admission checks.

Another thing to note is that the behavior you are getting is that Pods are created only when their dependencies complete. This means that, in a busy cluster, a workflow might spend too much time waiting in the queue at each step. Is this acceptable?

I'm still waiting to see how well it works. I don't expect the wait time between nodes to be a problem, but a backlog of partially complete workflows may become problematic.

Most of the use cases revolve around ETL nodes followed by process nodes and vice-versa. Depending on how the queues are configured, I could end up with too many partially complete workflows that take up ephemeral resources.

It's probably acceptable for some users. Would you be willing to write a tutorial for the kueue website?

Sure.

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

/remove-lifecycle stale

Is there any progress on supporting Argo/Tekton workflows?

I don't think anyone has followed through with it. Would you like to propose something?
I think we might require changes in both projects, but at least the Argo community is in favor of doing something: argoproj/argo-workflows#12363

@alculquicondor I'm confused. Isn't it possible to support argo-workflows indirectly through pod integration?

It is indeed possible. But a tighter integration, with atomic admission, would be beneficial.

If the user wants to run a step that contains multiple pods only when all of those pods can run, we need some way to know which pods should be in the same workload, so the pod integration alone may not be enough.
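For reference, Kueue's plain-pod integration (in recent releases) has a pod-group mechanism for exactly this: pods that share a group label and declare the group size are admitted as one Workload. A rough sketch (names and count are illustrative):

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: step-a-
      labels:
        kueue.x-k8s.io/queue-name: team-queue           # illustrative LocalQueue name
        kueue.x-k8s.io/pod-group-name: workflow-step-a  # all pods of the step share this
      annotations:
        kueue.x-k8s.io/pod-group-total-count: "3"       # admitted only when all 3 pods are accounted for
    spec:
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "echo part of step a"]
        resources:
          requests:
            cpu: "1"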

cc @Zhuzhenghao Discussion about integrating Kueue with tekton.