google / xpk

xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.

Make XPK Handle multiple slice sizes

rwitten opened this issue · comments

N queues, 1 per slice size, 1 cluster.

(This is complicated!)

As I understand it, this would be multiple ResourceFlavors, one per slice size with its own chip count, all still in one ClusterQueue. https://kubernetes.io/blog/2022/10/04/introducing-kueue/#example-use-case
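To make the "several flavors, one ClusterQueue" idea concrete, here is a rough sketch that builds the Kueue objects as plain Python dicts. It assumes the `kueue.x-k8s.io/v1beta1` API and the `google.com/tpu` resource name. The function name, the queue name `cluster-queue`, the node label key `example.com/tpu-slice-type`, and the caller-supplied `chips_per_slice` numbers are all hypothetical, not anything xpk does today.

```python
from typing import Dict, List

TPU_RESOURCE = "google.com/tpu"  # assumed resource name for TPU chips on GKE


def build_kueue_objects(slices: Dict[str, Dict[str, int]]) -> List[dict]:
    """Return one ResourceFlavor per slice type plus a single ClusterQueue.

    `slices` maps a slice type to {"num_slices": ..., "chips_per_slice": ...}.
    """
    flavors = []
    quotas = []
    for tpu_type, cfg in slices.items():
        flavors.append({
            "apiVersion": "kueue.x-k8s.io/v1beta1",
            "kind": "ResourceFlavor",
            "metadata": {"name": tpu_type},
            # Hypothetical label; a real cluster would use its node pool labels.
            "spec": {"nodeLabels": {"example.com/tpu-slice-type": tpu_type}},
        })
        quotas.append({
            "name": tpu_type,
            "resources": [{
                "name": TPU_RESOURCE,
                # Quota for this flavor = total chips across its slices.
                "nominalQuota": cfg["num_slices"] * cfg["chips_per_slice"],
            }],
        })
    cluster_queue = {
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "ClusterQueue",
        "metadata": {"name": "cluster-queue"},  # hypothetical name
        "spec": {
            "namespaceSelector": {},  # empty selector matches all namespaces
            "resourceGroups": [{
                "coveredResources": [TPU_RESOURCE],
                "flavors": quotas,
            }],
        },
    }
    return flavors + [cluster_queue]
```

The returned dicts could be dumped to YAML and applied with `kubectl apply`; each flavor carries its own quota, so a workload only gets admitted against the slice size it asks for.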

Maybe it looks something like this:

xpk cluster create --cluster=my-cluster --tpu-types v5p-128,5 v5p-256,5   # type, num_slices
  • make sure that rerunning cluster create is aware of the cluster's heterogeneity
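A possible way to parse the proposed `TYPE,NUM_SLICES` values, as a minimal sketch only: the `--tpu-types` flag is just the suggestion above and does not exist in xpk, and `SliceSpec` and `parse_tpu_types` are hypothetical names. It assumes the values arrive as a list of strings, for example from an argparse argument declared with `nargs="+"`.

```python
from typing import List, NamedTuple


class SliceSpec(NamedTuple):
    tpu_type: str
    num_slices: int


def parse_tpu_types(values: List[str]) -> List[SliceSpec]:
    """Parse values like ["v5p-128,5", "v5p-256,5"] into SliceSpec entries."""
    specs = []
    seen = set()
    for value in values:
        tpu_type, sep, count = value.partition(",")
        # Require both parts and a positive integer slice count.
        if not sep or not tpu_type or not count.isdigit() or int(count) < 1:
            raise ValueError(f"expected TYPE,NUM_SLICES but got {value!r}")
        # Reject repeats so each slice type maps to exactly one count.
        if tpu_type in seen:
            raise ValueError(f"duplicate slice type {tpu_type!r}")
        seen.add(tpu_type)
        specs.append(SliceSpec(tpu_type, int(count)))
    return specs
```

On a rerun, the parsed list could be compared against the node pools that already exist, so the command only adds or resizes what changed instead of assuming a single slice type.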

One aspect of this: the overall goal is to let the GKE cluster figure out which slice types it needs from a set of user-provided options, such as a chip budget, and from the incoming requests.

We can probably also set some minimums / starting points.
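One naive way to read "chip budget + incoming requests + minimums" is a greedy pass, sketched below. Everything here is an assumption for illustration: `plan_slices` is a hypothetical helper, the chip counts per slice type are supplied by the caller, requests are handled in arrival order, and nothing is ever scaled back down.

```python
from typing import Dict, List


def plan_slices(
    chip_budget: int,
    chips_per_slice: Dict[str, int],
    minimums: Dict[str, int],
    pending: List[str],
) -> Dict[str, int]:
    """Pick how many slices of each type to provision.

    `pending` lists the slice type each queued request wants, in arrival
    order; every type in it must appear in `chips_per_slice`.
    """
    # Start from the configured minimums (the "starting points").
    plan = {t: minimums.get(t, 0) for t in chips_per_slice}
    used = sum(n * chips_per_slice[t] for t, n in plan.items())
    if used > chip_budget:
        raise ValueError("minimums exceed chip budget")

    free = dict(plan)  # slices not yet claimed by a pending request
    for tpu_type in pending:
        if free[tpu_type] > 0:
            free[tpu_type] -= 1  # reuse a slice we already planned
            continue
        cost = chips_per_slice[tpu_type]
        if used + cost <= chip_budget:
            plan[tpu_type] += 1  # grow the cluster for this request
            used += cost
        # Otherwise the request waits in the queue; the budget is spent.
    return plan
```

A real policy would likely need fairness between slice sizes and some way to reclaim idle slices, which this sketch ignores.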