google / xpk

xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.



Wrong TPU VM count when the cluster has the string tpu in its name

williampispico opened this issue

Hi,

I noticed that when I run python3 xpk.py cluster describe --cluster <MYCLUSTER>, the number of TPU VMs reported in the output can be wrong if the cluster name contains the string tpu (e.g., my-tpu-cluster). The command kubectl get node --no-headers=true | grep '\-tpu\-' | wc -l matches on node names, and GKE node names include the cluster name, so it also counts the default-pool nodes as TPU nodes and reports an inflated value.
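
To make the failure concrete, here is a small Python sketch that mimics the grep-based count on a hypothetical list of node names for a cluster called my-tpu-cluster. The node and pool names are illustrative assumptions, not output from a real cluster.

```python
import re

# Hypothetical node names for a cluster named "my-tpu-cluster".
# Because the cluster name is embedded in every node name, all of
# them contain the substring "-tpu-", including the default-pool node.
nodes = [
    "gke-my-tpu-cluster-default-pool-1a2b3c4d-abcd",
    "gke-my-tpu-cluster-tpu-pool-5e6f7a8b-efgh",
    "gke-my-tpu-cluster-tpu-pool-5e6f7a8b-ijkl",
]

# Rough equivalent of: kubectl get node --no-headers=true | grep '\-tpu\-' | wc -l
grep_pattern = re.compile(r"-tpu-")
grep_count = sum(1 for name in nodes if grep_pattern.search(name))

print(grep_count)  # 3, although only 2 of these hypothetical nodes are TPU nodes
```

The name-based match cannot tell the default-pool node apart from the TPU nodes, which is exactly the miscount described above.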

Here is the line in xpk.py I am referring to: https://github.com/google/xpk/blob/main/xpk.py#L1186

A possible fix is to select nodes by label instead of by name: kubectl get node --no-headers=true --selector='cloud.google.com/gke-tpu-accelerator' | wc -l
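
A key-only selector like the one above matches every node that carries the label, whatever its value. The following Python sketch applies the same existence check to the JSON structure that kubectl get nodes -o json returns (items[].metadata.labels). The function name count_tpu_nodes and the sample data, including the label value tpu-v4-podslice, are hypothetical and only serve to illustrate the idea; this is not xpk's actual implementation.

```python
import json

TPU_LABEL = "cloud.google.com/gke-tpu-accelerator"


def count_tpu_nodes(nodes_json: str) -> int:
    """Count nodes carrying the TPU accelerator label, ignoring node names.

    Same idea as: kubectl get node --selector='cloud.google.com/gke-tpu-accelerator'
    """
    items = json.loads(nodes_json).get("items", [])
    return sum(
        1
        for node in items
        # "labels" may be missing or null, so fall back to an empty dict.
        if TPU_LABEL in (node.get("metadata", {}).get("labels") or {})
    )


# Hypothetical nodes: the name still contains "-tpu-", but only the
# label decides whether a node is counted.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "gke-my-tpu-cluster-default-pool-1a2b3c4d-abcd",
                      "labels": {"cloud.google.com/gke-nodepool": "default-pool"}}},
        {"metadata": {"name": "gke-my-tpu-cluster-tpu-pool-5e6f7a8b-efgh",
                      "labels": {TPU_LABEL: "tpu-v4-podslice"}}},
        {"metadata": {"name": "gke-my-tpu-cluster-tpu-pool-5e6f7a8b-ijkl",
                      "labels": {TPU_LABEL: "tpu-v4-podslice"}}},
    ]
})

print(count_tpu_nodes(sample))  # 2
```

Against a live cluster, the input could come from subprocess.run(["kubectl", "get", "nodes", "-o", "json"], capture_output=True, text=True, check=True).stdout. Filtering on the label rather than the name keeps the count correct regardless of how the cluster or node pools are named.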

Thank you, William, for the bug report and the fix! Will get on this shortly. :)