Failing test due to pv expecting a "topology.gke.io/zone" label that is not set in OSS kubernetes nodes on GCE

jbtk opened this issue 16 days ago · comments

Which jobs are failing?

autoscaling e2e test: Kubernetes e2e suite.[It] [sig-autoscaling] Cluster size autoscaling [Slow] should increase cluster size if pod requesting volume is pending [Feature:ClusterSizeAutoscalingScaleUp]

Which tests are failing?

Kubernetes e2e suite.[It] [sig-autoscaling] Cluster size autoscaling [Slow] should increase cluster size if pod requesting volume is pending [Feature:ClusterSizeAutoscalingScaleUp]

Since when has it been failing?

Not sure. It has been failing for the entire history shown in the testgrid.

Testgrid link

https://testgrid.k8s.io/sig-autoscaling-cluster-autoscaler#gci-gce-autoscaling

Reason for failure (if possible)

The scheduler translates the PV requirements from in-tree to CSI, and the translated requirements demand a "topology.gke.io/zone" label. That label is not set on nodes running OSS Kubernetes (started with the kube-up script).
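To make the mismatch concrete, here is a minimal sketch of a label match against a single required key. It is a simplified model, not the scheduler's or csi-translation-lib's actual code: the `requirement` type and `nodeSatisfies` function are hypothetical, and the assumption that the untranslated PV would have matched on "topology.kubernetes.io/zone" is mine, included only for comparison.

```go
package main

import "fmt"

// requirement is a hypothetical, simplified stand-in for one node-affinity
// match expression with an "In" operator. It is not the Kubernetes API type.
type requirement struct {
	Key    string
	Values []string
}

// nodeSatisfies reports whether labels contains req.Key with one of the
// allowed values. A missing key never matches.
func nodeSatisfies(labels map[string]string, req requirement) bool {
	got, ok := labels[req.Key]
	if !ok {
		return false
	}
	for _, v := range req.Values {
		if v == got {
			return true
		}
	}
	return false
}

func main() {
	// Zone labels as reported on the kube-up node in this issue.
	nodeLabels := map[string]string{
		"failure-domain.beta.kubernetes.io/zone": "us-central1-b",
		"topology.kubernetes.io/zone":            "us-central1-b",
	}

	// Requirement after translation, using the key named in this issue.
	translated := requirement{Key: "topology.gke.io/zone", Values: []string{"us-central1-b"}}
	// Assumed requirement on the well-known zone key, for comparison only.
	wellKnown := requirement{Key: "topology.kubernetes.io/zone", Values: []string{"us-central1-b"}}

	fmt.Println("translated requirement satisfied:", nodeSatisfies(nodeLabels, translated))
	fmt.Println("well-known requirement satisfied:", nodeSatisfies(nodeLabels, wellKnown))
}
```

Under these assumptions the node matches the well-known key but not the translated one, which is consistent with the pod staying pending.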

Command used to start the cluster:
kubetest2 gce -v 2 --repo-root ~/src/k8s.io/kubernetes --gcp-project --legacy-mode --build --up --env=ENABLE_CUSTOM_METRICS=true --env=KUBE_ENABLE_CLUSTER_AUTOSCALER=true --env=KUBE_AUTOSCALER_MIN_NODES=3 --env=KUBE_AUTOSCALER_MAX_NODES=6 --env=KUBE_AUTOSCALER_ENABLE_SCALE_DOWN=true --env=KUBE_ADMISSION_CONTROL=NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota,Priority --env=ENABLE_POD_PRIORITY=true

The problematic code seems to be here: https://github.com/kubernetes/csi-translation-lib/blob/master/plugins/gce_pd.go#L257

I see what the problem is, but it is not clear to me what the correct behavior should be. It seems that in GKE this label is actually set on the node.

The labels that I see on the node of my cluster:
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=e2-standard-2
beta.kubernetes.io/os=linux
cloud.google.com/metadata-proxy-ready=true
failure-domain.beta.kubernetes.io/region=us-central1
failure-domain.beta.kubernetes.io/zone=us-central1-b
kubernetes.io/arch=amd64
kubernetes.io/hostname=kt2-1b77b5e4-87ae-minion-group-5pgr
kubernetes.io/os=linux
node.kubernetes.io/instance-type=e2-standard-2
topology.kubernetes.io/region=us-central1
topology.kubernetes.io/zone=us-central1-b
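For anyone who wants to check their own nodes, the sketch below parses a label dump in the same `key=value` form and reports which zone keys are absent. The helper names `parseLabels` and `missingKeys` are hypothetical, and the list of keys to check is taken only from the labels mentioned in this issue.

```go
package main

import (
	"fmt"
	"strings"
)

// parseLabels turns "key=value" lines into a map, skipping blank or
// malformed lines.
func parseLabels(dump string) map[string]string {
	labels := make(map[string]string)
	for _, line := range strings.Split(dump, "\n") {
		key, value, ok := strings.Cut(strings.TrimSpace(line), "=")
		if !ok || key == "" {
			continue
		}
		labels[key] = value
	}
	return labels
}

// missingKeys returns the entries of want that are absent from labels,
// preserving the order of want.
func missingKeys(labels map[string]string, want []string) []string {
	var missing []string
	for _, k := range want {
		if _, ok := labels[k]; !ok {
			missing = append(missing, k)
		}
	}
	return missing
}

func main() {
	// Subset of the node labels listed in this issue.
	dump := `failure-domain.beta.kubernetes.io/zone=us-central1-b
kubernetes.io/os=linux
topology.kubernetes.io/region=us-central1
topology.kubernetes.io/zone=us-central1-b`

	zoneKeys := []string{
		"topology.gke.io/zone",
		"topology.kubernetes.io/zone",
		"failure-domain.beta.kubernetes.io/zone",
	}

	fmt.Println("missing zone keys:", missingKeys(parseLabels(dump), zoneKeys))
}
```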

What the test does:

creates a PD on GCE
creates a PV for that PD
creates a PVC
tries to schedule a pod that requires this PV

Anything else we need to know?

No response

Relevant SIG(s)

/sig-storage

Kubernetes Prow Robot · Answer 1 · Tue May 07 2024 19:26:48 GMT+0800 (China Standard Time)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


Justyna Betkier · Answer 2 · Tue May 07 2024 19:28:49 GMT+0800 (China Standard Time)

/sig storage