kubernetes / kubernetes

Production-Grade Container Scheduling and Management

Failing test due to pv expecting a "topology.gke.io/zone" label that is not set in OSS kubernetes nodes on GCE

jbtk opened this issue

Which jobs are failing?

autoscaling e2e test: Kubernetes e2e suite.[It] [sig-autoscaling] Cluster size autoscaling [Slow] should increase cluster size if pod requesting volume is pending [Feature:ClusterSizeAutoscalingScaleUp]

Which tests are failing?

Kubernetes e2e suite.[It] [sig-autoscaling] Cluster size autoscaling [Slow] should increase cluster size if pod requesting volume is pending [Feature:ClusterSizeAutoscalingScaleUp]

Since when has it been failing?

Not sure. It has been failing for the entire history visible in the testgrid.

Testgrid link

https://testgrid.k8s.io/sig-autoscaling-cluster-autoscaler#gci-gce-autoscaling

Reason for failure (if possible)

The scheduler rewrites the PV requirements from in-tree to CSI, and the rewritten requirements need the node label "topology.gke.io/zone". That label is not set on nodes running OSS Kubernetes (clusters started with the kube-up script).

Command used to start the cluster:
kubetest2 gce -v 2 --repo-root ~/src/k8s.io/kubernetes --gcp-project --legacy-mode --build --up --env=ENABLE_CUSTOM_METRICS=true --env=KUBE_ENABLE_CLUSTER_AUTOSCALER=true --env=KUBE_AUTOSCALER_MIN_NODES=3 --env=KUBE_AUTOSCALER_MAX_NODES=6 --env=KUBE_AUTOSCALER_ENABLE_SCALE_DOWN=true --env=KUBE_ADMISSION_CONTROL=NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota,Priority --env=ENABLE_POD_PRIORITY=true

The problematic code seems to be here: https://github.com/kubernetes/csi-translation-lib/blob/master/plugins/gce_pd.go#L257

I can see what the problem is, but it is not clear to me what the correct behavior should be. It seems that in GKE this label is actually set on the nodes.
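
To make the rewrite described above concrete, here is a minimal sketch in Python. It is an illustration only, not the csi-translation-lib code: the function name `translate_zone_keys` and the dict shape used for the match expressions are assumptions made for this example. Only the label keys are taken from this issue.

```python
# Hypothetical illustration of the in-tree -> CSI zone key rewrite.
# This is NOT the csi-translation-lib implementation.

# Zone label keys that appear on the OSS node in this issue.
IN_TREE_ZONE_KEYS = {
    "failure-domain.beta.kubernetes.io/zone",
    "topology.kubernetes.io/zone",
}
# Key that the translated PV ends up requiring.
CSI_ZONE_KEY = "topology.gke.io/zone"


def translate_zone_keys(match_expressions):
    """Return a copy of the expressions with zone keys rewritten to the CSI key."""
    translated = []
    for expr in match_expressions:
        new_expr = dict(expr)  # shallow copy so the input is left untouched
        if expr["key"] in IN_TREE_ZONE_KEYS:
            new_expr["key"] = CSI_ZONE_KEY
        translated.append(new_expr)
    return translated


# Assumed shape of a PV node affinity requirement before translation.
in_tree = [
    {"key": "topology.kubernetes.io/zone", "operator": "In", "values": ["us-central1-b"]}
]
print(translate_zone_keys(in_tree))
```

After a rewrite like this, the requirement only names "topology.gke.io/zone", so a node that carries just the "topology.kubernetes.io/zone" label can no longer satisfy it.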

The labels I see on a node in my cluster:
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=e2-standard-2
beta.kubernetes.io/os=linux
cloud.google.com/metadata-proxy-ready=true
failure-domain.beta.kubernetes.io/region=us-central1
failure-domain.beta.kubernetes.io/zone=us-central1-b
kubernetes.io/arch=amd64
kubernetes.io/hostname=kt2-1b77b5e4-87ae-minion-group-5pgr
kubernetes.io/os=linux
node.kubernetes.io/instance-type=e2-standard-2
topology.kubernetes.io/region=us-central1
topology.kubernetes.io/zone=us-central1-b
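
The mismatch can be checked against the labels listed above with a short Python sketch. The helper `node_matches` is hypothetical and models only a simplified "In" requirement (label present and value in the allowed list). The real scheduler logic is more involved, so treat this as an approximation of why the pod stays pending.

```python
# Labels copied from the node in this issue.
NODE_LABELS = {
    "beta.kubernetes.io/arch": "amd64",
    "beta.kubernetes.io/instance-type": "e2-standard-2",
    "beta.kubernetes.io/os": "linux",
    "cloud.google.com/metadata-proxy-ready": "true",
    "failure-domain.beta.kubernetes.io/region": "us-central1",
    "failure-domain.beta.kubernetes.io/zone": "us-central1-b",
    "kubernetes.io/arch": "amd64",
    "kubernetes.io/hostname": "kt2-1b77b5e4-87ae-minion-group-5pgr",
    "kubernetes.io/os": "linux",
    "node.kubernetes.io/instance-type": "e2-standard-2",
    "topology.kubernetes.io/region": "us-central1",
    "topology.kubernetes.io/zone": "us-central1-b",
}


def node_matches(labels, key, allowed_values):
    """Simplified 'In' check: the label must exist and hold an allowed value."""
    return labels.get(key) in allowed_values


# The zone label that is present on the OSS node matches.
print(node_matches(NODE_LABELS, "topology.kubernetes.io/zone", ["us-central1-b"]))  # True
# The key required after translation is absent, so the node is rejected.
print(node_matches(NODE_LABELS, "topology.gke.io/zone", ["us-central1-b"]))  # False
```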

What the test is doing:

  • the test creates a PD on GCE
  • connects a PV to it
  • creates a PVC
  • tries to schedule a pod that requires this PV

Anything else we need to know?

No response

Relevant SIG(s)

/sig-storage

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

/sig storage