Failing test due to PV expecting a "topology.gke.io/zone" label that is not set on OSS Kubernetes nodes on GCE
jbtk opened this issue · comments
Which jobs are failing?
autoscaling e2e test: Kubernetes e2e suite.[It] [sig-autoscaling] Cluster size autoscaling [Slow] should increase cluster size if pod requesting volume is pending [Feature:ClusterSizeAutoscalingScaleUp]
Which tests are failing?
Kubernetes e2e suite.[It] [sig-autoscaling] Cluster size autoscaling [Slow] should increase cluster size if pod requesting volume is pending [Feature:ClusterSizeAutoscalingScaleUp]
Since when has it been failing?
Not sure. It has been failing for the entire history shown in the testgrid.
Testgrid link
https://testgrid.k8s.io/sig-autoscaling-cluster-autoscaler#gci-gce-autoscaling
Reason for failure (if possible)
The scheduler rewrites the PV requirements from in-tree to CSI form, and the rewritten requirements need a "topology.gke.io/zone" label on the node. That label is not set on nodes running in OSS Kubernetes (started with the kube-up script).
Command used to start the cluster:
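To make the rewrite concrete, here is a minimal Python sketch. It is a simplified model and not the code in csi-translation-lib: the helper name `translate_zone_keys` and the dict layout are hypothetical, and it assumes the translation replaces the standard zone keys in each requirement with "topology.gke.io/zone".

```python
# Simplified model of the in-tree -> CSI zone key rewrite.
# Hypothetical helper; NOT the real csi-translation-lib API.
IN_TREE_ZONE_KEYS = {
    "topology.kubernetes.io/zone",
    "failure-domain.beta.kubernetes.io/zone",
}
CSI_ZONE_KEY = "topology.gke.io/zone"


def translate_zone_keys(requirements):
    """Return a copy of the requirements with zone keys replaced by the CSI key."""
    translated = []
    for req in requirements:
        key = CSI_ZONE_KEY if req["key"] in IN_TREE_ZONE_KEYS else req["key"]
        translated.append(
            {"key": key, "operator": req["operator"], "values": list(req["values"])}
        )
    return translated


# Assumed shape of a zone requirement on an in-tree PV (illustrative only).
in_tree = [
    {"key": "topology.kubernetes.io/zone", "operator": "In", "values": ["us-central1-b"]}
]
csi = translate_zone_keys(in_tree)
```

After the rewrite, the requirement can only be satisfied by a node that carries the "topology.gke.io/zone" label.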
kubetest2 gce -v 2 --repo-root ~/src/k8s.io/kubernetes --gcp-project --legacy-mode --build --up --env=ENABLE_CUSTOM_METRICS=true --env=KUBE_ENABLE_CLUSTER_AUTOSCALER=true --env=KUBE_AUTOSCALER_MIN_NODES=3 --env=KUBE_AUTOSCALER_MAX_NODES=6 --env=KUBE_AUTOSCALER_ENABLE_SCALE_DOWN=true --env=KUBE_ADMISSION_CONTROL=NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota,Priority --env=ENABLE_POD_PRIORITY=true
The problematic code seems to be here: https://github.com/kubernetes/csi-translation-lib/blob/master/plugins/gce_pd.go#L257
I see what the problem is, but it is not clear to me what the correct behavior should be. It seems that in GKE this label is actually set on the node.
The labels that I see on a node in my cluster:
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=e2-standard-2
beta.kubernetes.io/os=linux
cloud.google.com/metadata-proxy-ready=true
failure-domain.beta.kubernetes.io/region=us-central1
failure-domain.beta.kubernetes.io/zone=us-central1-b
kubernetes.io/arch=amd64
kubernetes.io/hostname=kt2-1b77b5e4-87ae-minion-group-5pgr
kubernetes.io/os=linux
node.kubernetes.io/instance-type=e2-standard-2
topology.kubernetes.io/region=us-central1
topology.kubernetes.io/zone=us-central1-b
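Checked against the labels above, a requirement keyed on "topology.gke.io/zone" cannot match, while the same requirement keyed on "topology.kubernetes.io/zone" would. A minimal Python sketch, assuming a plain `In` match on label values (the helper `matches_in` is hypothetical and far simpler than the scheduler's actual matching logic):

```python
# Labels copied from the node listing in this issue.
node_labels = {
    "beta.kubernetes.io/arch": "amd64",
    "beta.kubernetes.io/instance-type": "e2-standard-2",
    "beta.kubernetes.io/os": "linux",
    "cloud.google.com/metadata-proxy-ready": "true",
    "failure-domain.beta.kubernetes.io/region": "us-central1",
    "failure-domain.beta.kubernetes.io/zone": "us-central1-b",
    "kubernetes.io/arch": "amd64",
    "kubernetes.io/hostname": "kt2-1b77b5e4-87ae-minion-group-5pgr",
    "kubernetes.io/os": "linux",
    "node.kubernetes.io/instance-type": "e2-standard-2",
    "topology.kubernetes.io/region": "us-central1",
    "topology.kubernetes.io/zone": "us-central1-b",
}


def matches_in(labels, key, values):
    """Hypothetical 'In' check: the label must exist and hold one of the values."""
    return key in labels and labels[key] in values


# Key required after the in-tree -> CSI rewrite: absent on this node.
gke_key_matches = matches_in(node_labels, "topology.gke.io/zone", ["us-central1-b"])

# Standard zone key: present on this node with the expected value.
k8s_key_matches = matches_in(node_labels, "topology.kubernetes.io/zone", ["us-central1-b"])
```

Under these assumptions `gke_key_matches` is false for every node in the cluster, which is consistent with the pod staying pending.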
What the test is doing:
- the test creates a PD on GCE
- connects a PV to it
- creates a PVC
- tries to schedule a pod that requires this PV
Anything else we need to know?
No response
Relevant SIG(s)
/sig-storage
This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
/sig storage