kube-scheduler pod goes into CrashLoopBackOff status because of incorrect arguments
SohamChakraborty opened this issue
/kind bug
1. What kops version are you running? The command kops version will display this information.
$ kops version
Client version: 1.25.3 (git-v1.25.3)
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:58:47Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.15", GitCommit:"b84cb8ab29366daa1bba65bc67f54de2f6c34848", GitTreeState:"clean", BuildDate:"2022-12-08T10:42:57Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
We ran kops rolling-update cluster <cluster_name> --yes --cloudonly to re-create the nodes. The new master node that came up didn't have a running kube-scheduler pod.
5. What happened after the commands executed?
New nodes (master and worker) came up, but the master node stayed in NotReady status because the CNI configuration was not found in the /etc/cni/net.d directory. On further investigation, we found that the kube-scheduler pod was not running.
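For reference, the symptoms above can be confirmed with standard commands such as the following (an illustrative session, not taken from the original report):

$ kubectl get nodes                          # master node shows NotReady
$ kubectl -n kube-system get pods -o wide    # kube-scheduler pod missing or crash-looping
$ ls /etc/cni/net.d/                         # empty, no CNI configuration installed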
6. What did you expect to happen?
The master node should come up automatically in a healthy, functioning state with all kube-system pods running.
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
kind: Cluster
metadata:
  creationTimestamp: null
  generation: 68
  name: k8s.foobar.com
spec:
  api:
    loadBalancer:
      class: Network
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    App: k8s
    Env:
    Region: us-east-1
  cloudProvider: aws
  clusterAutoscaler:
    awsUseStaticInstanceList: false
    balanceSimilarNodeGroups: false
    cpuRequest: 100m
    enabled: true
    expander: least-waste
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.23.1
    memoryRequest: 300Mi
    newPodScaleUpDelay: 0s
    scaleDownDelayAfterAdd: 10m0s
    scaleDownUnneededTime: 5m0s
    scaleDownUnreadyTime: 10m0s
    scaleDownUtilizationThreshold: "0.6"
    skipNodesWithLocalStorage: true
    skipNodesWithSystemPods: true
  configBase: s3://com.foobar.k8s-state/k8s.foobar.com
  containerRuntime: docker
  dnsZone: 123456
  docker:
    experimental: true
    ipMasq: false
    ipTables: false
    logDriver: json-file
    logLevel: info
    logOpt:
    - max-size=10m
    - max-file=5
    storage: overlay2
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    memoryRequest: 100Mi
    name: events
  fileAssets:
  - content: |
      apiVersion: audit.k8s.io/v1
      kind: Policy
      rules:
      - level: Metadata
    name: audit-policy-config
    path: /var/log/audit/policy-config.yaml
    roles:
    - Master
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    auditLogMaxAge: 10
    auditLogMaxBackups: 1
    auditLogMaxSize: 100
    auditLogPath: /var/log/kube-apiserver-audit.log
    auditPolicyFile: /var/log/audit/policy-config.yaml
    auditWebhookBatchMaxWait: 5s
    auditWebhookConfigFile: /var/log/audit/webhook-config.yaml
  kubeDNS:
    provider: CoreDNS
  kubeScheduler:
    usePolicyConfigMap: true
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    maxPods: 150
    shutdownGracePeriod: 1m0s
    shutdownGracePeriodCriticalPods: 30s
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.23.15
  masterInternalName: api.internal.k8s.foobar.com
  masterPublicName: api.k8s.foobar.com
  networkCIDR: 10.4.0.0/16
  networkID: vpc-123456
  networking:
    calico:
      awsSrcDstCheck: Disable
      encapsulationMode: ipip
      ipipMode: CrossSubnet
      wireguardEnabled: true
  nonMasqueradeCIDR: 100.64.0.0/10
  rollingUpdate:
    maxSurge: 4
  sshAccess:
  - 0.0.0.0/0
  subnets:
  <SNIPPED>
  topology:
    dns:
      type: Private
    masters: private
    nodes: private
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2019-12-18T06:34:20Z"
  generation: 13
  labels:
    kops.k8s.io/cluster: k8s.foobar.com
  name: master-us-east-1a
spec:
  image: ubuntu/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20221212
  instanceMetadata:
    httpPutResponseHopLimit: 2
    httpTokens: required
  machineType: c5a.xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1a
  role: Master
  rootVolumeEncryption: true
  rootVolumeSize: 30
  subnets:
  - us-east-1a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2019-12-18T06:34:21Z"
  generation: 75
  labels:
    kops.k8s.io/cluster: k8s.foobar.com
  name: nodes-us-east-1a
spec:
  additionalUserData:
  - content: |
      apt-get update
      apt-get install -y qemu-user-static
    name: 0prereqs.sh
    type: text/x-shellscript
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/k8s.foobar.com: ""
  externalLoadBalancers:
  - targetGroupArn: <ELB_ARN>
  image: ubuntu/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20221212
  instanceMetadata:
    httpPutResponseHopLimit: 2
    httpTokens: required
  machineType: m5a.4xlarge
  maxSize: 7
  minSize: 2
  mixedInstancesPolicy:
    instances:
    - m5a.4xlarge
    - m5.4xlarge
    - m5d.4xlarge
    - m5ad.4xlarge
    - r5.4xlarge
    - r5a.4xlarge
    - r4.4xlarge
    - r5d.4xlarge
    - i3.4xlarge
    - r5ad.4xlarge
    - r5.8xlarge
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-us-east-1a
  role: Node
  rootVolumeEncryption: true
  rootVolumeSize: 100
  subnets:
  - us-east-1a
8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else we need to know?
State of the kube-scheduler pod:
kube-scheduler-ip-1-2-3-4.us-east-1.compute.internal 0/1 CrashLoopBackOff 5 (70s ago) 3m26s
Logs showed this at the end:
Error: unknown flag: --policy-configmap-namespace
2024/02/26 18:18:33 running command: exit status 1
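For context, the scheduler Policy API was removed from kube-scheduler in Kubernetes 1.23, and with it the --policy-configmap and --policy-configmap-namespace flags, which is why the scheduler now rejects them as unknown. The replacement is a KubeSchedulerConfiguration file passed via --config (kops already passes --config=/var/lib/kube-scheduler/config.yaml, as the manifest below shows). A minimal sketch of such a config follows; the disabled plugin is only illustrative and is not taken from this cluster:

apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /var/lib/kube-scheduler/kubeconfig
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      disabled:
      - name: NodeResourcesBalancedAllocation   # example only: disable one scoring plugin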
The kube-scheduler manifest was this:
$ cat /etc/kubernetes/manifests/kube-scheduler.manifest
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - args:
    - --log-file=/var/log/kube-scheduler.log
    - --also-stdout
    - /usr/local/bin/kube-scheduler
    - --authentication-kubeconfig=/var/lib/kube-scheduler/kubeconfig
    - --authorization-kubeconfig=/var/lib/kube-scheduler/kubeconfig
    - --config=/var/lib/kube-scheduler/config.yaml
    - --feature-gates=CSIMigrationAWS=true,InTreePluginAWSUnregister=true
    - --leader-elect=true
    - --policy-configmap-namespace=kube-system
    - --policy-configmap=scheduler-policy
    - --tls-cert-file=/srv/kubernetes/kube-scheduler/server.crt
    - --tls-private-key-file=/srv/kubernetes/kube-scheduler/server.key
    - --v=2
    command:
    - /go-runner
    image: registry.k8s.io/kube-scheduler:v1.23.15@sha256:9accf0bab7275b3a7704f5fcbc27d7a7820ce9209cffd4634214cfb4536fa4ca
    livenessProbe:
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    volumeMounts:
    - mountPath: /var/lib/kube-scheduler
      name: varlibkubescheduler
      readOnly: true
    - mountPath: /srv/kubernetes/kube-scheduler
      name: srvscheduler
      readOnly: true
    - mountPath: /var/log/kube-scheduler.log
      name: logfile
  hostNetwork: true
  priorityClassName: system-cluster-critical
  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  volumes:
  - hostPath:
      path: /var/lib/kube-scheduler
    name: varlibkubescheduler
  - hostPath:
      path: /srv/kubernetes/kube-scheduler
    name: srvscheduler
  - hostPath:
      path: /var/log/kube-scheduler.log
    name: logfile
status: {}
We had to remove both --policy-configmap-namespace=kube-system and --policy-configmap=scheduler-policy to get the kube-scheduler pod to run. The manifest after the change is:
$ cat /etc/kubernetes/manifests/kube-scheduler.manifest
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - args:
    - --log-file=/var/log/kube-scheduler.log
    - --also-stdout
    - /usr/local/bin/kube-scheduler
    - --authentication-kubeconfig=/var/lib/kube-scheduler/kubeconfig
    - --authorization-kubeconfig=/var/lib/kube-scheduler/kubeconfig
    - --config=/var/lib/kube-scheduler/config.yaml
    - --feature-gates=CSIMigrationAWS=true,InTreePluginAWSUnregister=true
    - --leader-elect=true
    - --tls-cert-file=/srv/kubernetes/kube-scheduler/server.crt
    - --tls-private-key-file=/srv/kubernetes/kube-scheduler/server.key
    - --v=2
    command:
    - /go-runner
    image: registry.k8s.io/kube-scheduler:v1.23.15@sha256:9accf0bab7275b3a7704f5fcbc27d7a7820ce9209cffd4634214cfb4536fa4ca
    livenessProbe:
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    volumeMounts:
    - mountPath: /var/lib/kube-scheduler
      name: varlibkubescheduler
      readOnly: true
    - mountPath: /srv/kubernetes/kube-scheduler
      name: srvscheduler
      readOnly: true
    - mountPath: /var/log/kube-scheduler.log
      name: logfile
  hostNetwork: true
  priorityClassName: system-cluster-critical
  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  volumes:
  - hostPath:
      path: /var/lib/kube-scheduler
    name: varlibkubescheduler
  - hostPath:
      path: /srv/kubernetes/kube-scheduler
    name: srvscheduler
  - hostPath:
      path: /var/log/kube-scheduler.log
    name: logfile
status: {}
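Since kube-scheduler runs as a static pod, the kubelet picks up the edited file in /etc/kubernetes/manifests automatically and recreates the pod; the result can be checked with, for example:

$ kubectl -n kube-system get pods | grep kube-scheduler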
@SohamChakraborty Thank you for reporting this issue. The fix should be part of future releases.
Please also remove kubeScheduler.usePolicyConfigMap from your config. That should fix the problem long term.
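Concretely, that means deleting this block from the cluster spec shown above, for example via kops edit cluster (a sketch assuming the default kops workflow; the cluster name is taken from the manifest above):

$ kops edit cluster k8s.foobar.com
# remove these two lines from spec:
#   kubeScheduler:
#     usePolicyConfigMap: true
$ kops update cluster k8s.foobar.com --yes
$ kops rolling-update cluster k8s.foobar.com --yes   # if the control-plane node needs to be replaced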
Thank you @hakman for fixing this quickly.