k8snetworkplumbingwg / whereabouts

A CNI IPAM plugin that assigns IP addresses cluster-wide

ip-reconciler race condition with Whereabouts, leads to IP cleanup and duplicate IPs

xagent003 opened this issue

I'm seeing another duplicate-IP issue caused by an accidental IP cleanup: the ip-reconciler CronJob appears to have a race condition with whereabouts allocating IPs. whereabouts allocates an IP to a Pod, patches the IPPool CR, and updates its resourceVersion... all good so far.

However, if the ip-reconciler Job runs at just the right time (change the schedule to 1 minute to increase the odds), it gets the updated IPPool with the new allocation, but when it fetches the Pod, the Pod's IPs have not been updated yet. This is probably because kubelet/k8s is still processing the container and has not yet added the network status with IPs and MACs. So the ip-reconciler sends a "remove" Patch operation.
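To make that window concrete, here is a rough sketch (simplified types and an assumed annotation key, not the actual whereabouts code) of how a reconciler might derive a pod's live IPs from the multus network-status annotation. If the annotation has not been written yet, the result is an empty set and every allocation for that pod looks orphaned:

```go
package sketch

import "encoding/json"

// networkStatus mirrors only the fields of the multus network-status
// annotation that matter here (simplified).
type networkStatus struct {
	Name string   `json:"name"`
	IPs  []string `json:"ips"`
}

// collectLivePodIPs is an illustrative helper, not the whereabouts source:
// it gathers the secondary IPs a reconciler would see on a pod. The exact
// annotation key is an assumption (multus has used both
// "k8s.v1.cni.cncf.io/network-status" and an older "networks-status" key).
func collectLivePodIPs(podAnnotations map[string]string) map[string]struct{} {
	ips := map[string]struct{}{}
	raw, ok := podAnnotations["k8s.v1.cni.cncf.io/network-status"]
	if !ok {
		// kubelet/multus has not written the network status yet: the map
		// stays empty, which is exactly the window this race hits.
		return ips
	}
	var statuses []networkStatus
	if err := json.Unmarshal([]byte(raw), &statuses); err != nil {
		return ips
	}
	for _, s := range statuses {
		for _, ip := range s.IPs {
			ips[ip] = struct{}{}
		}
	}
	return ips
}
```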

We can see that whereabouts on a node updated the pool with the allocation. I added my own logs in storage/kubernetes/ipam.go to print the CR, so we can see the ResourceVersion and Spec. It allocated an IP for Pod web-2:


2021-11-06T03:38:11Z [debug] PF9: re-getting IPPool...
2021-11-06T03:38:11Z [debug] PF9: GetIpPool: &{TypeMeta:{Kind: APIVersion:} ObjectMeta:{Name:10.128.40.0-24 GenerateName: Namespace:default SelfLink:/apis/whereabouts.cni.cncf.io/v1alpha1/namespaces/default/ippools/10.128.40.0-24 UID:b173c049-dc3d-4a47-8409-53c52c3a2100 ResourceVersion:733533 Generation:204 CreationTimestamp:2021-11-03 18:59:37 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[] Annotations:map[] OwnerReferences:[] Finalizers:[] ClusterName: ManagedFields:[{Manager:whereabouts Operation:Update APIVersion:whereabouts.cni.cncf.io/v1alpha1 Time:2021-11-06 03:38:11 +0000 UTC FieldsType:FieldsV1 FieldsV1:{"f:spec":{".":{},"f:allocations":{".":{},"f:45":{".":{},"f:id":{},"f:podref":{}},"f:46":{".":{},"f:id":{},"f:podref":{}},"f:47":{".":{},"f:id":{},"f:podref":{}}},"f:range":{}}}}]} Spec:{Range:10.128.40.0/24 Allocations:map[45:{ContainerID:0b403a8e43c6183225aefb128dc187535bdbacea2a63159d0679705535f6ada1 PodRef:default/web-0} 46:{ContainerID:b565baad124e2fb57d883f33a0019048354abca904bfd9f7ae18123bbe118779 PodRef:default/web-1} 47:{ContainerID:be4726eb3c4f6b092f2c93e531fe25f9a843c0876c07edddd2fc93fe653bcea2 PodRef:default/web-2}]}}
2021-11-06T03:38:11Z [debug] PF9: AFTER PATCH Allocations: [IP: 10.128.40.45 is reserved for pod: default/web-0 IP: 10.128.40.46 is reserved for pod: default/web-1 IP: 10.128.40.47 is reserved for pod: default/web-2]

At almost the same time the CronJob ran, and from custom logs I added I can see it got ResourceVersion 733533 (the one just written by whereabouts above), but the Pod has no IP annotations yet, so it removes the allocation:


2021-11-06T03:38:11Z [debug] pod reference default/web-0 matches allocation; Allocation IP: 10.128.40.45; PodIPs: map[10.128.165.32:{} 10.128.40.45:{}]
2021-11-06T03:38:11Z [debug] pod reference default/web-1 matches allocation; Allocation IP: 10.128.40.46; PodIPs: map[10.128.165.33:{} 10.128.40.46:{}]
2021-11-06T03:38:11Z [debug] pod reference default/web-2 matches allocation; Allocation IP: 10.128.40.47; PodIPs: map[]
2021-11-06T03:38:11Z [debug] pod ref default/web-2 is not listed in the live pods list

2021-11-06T03:38:11Z [debug] PF9: patch = [{remove /spec/allocations/34 <nil>}]
2021-11-06T03:38:11Z [debug] PF9: wrote patchdata: [{"op":"test","path":"/metadata/resourceVersion","value":"733528"},{"op":"remove","path":"/spec/allocations/34"}]
2021-11-06T03:38:11Z [debug] Going to update the reserve list to: [IP: 10.128.40.45 is reserved for pod: default/web-0 IP: 10.128.40.46 is reserved for pod: default/web-1]
2021-11-06T03:38:11Z [debug] PF9: patch = [{remove /spec/allocations/47 <nil>}]
2021-11-06T03:38:11Z [debug] PF9: wrote patchdata: [{"op":"test","path":"/metadata/resourceVersion","value":"733533"},{"op":"remove","path":"/spec/allocations/47"}]
2021-11-06T03:38:11Z [debug] successfully cleanup IPs: [10.128.165.34 10.128.40.47]

So IPs 10.128.165.34 and 10.128.40.47 got cleaned up from the IPPools (the Pods attach to two networks, and the issue is hit for both attachments). The next Pod in the StatefulSet gets scheduled, whereabouts allocates those same IPs, and the two Pods end up with duplicate IPs.
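For reference, the "test" + "remove" patch visible in the log above amounts to optimistic concurrency on the IPPool itself. A minimal sketch (hypothetical program, not the whereabouts code) of how such a payload is built:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsonPatchOp is a minimal RFC 6902 JSON Patch operation.
type jsonPatchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

func main() {
	// Guard the removal with a "test" on the resourceVersion that was read,
	// so the API server rejects the patch if the IPPool changed in between.
	patch := []jsonPatchOp{
		{Op: "test", Path: "/metadata/resourceVersion", Value: "733533"},
		{Op: "remove", Path: "/spec/allocations/47"},
	}
	data, _ := json.Marshal(patch)
	fmt.Println(string(data))
	// Prints a payload shaped like the "wrote patchdata" line in the log.
}
```

The important point for this issue is that the test op only detects concurrent writes to the IPPool; it cannot tell that the Pod's network status is still being populated, so the removal goes through even though the allocation is valid.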

### I think the root cause is this:

**Is there a reason the IsPodAlive() function checks whether the IP is present in the Pod, rather than only doing the cleanup if the Pod is not alive? As I understand it, the reconciler was meant for node crashes resulting in orphaned IPs, in which case the Pod shouldn't exist anymore.**

_, isFound := livePodIPs[ip]
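For context, my reading of that check (a simplified sketch with a hypothetical helper name, not the exact whereabouts source) is that an allocation is only kept if its IP already shows up among the pod's reported IPs:

```go
// keepAllocation sketches my reading of the reconciler's decision
// (hypothetical helper, not the exact whereabouts source).
func keepAllocation(livePodIPs map[string]struct{}, allocatedIP string) bool {
	// If the pod's network status hasn't been populated yet, livePodIPs is
	// empty, isFound is false, and the allocation gets cleaned up even
	// though the pod is alive and the IP was just reserved for it.
	_, isFound := livePodIPs[allocatedIP]
	return isFound
}
```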

This is also more easily reproducible with a StatefulSet, as that seems to schedule Pods serially a few seconds apart, at least on our setup. So create a StatefulSet with several Pods, let it run, and set the ip-reconciler CronJob schedule to 1 minute.

Since the trigger here is the CronJob running at the same time one of the whereabouts instances is allocating IPs, having the CronJob run more often increases the odds.

You won't hit the issue every time. Repeatedly delete and re-create the STS, checking the IPPool CR for a missing IP/podref, or checking two of the Pods for a duplicate IP.

Here is my STS YAML, nothing fancy, just nginx:

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 5 # by default is 1
  template:
    metadata:
      annotations:
          k8s.v1.cni.cncf.io/networks: whereabouts-conf, whereabouts-conf2
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: nginx
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web

"Is there a reason the IsPodAlive() function checks whether the IP is present in the Pod (_, isFound := livePodIPs[ip]), rather than only doing the cleanup if the Pod is not alive? As I understand it, the reconciler was meant for node crashes resulting in orphaned IPs, in which case the Pod shouldn't exist anymore."

The reasons for which we've decided to actually check whether pod X really owns the address can be found in #118 (comment)

We could probably remove this condition (checking if the IP is present in the pod) if we started relying on pod UIDs rather than names.

... which, AFAIU, is (right now) not an option, since you can't GET by UID: kubernetes/kubernetes#20572
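To illustrate why name-based matching is fragile here, a hedged sketch with hypothetical types (the real allocations store a podRef string such as "default/web-2", as seen in the IPPool log above): a recreated StatefulSet pod keeps its namespace/name but gets a fresh UID, so only a UID comparison could tell an old allocation from a new one.

```go
package sketch

// allocation is a hypothetical illustration of the name-vs-UID problem,
// not the whereabouts data model.
type allocation struct {
	PodRef string // "namespace/name", e.g. "default/web-2"
	PodUID string // not stored today; shown to illustrate what UID matching would need
}

// matchesByName cannot distinguish a lingering allocation from the one made
// for a recreated StatefulSet pod: both carry podRef "default/web-2".
func matchesByName(a allocation, podNamespace, podName string) bool {
	return a.PodRef == podNamespace+"/"+podName
}

// matchesByUID would disambiguate, but (per the discussion above) you cannot
// GET a pod by UID from the API server, so the reconciler would have to list
// pods and compare UIDs client-side.
func matchesByUID(a allocation, podUID string) bool {
	return a.PodUID != "" && a.PodUID == podUID
}
```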

"so it might happen that a StatefulSet Pod got new IP in the meantime, while previous IP is still reserved in IPPool pointing to the same running Pod"

Do you know under what case that other bug may happen? I was under assumption the reocnciler was for crashed/offine nodes resulting in orpaned IPs. In which case the Pod does not exist anymore.

Otherwise, I was thinking we could only do this IP check for Pods that have Phase: Running in their status(ignore the IP check for any pods in a transient or bad state). Or, we skip this check only for Pods that have Phase: Pending

Based on this, the possible phases are: Pending, Running, Succeeded, Failed, and Unknown: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/

From what Ive seen, at least for this issue, the STS Pods are in "Pending" phase as kubelet is still bringing up the Pod.

Let me know what you think? @maiqueb @dougbtv
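A minimal sketch of the Phase-based gating suggested above (hypothetical helper name, assuming the reconciler already has the corev1.Pod in hand):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// trustPodIPList sketches the proposed gating (hypothetical helper): only
// treat an allocation as stale based on the pod's reported IPs once kubelet
// has moved the pod past Pending; for Pending pods the network status may
// simply not be written yet.
func trustPodIPList(pod *corev1.Pod) bool {
	return pod.Status.Phase != corev1.PodPending
}
```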

"so it might happen that a StatefulSet Pod got new IP in the meantime, while previous IP is still reserved in IPPool pointing to the same running Pod"

Do you know under what case that other bug may happen? I was under assumption the reocnciler was for crashed/offine nodes resulting in orpaned IPs. In which case the Pod does not exist anymore.

If (for whatever reason) the pod gets removed (not gracefully, i.e. the CNI del is not sent), the allocation will linger in the IPPool. The stateful set will then create another pod - having the same name - that will be assigned a new allocation.
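An illustrative view of the IPPool state after that scenario (the first container ID is truncated from the log above, the second is made up): both entries point at the same pod name, so a purely name-based reconciler could not tell which one is stale without also looking at the pod's reported IPs.

```go
package sketch

// Illustrative allocations map after a non-graceful pod removal and
// StatefulSet re-creation; values are examples, not real cluster state.
var allocations = map[string]struct {
	ContainerID string
	PodRef      string
}{
	"47": {ContainerID: "be4726eb...", PodRef: "default/web-2"}, // lingering entry, CNI DEL never sent
	"48": {ContainerID: "aa11bb22...", PodRef: "default/web-2"}, // new allocation for the recreated pod
}
```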

"Otherwise, I was thinking we could only do this IP check for Pods that have Phase: Running in their status. Or, we could skip this check only for Pods that have Phase: Pending."

@xagent003 I think you're right: if we skip the IP address liveness check in the pod's status for Pending pods, we quite probably fix the race you're mentioning.