kubernetes / kubernetes

Production-Grade Container Scheduling and Management

Home Page: https://kubernetes.io

Kubelet takes long time to update Pod.Status.PodIP

caseydavenport opened this issue · comments

I've seen this when implementing Calico NetworkPolicy against the Kubernetes API (as opposed to via Calico's etcd)

Since the Pod status reporting is not synchronous with Pod set up, it often takes a long time for the API to get updated with the IP address of a newly networked Pod, which means that it can take multiple seconds before any NetworkPolicy implementation based off the k8s API learns about the new Pod's assigned IP.

I've seen this have an impact on 99%ile time to first connectivity being measured in seconds (because it takes seconds for kubelet -> apiserver -> controller to occur).

A naive fix would be to write the pod status immediately after CNI execution completes based on the result returned by the CNI plugin (though this might have performance impacts due to increased number of writes to the API).

CC @thockin @freehan @dchen1107 @wojtek-t

@kubernetes/sig-node-misc
@kubernetes/sig-network-misc

This is interesting to consider. We get used to the system responding ~INSTANTLY, and when it doesn't, we aren't happy.

@thockin And this actually happens for all network plugins. In this case, it should be fixed.

A naive fix would be to write the pod status immediately after CNI execution completes based on the result returned by the CNI plugin (though this might have performance impacts due to increased number of writes to the API).

NB: we have already pushed network plugins into the runtime, so it's not easy to write pod status immediately, because the kubelet doesn't know about CNI.

We could change NetworkPlugin.SetUpPod() to return (*PodNetworkStatus, error), and for plugins other than CNI, just have SetUpPod() end with "return plugin.GetPodNetworkStatus(...)"
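
For illustration, a minimal sketch of what that interface change could look like (the types and signatures here are simplified stand-ins, not the actual pkg/kubelet/network definitions):

// Sketch only: the real NetworkPlugin interface and PodNetworkStatus live in
// pkg/kubelet/network and carry more fields/parameters than shown here.
package network

import "net"

type PodNetworkStatus struct {
	IP net.IP
}

// ContainerID stands in for kubecontainer.ContainerID in this sketch.
type ContainerID struct {
	Type string
	ID   string
}

type NetworkPlugin interface {
	// Proposed change: SetUpPod returns the resulting network status
	// directly, so the caller can report the pod IP without waiting for a
	// separate status poll.
	SetUpPod(namespace, name string, id ContainerID) (*PodNetworkStatus, error)

	// Existing query path. Plugins other than CNI could implement SetUpPod
	// by doing their setup and then returning GetPodNetworkStatus(...).
	GetPodNetworkStatus(namespace, name string, id ContainerID) (*PodNetworkStatus, error)
}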

@danwinship As feiskyer implies, the kubelet does not know about the network plugin and relies on an independent loop to update pod status, while SetUpPod() is executed in another main loop. We may have to change CRI and build a channel between those loops to update pod status immediately ...

Another approach is to improve this reporting path on a best-effort basis. @caseydavenport Is it possible from your side to figure out which phase causes so much of the time? Running the kubelet with --v=4 and paying attention to this call would help to track it down.

And do you need a 100% immediate status report, or just the faster the better? (I guess it's the latter, in which case the second approach would be preferred.)

@resouer Yep, I'll give that a try. I'm traveling / vacationing right now, so might not get back here until the new year :)

And do you need a 100% immediate status report, or just the faster the better?

The issue is that if it takes 3-5 seconds for connectivity to appear after the pod has been started, some clients that try to connect at startup will fail, and not gracefully. So unless someone has written their app specifically to handle this you can end up with a broken pod.

So, it essentially needs to be immediate.

IIUC, this updates the pod status to the apiserver. We'd probably need to do it in containerRuntime.SyncPod.

Can we identify the bottleneck first and define what "acceptable" latency is? "Immediate" seems a little vague to me...

AFAIK, there are two known, relatively significant latencies in the critical path.

  1. kubelet polls the container runtime every 1s to detect changes. Shortening this needs more profiling and extra care, as it could cause a significant increase in resource usage and latency. If a change is detected, kubelet sequentially inspects the pods to update the internal pod status cache. This could take longer (a few seconds) if many pods change in one iteration, and can be optimized by parallelizing the inspections.
  2. kubelet reporting the status to the apiserver. This is essentially limited by the QPS limit of the apiserver client kubelet uses. In the case of batch creation of pods, this would become the bottleneck. See #14955 for the potential improvement on this.

I am specifically concerned about inserting random calls to force a status update for (1). These arbitrary updates are hard to track and may cause the kubelet/container runtime to be overwhelmed at times. We had a pretty bad experience in the past where each goroutine individually queried the container runtime, which only made operations slower in general. I think relying on "events" reported by the container runtime would be a better solution, but that requires a significant amount of work.
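
To illustrate the QPS limit in (2): the kubelet's API client goes through client-go's token-bucket rate limiter, so a burst of status updates gets spread out over several seconds. A minimal sketch (the values 5/10 are illustrative, not a claim about the defaults of any particular release):

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	// Illustrative values only; the kubelet's client QPS/burst are
	// configurable (e.g. via --kube-api-qps / --kube-api-burst) and the
	// defaults have changed across releases.
	limiter := flowcontrol.NewTokenBucketRateLimiter(5, 10)

	start := time.Now()
	// Pretend 50 pods changed status in one sync iteration: each PATCH to
	// the apiserver must wait for a token, so tail updates lag by seconds.
	for i := 0; i < 50; i++ {
		limiter.Accept() // blocks until a token is available
	}
	fmt.Printf("50 throttled requests took %v\n", time.Since(start))
}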

@yujuhong maybe take a look at whether #26538 can help?

Sorry it's taken me so long to get back to this issue -

I did a little bit of digging into the kubelet logs and I've found what appears to be some strange behavior. I'm following the status_manager logs for the pod client0-cali-ns-0-3203360797-3mk9m with UID 1c4d94dd-39ad-11e7-a787-42010a800122. This is running on k8s v1.6.2.

Right after the Pod is started, a status update is queued including the newly assigned IP address.

May 15 20:28:56 casey-test-default-pool-b0d6f3ea-wqtj kubelet[6842]: I0515 20:28:56.047043    6842 status_manager.go:340] Status Manager: adding pod: "1c4d94dd-39ad-11e7-a787-42010a800122", with status: ('\x02', {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2017-05-15 20:28:52 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2017-05-15 20:28:56 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2017-05-15 20:28:52 +0000 UTC  }]   10.240.0.45 10.180.13.22 2017-05-15 20:28:52 +0000 UTC [] [{ttfp {nil &ContainerStateRunning{StartedAt:2017-05-15 20:28:55 +0000 UTC,} nil} {nil nil nil} true 0 calico/ttfp:v0.6.0 docker://sha256:41a315c2fae94ab27e4242ff822b86641096208df1de4634f7f339aea9934e70 docker://93b563fd43b7805261ccc20d22e1bb3a339d82dc0ee88834410f4f6af04e1169}] Burstable}) to podStatusChannel

1 second later I see a log indicating that the status update was pulled off the queue and is being ignored because it hasn’t changed.

May 15 20:28:57 casey-test-default-pool-b0d6f3ea-wqtj kubelet[6842]: I0515 20:28:57.369531    6842 status_manager.go:325] Ignoring same status for pod "client0-cali-ns-0-3203360797-3mk9m_cali-ns-0(1c4d94dd-39ad-11e7-a787-42010a800122)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2017-05-15 20:28:52 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2017-05-15 20:28:56 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2017-05-15 20:28:52 +0000 UTC Reason: Message:}] Message: Reason: HostIP:10.240.0.45 PodIP:10.180.13.22 StartTime:2017-05-15 20:28:52 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:ttfp State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2017-05-15 20:28:55 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:calico/ttfp:v0.6.0 ImageID:docker://sha256:41a315c2fae94ab27e4242ff822b86641096208df1de4634f7f339aea9934e70 ContainerID:docker://93b563fd43b7805261ccc20d22e1bb3a339d82dc0ee88834410f4f6af04e1169}] QOSClass:Burstable}

About 30 seconds later, I see the status manager get an update in a syncbatch:

May 15 20:29:26 casey-test-default-pool-b0d6f3ea-wqtj kubelet[6842]: I0515 20:29:26.695017    6842 status_manager.go:410] Status Manager: syncPod in syncbatch. pod UID: "1c4d94dd-39ad-11e7-a787-42010a800122"

Then, I see a log indicating that the status was finally updated:

May 15 20:29:27 casey-test-default-pool-b0d6f3ea-wqtj kubelet[6842]: I0515 20:29:27.099492    6842 status_manager.go:444] Status for pod "client0-cali-ns-0-3203360797-3mk9m_cali-ns-0(1c4d94dd-39ad-11e7-a787-42010a800122)" updated successfully: {status:{Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:{Time:{sec:0 nsec:0 loc:0x4d1fec0}} LastTransitionTime:{Time:{sec:63630476932 nsec:0 loc:0x4d1fec0}} Reason: Message:} {Type:Ready Status:True LastProbeTime:{Time:{sec:0 nsec:0 loc:0x4d1fec0}} LastTransitionTime:{Time:{sec:63630476936 nsec:0 loc:0x4d1fec0}} Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:{Time:{sec:0 nsec:0 loc:0x4d1fec0}} LastTransitionTime:{Time:{sec:63630476932 nsec:0 loc:0x4d1fec0}} Reason: Message:}] Message: Reason: HostIP:10.240.0.45 PodIP:10.180.13.22 StartTime:0xc4214e56c0 InitContainerStatuses:[] ContainerStatuses:[{Name:ttfp State:{Waiting:<nil> Running:0xc422028f20 Terminated:<nil>} LastTerminationState:{Waiting:<nil> Running:<nil> Terminated:<nil>} Ready:true RestartCount:0 Image:calico/ttfp:v0.6.0 ImageID:docker://sha256:41a315c2fae94ab27e4242ff822b86641096208df1de4634f7f339aea9934e70 ContainerID:docker://93b563fd43b7805261ccc20d22e1bb3a339d82dc0ee88834410f4f6af04e1169}] QOSClass:Burstable} version:2 podName:client0-cali-ns-0-3203360797-3mk9m podNamespace:cali-ns-0}

What I find strange is the second log, indicating that the Pod status hasn't been changed and thus is being ignored.

What I find strange is the second log, indicating that the Pod status hasn't been changed and thus is being ignored.

@yujuhong - how do we detect if the pod status changed? Is it a simple "DeepEqual" operation? If so, that shouldn't happen, right? If not, maybe there is some bug in how we do the comparison?

The status manager uses the apimachinery definition of DeepEqual. Is that the first status update for that pod that you see?
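
For illustration, that comparison is roughly the following (a sketch: the real status manager also normalizes probe/transition timestamps before comparing, and details differ by version):

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
)

// statusChanged mirrors the idea of the status manager's "same status" check.
func statusChanged(oldStatus, newStatus v1.PodStatus) bool {
	return !apiequality.Semantic.DeepEqual(oldStatus, newStatus)
}

func main() {
	oldStatus := v1.PodStatus{Phase: v1.PodRunning}
	newStatus := oldStatus
	newStatus.PodIP = "10.180.13.22"
	fmt.Println(statusChanged(oldStatus, newStatus)) // true: gaining an IP is a change
}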

What I find strange is the second log, indicating that the Pod status hasn't been changed and thus is being ignored.

That means that the kubelet has the up-to-date status stored in the status manager; it doesn't mean that the status manager has already sent updates to the apiserver.

The status manager has an internal channel that it uses to store the updates to send. To prevent overflow, it also performs a syncBatch periodically. It's possible that you hit the overflow situation if many pods had status changes during that window.

It's possible that you hit the overflow situation if many pods had status changes during that window.

@wojtek-t you should see Skpping the status update for pod %q for now because the channel is full; status: %+v in the logs before seeing Status Manager: adding pod: ... if that is the case.

That means that the kubelet has the up-to-date status stored in the status manager; it doesn't mean that the status manager has already sent updates to the apiserver.

Right, but I don't see a log anywhere prior to that one indicating that status has already been sent to the status manager, so how would it have that update in its cache?

I'd expect to see either this log or this log if the status was already in the queue (from when it was added to the queue), but I see neither. Or am I missing something?

Is that the first status update for that pod that you see?

@dashpole it's not - I do see a status update occur when the kubelet receives the Pending Pod a little bit before.

May 15 20:28:52 casey-test-default-pool-b0d6f3ea-wqtj kubelet[6842]: I0515 20:28:52.848200    6842 status_manager.go:340] Status Manager: adding pod: "1c4d94dd-39ad-11e7-a787-42010a800122", with status: ('\x01', {Pending [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2017-05-15 20:28:52 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2017-05-15 20:28:52 +0000 UTC ContainersNotReady containers with unready status: [ttfp]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2017-05-15 20:28:52 +0000 UTC  }]   10.240.0.45  2017-05-15 20:28:52 +0000 UTC [] [{ttfp {&ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 calico/ttfp:v0.6.0  }] Burstable}) to podStatusChannel
May 15 20:28:52 casey-test-default-pool-b0d6f3ea-wqtj kubelet[6842]: I0515 20:28:52.935286    6842 status_manager.go:147] Status Manager: syncing pod: "1c4d94dd-39ad-11e7-a787-42010a800122", with status: (1, {Pending [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2017-05-15 20:28:52 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2017-05-15 20:28:52 +0000 UTC ContainersNotReady containers with unready status: [ttfp]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2017-05-15 20:28:52 +0000 UTC  }]   10.240.0.45  2017-05-15 20:28:52 +0000 UTC [] [{ttfp {&ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 calico/ttfp:v0.6.0  }] Burstable}) from podStatusChannel
May 15 20:28:52 casey-test-default-pool-b0d6f3ea-wqtj kubelet[6842]: I0515 20:28:52.998079    6842 status_manager.go:444] Status for pod "client0-cali-ns-0-3203360797-3mk9m_cali-ns-0(1c4d94dd-39ad-11e7-a787-42010a800122)" updated successfully: {status:{Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:{Time:{sec:0 nsec:0 loc:0x4d1fec0}} LastTransitionTime:{Time:{sec:63630476932 nsec:0 loc:0x4d1fec0}} Reason: Message:} {Type:Ready Status:False LastProbeTime:{Time:{sec:0 nsec:0 loc:0x4d1fec0}} LastTransitionTime:{Time:{sec:63630476932 nsec:0 loc:0x4d1fec0}} Reason:ContainersNotReady Message:containers with unready status: [ttfp]} {Type:PodScheduled Status:True LastProbeTime:{Time:{sec:0 nsec:0 loc:0x4d1fec0}} LastTransitionTime:{Time:{sec:63630476932 nsec:0 loc:0x4d1fec0}} Reason: Message:}] Message: Reason: HostIP:10.240.0.45 PodIP: StartTime:0xc4214e56c0 InitContainerStatuses:[] ContainerStatuses:[{Name:ttfp State:{Waiting:0xc4214e5680 Running:<nil> Terminated:<nil>} LastTerminationState:{Waiting:<nil> Running:<nil> Terminated:<nil>} Ready:false RestartCount:0 Image:calico/ttfp:v0.6.0 ImageID: ContainerID:}] QOSClass:Burstable} version:1 podName:client0-cali-ns-0-3203360797-3mk9m podNamespace:cali-ns-0}

So I added some additional debug logging and I think I'm seeing the pod status channel start to back up due to more incoming status updates than are being pulled off.

For example, I see one update get queued behind 55 other pod status updates. It eventually gets synced as part of a batch 30s later.

Then, the sync thread finally starts pulling all the updates off the channel and notices that the status from the channel is up-to-date.

i.e. I get a LOT of these in a row:

Status for pod "7c31c4a4-4720-11e7-a47a-42010a800002" is up-to-date; skipping

So the next question I had was "why isn't the status thread pulling the updates off the channel quickly enough?"

Looks like I'm seeing this order of events:

  • 00:51:11.132894: Status manager starts a batch sync
  • 00:51:14.870472: Status manager receives the Pod status update
  • 00:51:37.936072: Status manager finishes the batch sync.
  • 00:51:37.936249: Status manager syncs a single Pod status update.
  • 00:51:37.936418: Status manager starts another batch sync.
  • 00:51:50.336730: Status manager syncs the Pod status from 00:51:14 during the batch
  • 00:51:52.530999: Status manager finishes the second batch sync.

So, it looks like it's taking a long time because the status update is received in the middle of a pretty long batch sync and thus isn't picked up until the next batch sync. But why is the batch sync taking so long?

I'm starting about 300 pods across 10 nodes when I see this happen.
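
For context on that ordering: the status manager's sync loop handles one thing at a time, so an update that arrives during a long syncBatch waits for the batch to finish. A rough, paraphrased sketch (simplified from pkg/kubelet/status/status_manager.go of that era; names and types are not the real ones):

// Paraphrased sketch of the status manager's main loop.
package status

import "time"

type manager struct {
	podStatusChannel chan struct{} // one entry per queued status update
	syncPod          func()        // push a single pod status to the apiserver
	syncBatch        func()        // reconcile all pod statuses with the apiserver
}

func (m *manager) start() {
	syncTicker := time.Tick(10 * time.Second)
	go func() {
		for {
			select {
			case <-m.podStatusChannel:
				m.syncPod()
			case <-syncTicker:
				// A long syncBatch (e.g. many pods throttled by the client
				// QPS limit) blocks this loop, so updates queued meanwhile
				// sit in the channel until the batch completes.
				m.syncBatch()
			}
		}
	}()
}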

So the next question I had was "why isn't the status thread pulling the updates off the channel quickly enough?"

Most likely caused by the QPS limit I mentioned in #39113 (comment)
(ref: #14955)

Most likely caused by the QPS limit I mentioned in

I thought this as well, but I'm seeing this behavior even when I set the --event-qps=0 option on the kubelet which I would expect to disable the rate limiting.

d'oh, thanks. I totally missed those options... I'll give it another run and see if that makes things better.

I re-tested with all the above flags set to 1000; results look significantly better, but I'm still seeing this take a while.

In my tests, the 99th percentile time drops from ~50 seconds to ~10 seconds by adjusting the QPS. A quick look at the logs seems to indicate that the channel is no longer filling up and we're not relying on the batch sync anymore, which is good, but it's still taking almost 10 seconds for the status to reach the status manager in the first place. I'll need to investigate this part now.

As far as defining acceptable latency, I think we need at least a 99th percentile of < 5s.

As far as defining acceptable latency, I think we need at least a 99th percentile of < 5s.

@caseydavenport, how many pods did you create in one shot? kubelet/docker is known to slow down during batch pod creation/deletion.

Do we have any sort of priority queue for Kubelet -> API ops?

No, but we have a separate client for events.

May I ask why this is a 1.7 issue? I am not sure what we want to achieve here for 1.7...

I'm starting about 300 pods across 10 nodes when I see this happen.

I essentially am scaling up a Deployment from 0->300 when I see this happen.

I see similar results scaling 0->30 on a single-node cluster.

Perhaps the SLO here should include number of new replicas per node per second (churn)?

+1.

I believe so far we've only measured single-pod startup time on a steady node (w/ N pods running) in our e2e perf tests, so it's not surprising if the kubelet can't meet your latency goal. If we want to set new objectives, they should probably be discussed more widely (and w/ the scalability sig).

I also agree that it's not a very realistic scenario. You can indeed try starting a lot of pods at the same time, but not necessarily on the same node. And yes - we should definitely discuss it in the context of the new/revisited upcoming SLIs/SLOs.

I guess it depends on what we mean by "realistic". Do we mean not regarded as "steady state" that the SLOs assume? Or do we mean won't commonly be something users encounter?

If the latter, then I think having a k8s cluster running near resource capacity with ~30 pods per node is realistic. And nodes in a k8s cluster unexpectedly restarting (perhaps after being preempted, or a docker or kubelet restart) is realistic. And in that scenario, ~30 pods being scheduled to a single node quickly would also be realistic? Or maybe we assume that the user will have enough capacity on the other nodes that when the restarting node comes back, nothing gets scheduled to it until something else changes?

should probably be discussed and included in the k8s SLO doc (http://go/k8s-slo) @wojtek-t drafted

Is that the right link?

Is that the right link?

I pasted the wrong link, but I don't think the doc is public anyway even though it said "shared publicly". I removed the link to not mislead more people into clicking it.

We could do it if we define a more relaxed goal (not literally at the same time) AND invest more efforts into optimizing the architecture/implementation. All I wanted to say is that this deserves a wider discussion and may not be easily achievable in a short time frame.

And in the cases you described, kubelet does try to start all the pods as soon as possible. It just doesn't complete (or report completed) in 5s, and that's something we currently don't support.

Updates on this?

I think the latest thinking here is that the expectation that the kubelet will keep up with the scheduler on small (<10 nodes) clusters is unrealistic and that as a result this isn't as serious as originally thought.

As for short-term actions, we should update the SLOs to indicate churn rate bounds.

Deciding on the right long-term solution is a wider discussion, and is certainly not going to happen in the v1.7 timeframe, so we should remove this from the milestone.

Deciding on the right long-term solution is a wider discussion, and is certainly not going to happen in the v1.7 timeframe, so we should remove this from the milestone.

Agreed. Milestone removed.

I revived one of my old refactoring PRs that seemed like it could help in this situation. Feel free to take a look if you are curious. #47564

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

FWIW we're seeing multi-minute queuing of updates due to this on dense nodes (64 cores, 20-30 pods in a batch per node) - and this results in failures because until the status updates are pushed to the API server, kube-dns can't publish the endpoint, and other services being brought up can't use the new pods.

Three things are triggering this for us:

  • test builds (spawn off a bunch of microservices - about 30 cores worth of limits)
  • autoscale scale-down events (kill off a node that's 40% full and you get ~40 pods rescheduled, and sometimes that's all landing on one other largely empty high-density node)
  • spot instance churn

This is a 40 node cluster, well ahead of that <10 threshold mentioned earlier.

docker is keeping up with the new load rate for us; the problem seems to be entirely this interaction within the kubelet.

Can we resurrect this or start a new bug, @caseydavenport?

/reopen

@aermakov-zalando: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

We've just been bitten by something similar on 1.13.5. On some of the nodes, kubelet was starting the init containers (and posting the events about it), but pod statuses weren't updated at all for 3-4 minutes.

Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:16:51Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.11", GitCommit:"637c7e288581ee40ab4ca210618a89a555b6e7e9", GitTreeState:"clean", BuildDate:"2018-11-26T14:25:46Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Using Calico CNI v1.11.2

I'm working on what looks like a related issue with kube2iam.
Kube2iam is a tool that intercepts requests to the AWS metadata API, uses the podIP to retrieve pod info from its events cache, and based on a pod annotation returns the right credentials to the pod.

Looking at the metrics I've noticed that sometimes there were misses in the events cache; to solve those cases I've decided to query the API server.

I've noticed that for brand new containers the API server and the kubelet, when queried, will not return the pod IP.
I'm running a test branch in my cluster and I can see that right after the pod starts there is an init loop to get some resources from the AWS API; the call gets intercepted by kube2iam:

time="2020-03-13T20:17:57Z" level=info msg="getPodFromAPIByIP: kubelet Searching IP 100.112.72.199 for: spec.nodeName==ip-XXX.us-west-2.compute.internal"
time="2020-03-13T20:17:57Z" level=info msg="getPodFromAPIByIP: kubelet Found container olfactory-eel-k8s-spot-termination-handler-9bkgj with IP: 100.112.72.194"
time="2020-03-13T20:17:57Z" level=info msg="getPodFromAPIByIP: kubelet Found container node-exporter-65bkz with IP: 10.4.81.144"
time="2020-03-13T20:17:57Z" level=info msg="getPodFromAPIByIP: kubelet Found container sandbox11-default-7bc84875-w644t with IP: "
time="2020-03-13T20:17:57Z" level=info msg="getPodFromAPIByIP: kubelet Found container kube-proxy-ip-10-4-81-144.us-west-2.compute.internal with IP: "
time="2020-03-13T20:17:57Z" level=info msg="getPodFromAPIByIP: kubelet Found container ignoble-detector-79fqr with IP: 10.4.81.144"
time="2020-03-13T20:17:57Z" level=info msg="getPodFromAPIByIP: kubelet Found container calico-node-cdnnj with IP: 10.4.81.144"
time="2020-03-13T20:17:57Z" level=info msg="getPodFromAPIByIP: kubelet Found container service-web-97c47764c-l8rvh with IP: "
time="2020-03-13T20:17:57Z" level=info msg="getPodFromAPIByIP: kubelet Found container quaffing-pronghorn-overprovisioner-9b848f7b9-pldxg with IP: "
time="2020-03-13T20:17:57Z" level=info msg="getPodFromAPIByIP: kubelet Found container logging-m4fqf with IP: 100.112.72.197"
time="2020-03-13T20:17:57Z" level=info msg="getPodFromAPIByIP: kubelet Found container ignoble-bunny-monitor-hv9n5 with IP: 100.112.72.195"
time="2020-03-13T20:17:57Z" level=info msg="getPodFromAPIByIP: kubelet Found container coredns-ds-6mzlw with IP: 10.4.81.144"
time="2020-03-13T20:17:57Z" level=info msg="getPodFromAPIByIP: kubelet Found container rolling-lamb-kube2iam-78kqt with IP: 10.4.81.144"
time="2020-03-13T20:17:57Z" level=info msg="getPodFromAPIByIP: kubelet Found container monitoring-collector-sfzld with IP: 100.112.72.192"
kubectl get pods --field-selector=status.podIP=100.112.72.199 --all-namespaces
NAMESPACE                      NAME                                               READY   STATUS    RESTARTS   AGE
staging   service-web-97c47764c-l8rvh   2/2     Running   0          21m

Same result querying the API server or the local kubelet API.
Another strange thing is the missing IPs for the other pods as well; they were all started around the same time:

kubectl get pod quaffing-pronghorn-overprovisioner-9b848f7b9-pldxg  -o wide
NAME                                                 READY   STATUS    RESTARTS   AGE   IP               NODE
quaffing-pronghorn-overprovisioner-9b848f7b9-pldxg   1/1     Running   0          23m   100.112.72.201   ip-XXus-west-2.compute.internal

kubectl get pod sandbox11-default-7bc84875-w644t -o wide
NAME                                                      READY   STATUS    RESTARTS   AGE   IP               NODE
sandbox11-default-7bc84875-w644t    1/1     Running   0          26m   100.112.72.200   ip-XX.us-west-2.compute.internal

It looks like it really takes quite some time for the kubelet to update its info

Is there a way I can get the pod by its podIP deterministically?
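
For reference, the lookup being described (pod by status.podIP via a field selector, the client-go equivalent of the kubectl command above) looks roughly like this; the catch discussed in this thread is that status.podIP stays empty until the kubelet's status update reaches the apiserver, so the lookup is only as fresh as the status report:

package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes in-cluster credentials; use clientcmd for out-of-cluster use.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Equivalent of: kubectl get pods --field-selector=status.podIP=<ip> --all-namespaces
	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "status.podIP=100.112.72.199",
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("%s/%s %s\n", p.Namespace, p.Name, p.Status.PodIP)
	}
}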

This is consistent with the previous analysis here, but this ticket isn't open; unless you can get the filer or a collaborator to reopen it, you may be better off opening a new ticket and pointing at this one from there.

Thank you @rbtcollins, I was hoping to re-open this so that everybody involved gets notified.
@thockin maybe you can re-open?

/remove-lifecycle rotten

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/reopen

this is still an issue

@caseydavenport: Reopened this issue.

In response to this:

/reopen

this is still an issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/reopen

this is still an issue

@fasaxc: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

this is still an issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/reopen

@caseydavenport: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/triage accepted

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/reopen

@fasaxc: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/reopen

@caseydavenport: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/reopen

@wreed4: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Is there any likelihood that this will get actioned and fixed? In particular, using https://argoproj.github.io/argo-workflows/, we run hundreds or thousands of workflows at one time, which launch hundreds or thousands of pods onto ~300 nodes. It's very common that >50 of them will get scheduled onto a node at one time. Furthermore, these pods are short-lived, so they are replaced by other pods very quickly. The kubelet starts to slow down very quickly.

I've found that setting the qps limit much higher than the default fixes this case, but it would be good to remove the race condition entirely.

@dims: Reopened this issue.

In response to this:

/reopen

cc @SergeyKanzhelev @ehashman

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Note that the AWS VPC CNI documentation released on 19 Oct 2021 describes a workaround for this issue in versions released in the last six months (1.9.3+).

https://github.com/aws/amazon-vpc-cni-k8s#annotate_pod_ip-v193

ANNOTATE_POD_IP (v1.9.3+)
Type: Boolean as a String

Default: false

Setting ANNOTATE_POD_IP to true will allow IPAMD to add an annotation vpc.amazonaws.com/pod-ips to the pod with pod IP.

There is a known #39113 with kubelet taking time to update Pod.Status.PodIP leading to calico being blocked on programming the policy. Setting ANNOTATE_POD_IP to true will enable AWS VPC CNI plugin to add Pod IP as an annotation to the pod spec to address this race condition.

To annotate the pod with pod IP, you will have to add "patch" permission for pods resource in aws-node clusterrole. You can use the below command -

cat << EOF > append.yaml
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - patch
EOF
kubectl apply -f <(cat <(kubectl get clusterrole aws-node -o yaml) append.yaml)
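
For consumers that currently race on status.podIP, the workaround means the IP can be read from the annotation instead. A hedged sketch of that fallback (the annotation key vpc.amazonaws.com/pod-ips comes from the AWS documentation quoted above; the helper and its fallback logic are illustrative):

package podip

import v1 "k8s.io/api/core/v1"

// effectivePodIP prefers the annotation written early by the AWS VPC CNI when
// ANNOTATE_POD_IP=true, and falls back to status.podIP, which can lag while
// the kubelet's status update is still queued.
func effectivePodIP(pod *v1.Pod) string {
	if ip, ok := pod.Annotations["vpc.amazonaws.com/pod-ips"]; ok && ip != "" {
		return ip
	}
	return pod.Status.PodIP
}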

Note to future people who are having this issue: you should probably open a new issue instead of recycling this one, as the original bug filed is from 2016 and the cause of the problem is likely something different.

@bowei although this is a 2016 issue, that AWS VPC CNI 1.9.3 release was only a few months ago (19 Oct 2021) and specifically cited this issue #39113 as the issue it fixed, so it seems likely to me to be directly relevant for peeps coming here?

There is a known #39113 with kubelet taking time to update Pod.Status.PodIP leading to calico being blocked on programming the policy. Setting ANNOTATE_POD_IP to true will enable AWS VPC CNI plugin to add Pod IP as an annotation to the pod spec to address this race condition.

might be related to #85966

@bowei Wearing my skeptic hat: after fighting with this 4 years ago and still seeing it from time to time (different org now, but we track pod bring-up latencies as an internal SLO), there's been no evidence that this has been fixed systematically rather than finding new ways to the same symptoms. Instead, explicitly, this is just bitrotting because no one with enough cycles to tackle it has done so (e.g. a drive-by contributor with time to drive the k8s process, or a core contributor who wants to fix it).

I think that the first step is to add a metric, so users can report more accurate information

I think this is how it goes:

IPs are determined here

podIPs = m.determinePodSandboxIPs(pod.Namespace, pod.Name, resp.GetStatus())
klog.V(4).InfoS("Determined the ip for pod after sandbox changed", "IPs", podIPs, "pod", klog.KObj(pod))

it's being patched here

newPod, patchBytes, unchanged, err := statusutil.PatchPodStatus(m.kubeClient, pod.Namespace, pod.Name, pod.UID, pod.Status, mergedStatus)
klog.V(3).InfoS("Patch status for pod", "pod", klog.KObj(pod), "patch", string(patchBytes))

@dgrisonnet are you familiar with the metrics on the kubelet?
^^^ This process seems to cross several boundaries in the kubelet logic.
Is it possible to have such a metric?

I am not familiar enough with the kubelet code to be able to say whether that could be possible or not, but theoretically, it would make sense in my opinion to record the timestamp at which the runtime assigned an IP to the pod and convey this information to the status manager to then create a histogram based on the time passed since the IP was assigned to the pod.

After looking a bit at the code, in practice it seems a bit difficult to wire that up, since the pod status manager is an independent process and the only way to convey information to it seems to be via the versionedPodStatus structure. But I am unsure if that really makes sense since I am not familiar with that code, so we will need people from node to chime in.

Overall this sounds like a good observability improvement that is worth its own separate issue.
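
A rough sketch of the kind of metric being discussed (the metric name, buckets, and wiring are hypothetical; the actual effort is tracked in the issue linked in the next comment):

package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical metric: the name, help text, and buckets are illustrative only.
var podIPReportLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "kubelet_pod_ip_status_report_duration_seconds",
	Help:    "Time from the runtime reporting a pod IP to the status patch reaching the apiserver.",
	Buckets: prometheus.ExponentialBuckets(0.1, 2, 10),
})

func init() {
	prometheus.MustRegister(podIPReportLatency)
}

// ObservePodIPReported would be called after a successful PatchPodStatus,
// with the timestamp captured when determinePodSandboxIPs returned the IPs.
func ObservePodIPReported(ipAssignedAt time.Time) {
	podIPReportLatency.Observe(time.Since(ipAssignedAt).Seconds())
}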

created an issue to track the effort #110815

Any update on this?

Any update on this?

There are metrics now; if someone reports the issue with the metrics and the logs, it is possible to investigate it. But as this comment correctly points out (#39113 (comment)), it is better to open a new issue to avoid misunderstanding.