ip-reconciler fails with "context deadline exceeded" while listing IPPools

Question

xagent003 opened this issue 3 years ago

The ip-reconciler is failing repeatedly (this is not a rare, one-off Job failure). The pod is in CrashLoopBackOff state, and the container logs show:

2022-01-05T07:25:24Z [debug] NewReconcileLooper - Kubernetes config file located at: /host/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig

2022-01-05T07:25:25Z [debug] successfully read the kubernetes configuration file located at: /host/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig

2022-01-05T07:25:34Z [debug] listing IP pools

2022-01-05T07:25:35Z [error] failed to retrieve all IP pools: context deadline exceeded

2022-01-05T07:25:35Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded

The API server is up and no other workloads show this type of error. The resources can be fetched with kubectl get ippools.whereabouts.cni.cncf.io, either listing all pools or showing a specific one in detail.

Does the context timeout need to be reset, or is the initial timeout value too small? Should we pass in a dummy context.TODO() before making the client call to get the IPPool?

Arjun Baindur · Answer 1 · Sat Jan 08 2022 02:37:05 GMT+0800 (China Standard Time)

The ListPods may be taking a long time, 9 seconds in this case. Should we increase the timeout or should we be generating a new context for each client API call that takes in a context?

@maiqueb ?
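
To make the two options concrete, here is a minimal sketch of the difference between sharing one timeout context across consecutive calls and creating a fresh one per call. This is not the whereabouts code: slowCall, sharedDeadline and perCallDeadline are hypothetical names, and the durations are scaled-down stand-ins rather than the real timings from the log.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// slowCall stands in for a client API call (hypothetical helper).
// It needs `work` to finish and gives up early if ctx expires first.
func slowCall(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// sharedDeadline runs both calls under one context, so the second call
// only gets whatever time the first call left over.
func sharedDeadline(budget, first, second time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()
	if err := slowCall(ctx, first); err != nil {
		return fmt.Errorf("first call: %w", err)
	}
	if err := slowCall(ctx, second); err != nil {
		return fmt.Errorf("second call: %w", err)
	}
	return nil
}

// perCallDeadline gives every call its own context with the full budget.
func perCallDeadline(budget, first, second time.Duration) error {
	for i, work := range []time.Duration{first, second} {
		ctx, cancel := context.WithTimeout(context.Background(), budget)
		err := slowCall(ctx, work)
		cancel() // release the timer as soon as the call returns
		if err != nil {
			return fmt.Errorf("call %d: %w", i+1, err)
		}
	}
	return nil
}

func main() {
	// Illustrative durations only: each call fits the budget on its own,
	// but the two together do not.
	budget := 200 * time.Millisecond
	first, second := 120*time.Millisecond, 150*time.Millisecond

	fmt.Println("shared:  ", sharedDeadline(budget, first, second))
	fmt.Println("per-call:", perCallDeadline(budget, first, second))
}
```

With these durations the shared variant should fail on the second call with context deadline exceeded, while the per-call variant should return nil. Raising the single timeout would also make the shared variant pass, but then one slow call can still eat the budget of every call after it.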

Miguel Duarte Barroso · Answer 2 · Fri Feb 04 2022 21:36:55 GMT+0800 (China Standard Time)

> The ListPods may be taking a long time, 9 seconds in this case. Should we increase the timeout or should we be generating a new context for each client API call that takes in a context?
>
> @maiqueb ?

When I wrote this, I was sloppy enough to use context.TODO when listing the pods. I honestly think that is the single reason why we haven't seen this issue before - and at a larger scale.
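
For anyone following along, a small sketch of why a call made with context.TODO can never hit a deadline itself: that context carries no deadline and is never cancelled. The helpers slowList, listWithTODO and listWithTimeout are made up for illustration and are not the reconciler code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// slowList stands in for a slow List call (hypothetical helper).
func slowList(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done(): // for context.TODO() this channel is nil and never fires
		return ctx.Err()
	}
}

// listWithTODO runs the call under context.TODO(), which has no deadline,
// so the call always runs to completion however long it takes.
func listWithTODO(work time.Duration) error {
	return slowList(context.TODO(), work)
}

// listWithTimeout runs the same call under a context that expires after budget.
func listWithTimeout(budget, work time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()
	return slowList(ctx, work)
}

func main() {
	_, hasDeadline := context.TODO().Deadline()
	fmt.Println("context.TODO() has a deadline:", hasDeadline)
	fmt.Println("TODO context, 150ms of work:", listWithTODO(150*time.Millisecond))
	fmt.Println("50ms budget, 150ms of work: ", listWithTimeout(50*time.Millisecond, 150*time.Millisecond))
}
```

The call under context.TODO() finishes without error no matter how slow it is, but any timeout context created earlier keeps counting down in the meantime.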

Would you check #186? Odds are it fixes this issue.