kiwigrid / k8s-sidecar

This is a docker container intended to run inside a kubernetes cluster to collect config maps with a specified label and store the included files in a local folder.

Home Page: https://hub.docker.com/r/kiwigrid/k8s-sidecar


Watching stops working after 10 minutes

PeterGerrard opened this issue

When watching a set of ConfigMaps, the sidecar stops being notified of new changes after 10 minutes of no changes.

## Repro steps

  1. Install the following ConfigMap into the cluster:
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-test-dashboard
  labels:
    grafana_dashboard: "1"
data:
  cm-test.json: "{}"
  2. Install the sidecar into the cluster:
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
    - env:
        - name: METHOD
        - name: LABEL
          value: grafana_dashboard
        - name: FOLDER
          value: /tmp/dashboards
        - name: RESOURCE
          value: both
      image: kiwigrid/k8s-sidecar:0.1.178
      imagePullPolicy: IfNotPresent
      name: grafana-sc-dashboard
  3. Wait 10 minutes.
  4. Make a change to the ConfigMap and apply the update in the cluster (for example, as in the sketch below).
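
For step 4, the change can be applied with kubectl apply or, equivalently, via the Kubernetes Python client. The following is only a minimal sketch; it assumes a local kubeconfig, the default namespace, and an illustrative payload:

from kubernetes import client, config

# Minimal sketch of repro step 4: patch the ConfigMap so the sidecar should
# receive a MODIFIED event. Assumes a local kubeconfig and the "default"
# namespace; the payload is illustrative.
config.load_kube_config()
v1 = client.CoreV1Api()

v1.patch_namespaced_config_map(
    name="grafana-test-dashboard",
    namespace="default",
    body={"data": {"cm-test.json": '{"title": "changed"}'}},
)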

Expected Behaviour

A modification is detected and the updated file appears in /tmp/dashboards.

Actual Behaviour

Nothing


Reproduced on AKS with Kubernetes version 1.16.10.

I have the same issue with AKS 1.16.10 and sidecar 0.1.151.
When the sidecar started up, it ran fine; however, the loop failed after 30 minutes with this error:
[2020-08-27 15:25:47] ProtocolError when calling kubernetes: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

@monotek Could you please take care of it?

Please try with current version 0.1.193.


> Please try with current version 0.1.193.

The same issue.

As a workaround I'm using SLEEP mode, which actually polls the kube-api.

But what's causing the problem in WATCH is currently unclear to me. Maybe the k8s client used in the sidecar is having issues...
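
For reference, the SLEEP workaround boils down to a poll loop against the kube-api instead of a long-lived watch; roughly like this (a simplified sketch, not the sidecar's actual code, using the LABEL and FOLDER env vars from the pod spec above with illustrative defaults and interval):

import os
import time

from kubernetes import client, config

# Simplified sketch of SLEEP-style polling: list matching ConfigMaps on a
# fixed interval and (re)write their files.
config.load_incluster_config()
v1 = client.CoreV1Api()

label = os.environ.get("LABEL", "grafana_dashboard")
folder = os.environ.get("FOLDER", "/tmp/dashboards")
os.makedirs(folder, exist_ok=True)

while True:
    for cm in v1.list_config_map_for_all_namespaces(label_selector=label).items:
        for filename, content in (cm.data or {}).items():
            with open(os.path.join(folder, filename), "w") as f:
                f.write(content)
    time.sleep(60)  # poll interval; note that deletions are never noticed here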

I created #90, maybe this helps 🤞🏼

@auroaj & @PeterGerrard
#90 was merged.
Please try with 0.1.209.

@axdotl
Since you mentioned the SLEEP method, do you think this could help too: https://github.com/kiwigrid/k8s-sidecar/pull/88/files ?

Any updates on this? I'm getting the same issue and the sidecar stops working after a couple of minutes.

No, as you can see there was no feedback if it works with image 0.1.209.

> No, as you can see there was no feedback if it works with image 0.1.209.

I have tried it with image 0.1.209, and it doesn't work

Does anyone of you know what's the last working version?

I think it is a general issue. It ran a long time with 0.1.20, where the issue also occurs... maybe it is more related to changes in the k8s API.

So the last working k8s version would be interesting too.

The title says it stops working after 10 minutes. I adjusted the test to add a new configmap after 11 minutes and it works with all k8s versions. Can anyone say after what time they observed the problem?

My assumption is that this is related to interruptions of the connection to the kube-api. These might cause the resource watching to stop.


> The title says it stops working after 10 minutes. I adjusted the test to add a new configmap after 11 minutes and it works with all k8s versions. Can anyone say after what time they observed the problem?

3-4 hours for me.


Checked with kiwigrid/k8s-sidecar:1.1.0 and AKS K8s Rev: v1.17.11.
The same issue.

There is an interesting fix in the Kubernetes Python client v12.0.0. From https://github.com/kubernetes-client/python/blob/release-12.0/CHANGELOG.md:

Retry expired watches kubernetes-client/python-base#133

Thanks for merging. I updated the deployment yesterday to the docker image tag 0.1.259 and this morning, 15 hours later, it still detects modifications on configmaps 👍
There is also a change in the log. About every 30 to 60 minutes there's an entry like:

[2020-10-28 07:29:07] ApiException when calling kubernetes: (410)
Reason: Expired: too old resource version: 194195784 (199007602)

And so the resource watcher gets restarted.
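
In other words, the 410 appears to be handled by catching the expiry and opening a fresh watch; a rough sketch of that behaviour (not the sidecar's actual code; the label selector is illustrative):

from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException

# Sketch of restarting a watch after a 410 "resource version too old":
# catch the ApiException and open a new watch, which resyncs from the
# current state.
config.load_incluster_config()
v1 = client.CoreV1Api()

while True:
    try:
        stream = watch.Watch().stream(
            v1.list_config_map_for_all_namespaces,
            label_selector="grafana_dashboard",
        )
        for event in stream:
            print(event["type"], event["object"].metadata.name)
    except ApiException as e:
        if e.status == 410:  # expired watch: restart with a fresh resource version
            continue
        raise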

BTW, the tag 1.2.0 had a build error; that's why I used 0.1.259 from the CI build.


Hmmm...
Not in my case...
Still the same issue.
Also, I can't find this entry:

ApiException when calling kubernetes: (410)

I've checked twice with grafana-sc-dashboard kiwigrid/k8s-sidecar:0.1.259; it stops working.

K8s Rev: v1.17.11

I'll try to check it in another location.

I'm also still getting an issue with version 'k8s-sidecar:0.1.259'. It stops working after a couple of minutes


Checked in another location. Unfortunately, it was updated to v1.17.11 too.
I've got the same result: just several minutes (instead of hours with the previous version) before it stops.
And I've found this line in the logs in one of my environments (the only one):

[2020-10-28 14:03:21] ProtocolError when calling kubernetes: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

Are we sure that 0.1.259 matches the 1.2.0 release? The deploy stage failed: https://app.circleci.com/pipelines/github/kiwigrid/k8s-sidecar/47/workflows/f0000c91-ba71-42b7-828d-0f235915ab29/jobs/274

It seems to be.

The only commit in the 1.2.0 release was the kubernetes library bump to 12.0.0 and if you check the libraries in that image, you'll see that the kubernetes library is updated.

❯ docker run --rm -it kiwigrid/k8s-sidecar:0.1.259 sh
/app $ pip list | grep kubernetes
WARNING: The directory '/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
kubernetes        12.0.0

However, I still had the same issue as @qaiserali where it stopped working after a few minutes.

Any luck from anyone with a more recent version? We moved to the SLEEP method, but now the dashboards are not getting removed on ConfigMap deletion :(

@djsly We noticed exactly the same thing about dashboards not getting deleted.

With a specific use of this sidecar (Grafana) earlier today, this behaviour under SLEEP mode caused a denial of service on one of our Grafana environments when a ConfigMap was effectively renamed (same content, one ConfigMap deleted, another one added). The old dashboard not being deleted, combined with the new one being detected as not having a unique uid, led to grafana/grafana#14674. Apart from hundreds of thousands of logged Grafana errors, this seemingly caused the dashboard version to churn in the Grafana DB between two different versions, bloating our DB and eventually running out of space for the Grafana DB on our PVs. Cool. :-)

So we went back to WATCH mode and upgraded the sidecar to 1.3.0, which includes the kube client bump.

In our case WATCH mode was only occasionally losing connectivity/stopping watching, as far as we had noticed, rather than every 10 minutes/few hours as some people have observed here. Since we run on AWS EKS, one theory was that the watches might get terminated during control plane/master upgrades by AWS and not re-established reliably, but that was just a theory given how infrequently we had experienced the issues with WATCH. Will see how we go.

@chadlwilson running on AKS here, and I can guarantee you that the API server restarts and causes lots of disconnects when using the WATCH method; we cannot last one full 24 hours so far :( even with the bump to 1.3.0.

I can confirm the same behavior on AKS (1.17.13); WATCH triggers many warnings like:
grafana-77d75cd48c-fvbhj grafana-sc-dashboard [2020-12-22 16:42:40] ProtocolError when calling kubernetes: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

Even using kiwigrid/k8s-sidecar:1.3.0

Updated to SLEEP mode.

On our side we have been using a new sidecar: https://github.com/OmegaVVeapon/kopf-k8s-sidecar

It works like a charm so far; the only main difference is the added RBAC to allow the sidecar to patch the resources.

Cross posting from kubernetes/kube-state-metrics#694:

I'm starting to think this issue mainly happens on hosted Kubernetes services like AKS and EKS. There are many here using it without any issue, and there is another set of users who are repeatedly experiencing issues.

Also see this weird issue where watching suddenly stops for some set of users in a python implementation:
#85

Maybe the issue lies in the way the apiserver is hosted behind a managed load balancer.

Actually, let's do the same quick poll here among AKS/EKS users; can you vote on this entry with:

  • upvote (+1) if you are experiencing this issue on AKS/EKS
  • downvote (-1) if you are not experiencing this issue on AKS/EKS

I don't think it should matter how the API server is hosted. We're running this sidecar as part of Grafana in 1000+ clusters, all of which were set up by us (not a cloud provider) and were seeing intermittent failures until we just switched to polling every 1m. I can't say whether this was after 10m or longer as reloading was mostly irrelevant in our use-case.

I believe the cause would be an intermittent network connection error where a watch hasn't timed out but just dropped and the client does not attempt a reconnect. I believe the fix is described here: kubernetes-client/python#1148 (comment)
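
If that is the cause, then on top of the timeout tuning described in that comment, the watch loop itself would need to treat a dropped connection as a signal to reconnect. A hedged sketch of that idea (not the sidecar's actual code; label selector and timeout values are illustrative):

import urllib3
from kubernetes import client, config, watch

# Sketch: re-establish the watch when the connection drops mid-stream.
# timeout_seconds / _request_timeout follow the suggestion in
# kubernetes-client/python#1148; the values here are illustrative.
config.load_incluster_config()
v1 = client.CoreV1Api()

while True:
    try:
        for event in watch.Watch().stream(
            v1.list_config_map_for_all_namespaces,
            label_selector="grafana_dashboard",
            timeout_seconds=60,    # server-side: close the watch cleanly after 60s
            _request_timeout=66,   # client-side socket timeout
        ):
            print(event["type"], event["object"].metadata.name)
    except (urllib3.exceptions.ProtocolError, urllib3.exceptions.ReadTimeoutError):
        continue  # connection reset or dead socket: start a new watch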


@vsliouniaev

> I don't think it should matter how the API server is hosted. We're running this sidecar as part of Grafana in 1000+ clusters, all of which were set up by us (not a cloud provider) and were seeing intermittent failures until we just switched to polling every 1m. I can't say whether this was after 10m or longer as reloading was mostly irrelevant in our use-case.

Could you share your workaround/solution, please?

> Could you share your workaround/solution, please?

The default behaviour is to watch resources. This can be changed to SLEEP, which is polling.
See README.md, env var METHOD.

Drawback of SLEEP: deletions are not recognized.

I think there was a similar issue with the notifications API that affected aadpodidentity as well.
The fix was to update to the latest Kubernetes libraries, which had a fix for this, but I'm not sure whether that applies to the Python libraries as well. Either way, it doesn't work behind a load balancer at the moment.


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

I'm also seeing this issue. Running quay.io/kiwigrid/k8s-sidecar:1.12.3 on AKS. Did you all work around this issue somehow or should the issue be reopened?

Still experiencing this issue on AKS with k8s-sidecar:1.14.2.

Is there still the problem that when using LIST deleted resources are not deleted?

> Is there still the problem that when using LIST deleted resources are not deleted?

Yes. We're not keeping track of resources; with WATCH the sidecar follows the events k8s emits (create/update/delete), whereas SLEEP more or less acts like create/update all the time.
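
To illustrate the difference, a WATCH-based loop sees the event type for each change and can act on deletions, roughly like this (a simplified sketch, not the sidecar's exact code; folder and label are illustrative):

import os
from kubernetes import client, config, watch

# Simplified sketch: WATCH delivers an event type per change, so DELETED can
# be mapped to removing the local file, which a plain list-and-write poll
# (SLEEP) cannot do without extra bookkeeping.
config.load_incluster_config()
v1 = client.CoreV1Api()
folder = "/tmp/dashboards"

for event in watch.Watch().stream(
    v1.list_config_map_for_all_namespaces, label_selector="grafana_dashboard"
):
    cm = event["object"]
    for filename, content in (cm.data or {}).items():
        path = os.path.join(folder, filename)
        if event["type"] in ("ADDED", "MODIFIED"):
            with open(path, "w") as f:
                f.write(content)
        elif event["type"] == "DELETED":
            if os.path.exists(path):
                os.remove(path)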

Issue is reproducible with 1.14.2 sidecar as well.

[2021-11-10 03:47:48] ProtocolError when calling kubernetes: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

Thanks for the report. Unfortunately I have no idea what's going wrong here. Any ideas? The current state is that we pass a urllib Retry, configured by REQ_RETRY_TOTAL, REQ_RETRY_CONNECT, REQ_RETRY_READ, REQ_RETRY_BACKOFF_FACTOR and REQ_TIMEOUT, into the k8s client, so in general k8s communication should be retried. But from your report it seems that watching, i.e. streaming updates, is not really subject to those retries.

So I'd be happy to follow any pointers you might have.
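
For reference, the retry wiring described above roughly corresponds to building a urllib3 Retry from those env vars and handing it to the client configuration; this is only a sketch of the idea with illustrative defaults, and the sidecar's actual wiring may differ:

import os

from urllib3.util.retry import Retry
from kubernetes import client, config

# Sketch: build a urllib3 Retry from the REQ_RETRY_* env vars and attach it
# to the Kubernetes client configuration. This covers ordinary API calls;
# a long-lived watch stream that dies mid-read is not replayed by it.
# The default values below are illustrative.
config.load_incluster_config()
cfg = client.Configuration.get_default_copy()
cfg.retries = Retry(
    total=int(os.environ.get("REQ_RETRY_TOTAL", 5)),
    connect=int(os.environ.get("REQ_RETRY_CONNECT", 5)),
    read=int(os.environ.get("REQ_RETRY_READ", 5)),
    backoff_factor=float(os.environ.get("REQ_RETRY_BACKOFF_FACTOR", 0.2)),
)
client.Configuration.set_default(cfg)
v1 = client.CoreV1Api()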

Unfortunately I have no idea either.
It seems to be retrying but each retry seems to always fail with the same ProtocolError (see above).

Is anyone here familiar with how the network timeouts are configured when using watches/informers with client-go?
I previously missed this comment: #85 (comment)
I strongly believe this has something to do with the network level.

Ah, I always thought this repo was written in Go. Now I checked the code for the first time with @vsliouniaev's kubernetes-client/python#1148 (comment) comment in mind.
And I see that we can test this out pretty easily.

Apparently, details about these are now covered in https://github.com/kubernetes-client/python/blob/master/examples/watch/timeout-settings.md.
The server-side timeout has a default (a random value between 1800 and 3600 seconds) in the python library, but the client-side timeout seems to have None as the default.

We can give the below change a try here.

Update this part:

additional_args = {
    'label_selector': label_selector
}
if namespace != "ALL":
    additional_args['namespace'] = namespace
stream = watch.Watch().stream(getattr(v1, _list_namespace[namespace][resource]), **additional_args)

As this:

    additional_args = {
        'label_selector': label_selector,

        # Tune default timeouts as outlined in
        # https://github.com/kubernetes-client/python/issues/1148#issuecomment-626184613
        # https://github.com/kubernetes-client/python/blob/master/examples/watch/timeout-settings.md
        # I picked 60 and 66 due to https://github.com/nolar/kopf/issues/847#issuecomment-971651446

        # 60 is a polite request to the server, asking it to cleanly close the connection after that.
        # If you have a network outage, this does nothing.
        # You can set this number much higher, maybe to 3600 seconds (1h).
        'timeout_seconds': int(os.environ.get('WATCH_SERVER_TIMEOUT', 60)),

        # 66 is a client-side timeout, configuring your local socket.
        # If you have a network outage dropping all packets with no RST/FIN,
        # this is how long your client waits before realizing & dropping the connection.
        # You can keep this number low, maybe 60 seconds.
        '_request_timeout': int(os.environ.get('WATCH_CLIENT_TIMEOUT', 66)),
    }
    ...
    stream = watch.Watch().stream(getattr(v1, _list_namespace[namespace][resource]), **additional_args)

This is also effectively what the alternative kopf-based implementation does; also see nolar/kopf#585 for the historical context on these settings:
https://github.com/OmegaVVeapon/kopf-k8s-sidecar/blob/main/app/sidecar_settings.py#L58-L70

And, ironically, the kopf project (which OmegaVVeapon/kopf-k8s-sidecar is based on) currently has nolar/kopf#847 open, which seems to be related, but I guess that's another edge case. We have been using kopf in our AKS clusters without regular issues, but this particular kiwigrid/k8s-sidecar issue is quite frequent.

I'll try to give my suggestion above a chance if I can reserve some time, but given that we already have a workaround in place (grafana/helm-charts#18 (comment)), it likely won't be soon.

@bergerx Thanks a lot for this detailed analysis. I took the opportunity and incorporated your proposal in a PR. Hopefully we can get this issue fixed with it. 🤞