nolar / kopf

A Python framework to write Kubernetes operators in just a few lines of code

Home Page: https://kopf.readthedocs.io/

kopf does not fire handlers after re-authentication

lkoniecz opened this issue

Long story short

The operator process had been running for a while.
Re-authentication then ran, and since then kopf has not reacted to any resources being created.

Kopf version

1.36.1

Kubernetes version

1.24

Python version

3.9

Code

default @kopf.on.startup()

Logs

[2023-07-11 14:04:20,360] kopf.objects         [DEBUG   ] [lukasz-operator/health] Handling cycle is finished, waiting for new changes.
[2023-07-11 14:39:56,795] kopf.objects         [DEBUG   ] [lukasz-operator/hello-world-1] Adding the finalizer, thus preventing the actual deletion.
[2023-07-11 14:39:56,796] kopf.objects         [DEBUG   ] [lukasz-operator/hello-world-1] Patching with: {'metadata': {'finalizers': ['kopf.zalando.org/KopfFinalizerMarker']}}
[2023-07-11 14:39:57,265] kopf._core.engines.a [INFO    ] Re-authentication has been initiated.
[2023-07-11 14:39:57,266] kopf.activities.auth [DEBUG   ] Activity 'login_via_client' is invoked.
[2023-07-11 14:39:58,376] kopf.activities.auth [DEBUG   ] Client is configured via kubeconfig file.
[2023-07-11 14:39:58,377] kopf.activities.auth [INFO    ] Activity 'login_via_client' succeeded.
[2023-07-11 14:39:58,377] kopf._core.engines.a [INFO    ] Re-authentication has finished.

Additional information

No response

I suspect the issue I have in 1.36.1 has the same root cause (whatever it might be).

I have an aiocron task that runs regularly to refresh credentials from a 3rd party API in a Memo.

Works on 1.36.0, not on 1.36.1 🤔

I'm having the same problem.

After several hours of debugging I think I found the reason.

kopf initiates re-authentication when no usable ConnectionInfo objects exist in its so-called Vault.

This can happen either when kopf gets an 'unauthorized' error from the API server or when the ConnectionInfo in use has expired. kopf removes the no-longer-working ConnectionInfo objects from its Vault and, when doing so, also calls .close() on the underlying aiohttp.ClientSession. The next task that needs a ConnectionInfo will trigger the re-authentication.

The problem now is that the old aiohttp.ClientSession instance is closed, but there may still be open aiohttp.ClientResponse objects in the system that depend on that session. The tasks reading from those responses will hang until they are interrupted in some way, e.g. by a connection reset or a timeout. Depending on your settings, this can take a long time or forever.
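To illustrate the hang in isolation, here is a minimal sketch that uses only asyncio. It does not touch kopf or aiohttp: the never-set event is just a stand-in (my assumption) for a response that will never deliver data again, and `read_stalled_stream` is a made-up name.

```python
import asyncio


async def read_stalled_stream(never_ready: asyncio.Event) -> str:
    # Stand-in for a watcher reading from a response whose session is gone:
    # nothing will ever arrive, so this await never completes on its own.
    await never_ready.wait()
    return "event received"


async def main() -> str:
    never_ready = asyncio.Event()
    try:
        # Only an external interruption (here: a timeout) ends the wait.
        return await asyncio.wait_for(read_stalled_stream(never_ready), timeout=0.1)
    except asyncio.TimeoutError:
        return "interrupted by timeout"


result = asyncio.run(main())
print(result)  # interrupted by timeout
```

Without the `wait_for` wrapper, `main()` would never return, which is the behaviour seen in the operator.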

You can work around this by setting a low client_timeout, e.g.

settings.watching.client_timeout = 60

With this setting your operator should recover after that timeout has expired.
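For context, this is roughly where that line would go, assuming you configure the operator in a startup handler as described in the kopf documentation. The handler name `configure` is arbitrary.

```python
import kopf


@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # Client-side timeout (in seconds) for each watch request. A stalled
    # watch stream is abandoned and re-established once this expires.
    settings.watching.client_timeout = 60
```

A lower value means faster recovery from the hang, at the cost of more frequent reconnects of the watch streams.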

I have a patch that fixes the problem by keeping track of all the unclosed response objects so that they can be properly closed before the session is closed. With this patch, I can no longer reproduce the deadlock.
I'll test it some more and then submit a PR.
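The idea can be sketched roughly like this. This is not the actual patch and not kopf or aiohttp code: `FakeResponse` and `TrackingSession` are hypothetical stand-ins that only show the principle of closing tracked responses before the session.

```python
import asyncio
import weakref


class FakeResponse:
    """Hypothetical stand-in for an open streaming response."""

    def __init__(self) -> None:
        self._closed = asyncio.Event()

    async def read_events(self) -> str:
        # Blocks until the response is closed, like a watch stream with no data.
        await self._closed.wait()
        return "stream closed"

    def close(self) -> None:
        self._closed.set()


class TrackingSession:
    """Hypothetical session wrapper that remembers its unclosed responses."""

    def __init__(self) -> None:
        self._responses = weakref.WeakSet()
        self.closed = False

    def request(self) -> FakeResponse:
        response = FakeResponse()
        self._responses.add(response)
        return response

    def close(self) -> None:
        # Close every tracked response first, so pending readers wake up.
        for response in list(self._responses):
            response.close()
        self.closed = True


async def main() -> str:
    session = TrackingSession()
    response = session.request()
    reader = asyncio.create_task(response.read_events())
    await asyncio.sleep(0)  # let the reader start waiting
    session.close()         # e.g. the credentials were invalidated
    return await asyncio.wait_for(reader, timeout=1)


outcome = asyncio.run(main())
print(outcome)  # stream closed
```

The weak set is one possible design choice: responses that are already garbage-collected drop out of the tracking on their own, so the session does not keep them alive.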