nolar / kopf

A Python framework to write Kubernetes operators in just a few lines of code

Home Page: https://kopf.readthedocs.io/

Handler does not fire on an Azure cluster

aristidesneto opened this issue

Long story short

I have a handler that listens to events from a deployment that contains a certain declared annotation.

The events are creation and update.

The logic is: when the hash of the application image changes during a deploy, the operator detects it, triggers the handler, and executes the programmed tasks, which are the deletion and re-creation of a secret.

When the image of a deployment that carries the annotation I defined is changed, nothing happens. There are no logs on the operator pod, and I don't know how to identify why the handler doesn't fire.

In this case, if I delete the operator pod, it identifies the change in the deployment on startup and triggers the handler, which is what I expect it to do.

The weird thing is that the problem happens on an Azure cluster; on some GCP clusters it works successfully.

Is there anything I can analyze to debug this problem? I don't know where to start.

Below is my RBAC file, in case it is something related to permissions.

Kopf version

1.36.1

Kubernetes version

1.24.10

Python version

3.10.6

Code

# main.py
import kopf
import ... # and others

@kopf.on.field(
    'deployments',
    field='spec.template.spec.containers',
    annotations={'my-custom-annotation': 'true'},
    old=kopf.PRESENT,
    new=kopf.PRESENT
)
def image_changed(name, spec, old, new, logger, namespace, annotations, **kwargs):
    # My code here
    # ...
    pass

----------------------
# my-app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-dp
  namespace: default
  annotations:
    my-custom-annotation: "true"
...
----------------------
# RBAC File YAML
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: operator
  name: my-operator-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: my-operator-role-cluster
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "deployments/status"]
  verbs: [get, list, watch, patch]
- apiGroups: ["*"]
  resources: ["secrets"]
  verbs: [get, list, create, delete]
- apiGroups: [""]
  resources: [events]
  verbs: [create]
- apiGroups: [""]
  resources: [namespaces]
  verbs: [list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: my-operator-cluster
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: my-operator-role-cluster
subjects:
- kind: ServiceAccount
  name: my-operator-sa
  namespace: operator

Logs

/usr/local/lib/python3.11/site-packages/kopf/_core/reactor/running.py:176: FutureWarning: Absence of either namespaces or cluster-wide flag will become an error soon. For now, switching to the cluster-wide mode for backward compatibility.
  warnings.warn("Absence of either namespaces or cluster-wide flag will become an error soon."
[2023-07-24 16:40:52,753] kopf._core.engines.a [INFO    ] Initial authentication has been initiated.
[2023-07-24 16:40:52,756] kopf.activities.auth [INFO    ] Activity 'login_via_client' succeeded.
[2023-07-24 16:40:52,756] kopf._core.engines.a [INFO    ] Initial authentication has finished.
[2023-07-24 16:40:53,264] kopf._core.reactor.o [WARNING ] Not enough permissions to watch for resources: changes (creation/deletion/updates) will not be noticed; the resources are only refreshed on operator restarts.

Additional information

No response

That might be caused by a known issue with a yet-unknown solution: Kubernetes sometimes “loses” the connection without closing it. Since the connection is technically open, Kopf does not reconnect and believes that nothing happens in the cluster.

If you search through the issues, Azure is mentioned several times as especially affected by this. My guess is that the problem is in the load balancers and their connections to the real control plane (in the chain: kopf->lb->k8s). Unconfirmed, though.

Often, setting the client-side (i.e. Kopf-side) connection timeout helps (see the settings). It is not the best solution, but it works: the operator might not notice changes for up to the configured timeout (e.g. 10 minutes), or will have to reconnect too often (if you set it to 1 minute). The “good” value depends on your individual case; there is no good default.
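
For reference, a minimal sketch of such a tuning via a startup handler; the concrete timeout values are only illustrative, pick them for your own case:

# settings.py: a sketch of the watch-stream timeouts mentioned above;
# the values are illustrative, not recommendations.
import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # Re-establish the watch-stream periodically, so that a silently dropped
    # connection is noticed within this window at worst.
    settings.watching.client_timeout = 600   # close the stream client-side after ~10 min
    settings.watching.server_timeout = 600   # also ask the API server to close it
    settings.watching.connect_timeout = 60   # fail fast if a new connection cannot be opened

Lower values make a stale connection be noticed sooner, at the cost of more frequent reconnects.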

I see no way to fix this on the Kopf side, unless there is some kind of ping-pong machinery in k8s above low-level TCP.

Hi, thanks for the quick response.

I'm not quite sure how I'm going to proceed yet, but I'll try to look into something more related to Azure and analyze the traffic further.

I will also validate changes to the connection timeout.

Thanks

We are facing a similar problem with Azure too. I was thinking about configuring a liveness probe which would somehow make a request to the k8s API to keep the connection alive, in case the LB kills connections after some time when they are not used. @nolar, what do you think?
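
For illustration only, such a probe could be wired through Kopf's built-in liveness endpoint roughly like this (a sketch assuming the official kubernetes Python client is installed; as the reply below explains, it only exercises new connections, not the stale watch stream):

# probe.py: an illustrative sketch, not a fix for the stale-watch issue.
import kopf
import kubernetes

@kopf.on.startup()
def login(**_):
    kubernetes.config.load_incluster_config()  # configure the extra client used by the probe

@kopf.on.probe(id='k8s_api')
def k8s_api_reachable(**_):
    # A lightweight request to the API server; the result is reported via the liveness endpoint.
    return kubernetes.client.VersionApi().get_code().git_version

The endpoint is exposed with e.g. kopf run --liveness=http://0.0.0.0:8080/healthz probe.py and can then be wired into the pod's livenessProbe.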

@francescotimperi The key problem is the existing connection, not a new one. The probe will show success since new TCP connections are established fine. It is the existing connection that remains open but dysfunctional. You need a response from k8s on that same connection to validate it; this is what I meant by ping-ponging. But I saw no such feature in k8s.