lyft / flinkk8soperator

Kubernetes operator that provides control plane for managing Apache Flink applications

Automate handling of dead task managers

tweise opened this issue

Occasionally we see a task manager JVM process get stuck: the task manager is no longer registered with Flink, but the process is also unable to exit. Because the stuck pod is never replaced, the job does not have enough task slots and goes into a recovery crash loop. It would be good if the operator could detect lost task managers and delete the corresponding pods, so that replacement TMs can come up and the application can recover.

The manual workaround is to compare the task manager pod IPs in the pod list against the task managers registered with Flink (as listed in the Flink UI), then kubectl delete the pods whose IPs are not registered.
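
The comparison step of that manual process could be sketched roughly as follows. This is only an illustration of the idea, not the operator's implementation: the `Pod` type and `find_unregistered_pods` are hypothetical names, and the sketch assumes the pod IPs (for example from `kubectl get pods -o wide`) and the IPs of the registered task managers (for example from the Flink UI or the Flink REST `/taskmanagers` endpoint) have already been collected.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Pod:
    """Hypothetical minimal view of a task manager pod."""
    name: str
    ip: str  # empty string if the pod has no IP assigned yet


def find_unregistered_pods(pods: Iterable[Pod],
                           registered_ips: Iterable[str]) -> List[Pod]:
    """Return pods whose IP is not among the task managers registered with Flink.

    Pods without an IP are skipped: they have not started yet, so they
    cannot be stuck task managers.
    """
    registered = set(registered_ips)
    return [pod for pod in pods if pod.ip and pod.ip not in registered]


if __name__ == "__main__":
    # Example data (made up for illustration).
    pods = [
        Pod("app-tm-0", "10.0.0.1"),
        Pod("app-tm-1", "10.0.0.2"),  # running, but not registered with Flink
        Pod("app-tm-2", ""),          # still pending, no IP yet
    ]
    registered = ["10.0.0.1"]

    for pod in find_unregistered_pods(pods, registered):
        # Print the command instead of running it, so the result can be reviewed.
        print(f"kubectl delete pod {pod.name}")
    # prints: kubectl delete pod app-tm-1
```

An automated version would probably also need a grace period before deleting anything, since a freshly started task manager is briefly unregistered while it connects to the job manager.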