Processing pod CrashLoopBackOff needs error handling
pindge opened this issue
Problem
The alchemist deployments are largely sitting in CrashLoopBackOff
state. Ideally a pod should complete gracefully if it cannot start properly because there is no message available in the queue.
The number next to CrashLoopBackOff
is the number of times the pod has been restarted:
alchemist-s2-nrt-processing-wo-5655f489bd-4xwj5 0/1 CrashLoopBackOff 1271 5d18h
alchemist-s2-nrt-processing-wo-5655f489bd-7kcbw 0/1 CrashLoopBackOff 1268 5d18h
alchemist-s2-nrt-processing-wo-5655f489bd-7x7wm 0/1 CrashLoopBackOff 439 45h
alchemist-s2-nrt-processing-wo-5655f489bd-c7twg 0/1 CrashLoopBackOff 1312 6d
alchemist-s2-nrt-processing-wo-5655f489bd-crsch 0/1 CrashLoopBackOff 1313 6d
alchemist-s2-nrt-processing-wo-5655f489bd-fmqj9 0/1 CrashLoopBackOff 1274 5d18h
alchemist-s2-nrt-processing-wo-5655f489bd-gb7cp 0/1 CrashLoopBackOff 1270 5d18h
alchemist-s2-nrt-processing-wo-5655f489bd-grlq7 0/1 CrashLoopBackOff 437 45h
alchemist-s2-nrt-processing-wo-5655f489bd-l56pd 0/1 CrashLoopBackOff 1317 6d
alchemist-s2-nrt-processing-wo-5655f489bd-m75cb 0/1 CrashLoopBackOff 1317 6d
alchemist-s2-nrt-processing-wo-5655f489bd-mhrl6 0/1 CrashLoopBackOff 1216 5d12h
alchemist-s2-nrt-processing-wo-5655f489bd-nn8mq 0/1 CrashLoopBackOff 1272 5d18h
alchemist-s2-nrt-processing-wo-5655f489bd-pjsj2 0/1 CrashLoopBackOff 1314 6d
alchemist-s2-nrt-processing-wo-5655f489bd-qhgjl 0/1 CrashLoopBackOff 1272 5d18h
alchemist-s2-nrt-processing-wo-5655f489bd-qp2m5 0/1 CrashLoopBackOff 1315 6d
alchemist-s2-nrt-processing-wo-5655f489bd-rgm65 0/1 CrashLoopBackOff 436 45h
alchemist-s2-nrt-processing-wo-5655f489bd-rhdnk 0/1 CrashLoopBackOff 335 34h
alchemist-s2-nrt-processing-wo-5655f489bd-rq7lv 0/1 CrashLoopBackOff 338 34h
alchemist-s2-nrt-processing-wo-5655f489bd-w8smj 0/1 CrashLoopBackOff 438 45h
alchemist-s2-nrt-processing-wo-5655f489bd-wh5vl 0/1 CrashLoopBackOff 1317 6d
Proposed solution
We need to add exception handling here:
datacube-alchemist/datacube_alchemist/cli.py, lines 132 to 144 (commit 2424239)
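Something along these lines would do it. This is only a minimal sketch, not the actual cli.py code: `process_message` is a hypothetical stand-in passed in as a callable, and the queue access uses the standard boto3 SQS resource API rather than whatever helper Alchemist actually uses.

```python
import logging
import sys

import boto3

_LOG = logging.getLogger(__name__)


def process_from_queue(queue_name, process_message):
    """Pull one message off the SQS queue and process it, completing
    cleanly if there is nothing to do."""
    queue = boto3.resource("sqs").get_queue_by_name(QueueName=queue_name)
    messages = queue.receive_messages(MaxNumberOfMessages=1, WaitTimeSeconds=10)

    if not messages:
        # No work available: exit 0 so the pod completes gracefully instead
        # of crashing and driving the CrashLoopBackOff counter up.
        _LOG.info("Queue %s is empty, nothing to do", queue_name)
        sys.exit(0)

    message = messages[0]
    try:
        process_message(message.body)
    except Exception:
        # Log and leave the message on the queue for a retry, rather than
        # letting an unhandled exception kill the container.
        _LOG.exception("Failed to process message %s", message.message_id)
        sys.exit(1)
    else:
        message.delete()
```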
To me this sounds more like an issue with Kubernetes and KEDA than with Alchemist itself.
If there are no messages in the queue, there shouldn't be more pods started.
If Alchemist is run against a queue that is empty, I think it should exit cleanly (which I think is what it already does). If this is causing issues with the K8s Pods, I think they need their configuration changed.
Yeah, the issue is that it starts and exits cleanly, but then k8s restarts it again.
This was resolved by Kirill in Statistician by switching over to Jobs. It is hopefully just a KEDA configuration change, but I'm not sure. It's not a big issue anyway.
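For reference, a KEDA ScaledJob for this worker might look roughly like the sketch below: KEDA launches short-lived Jobs based on the SQS queue length, so a worker that exits cleanly on an empty queue simply completes instead of being restarted by a Deployment. The name, image, queue URL and sizing values are placeholders, not the actual configuration, and the AWS authentication (TriggerAuthentication or node role) is omitted.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: alchemist-s2-nrt-processing-wo
spec:
  jobTargetRef:
    backoffLimit: 3
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: alchemist
            # Placeholder image/command; substitute the real deployment values.
            image: opendatacube/datacube-alchemist:latest
  pollingInterval: 60          # seconds between queue-length checks
  maxReplicaCount: 20          # upper bound on concurrent Jobs
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  triggers:
    - type: aws-sqs-queue
      metadata:
        # Placeholder queue URL and region.
        queueURL: https://sqs.<region>.amazonaws.com/<account-id>/<queue-name>
        queueLength: "1"
        awsRegion: <region>
```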