opendatacube / datacube-alchemist

Dataset to Dataset Transformations

Processing pods stuck in CrashLoopBackOff need error handling

pindge opened this issue

Problem

Alchemist deployments are largely sitting in CrashLoopBackOff state. Ideally, a pod should complete gracefully if it cannot start properly because there are no messages available in the queue.

The number next to CrashLoopBackOff is the number of times the pod has been restarted:

alchemist-s2-nrt-processing-wo-5655f489bd-4xwj5   0/1     CrashLoopBackOff   1271       5d18h
alchemist-s2-nrt-processing-wo-5655f489bd-7kcbw   0/1     CrashLoopBackOff   1268       5d18h
alchemist-s2-nrt-processing-wo-5655f489bd-7x7wm   0/1     CrashLoopBackOff   439        45h
alchemist-s2-nrt-processing-wo-5655f489bd-c7twg   0/1     CrashLoopBackOff   1312       6d
alchemist-s2-nrt-processing-wo-5655f489bd-crsch   0/1     CrashLoopBackOff   1313       6d
alchemist-s2-nrt-processing-wo-5655f489bd-fmqj9   0/1     CrashLoopBackOff   1274       5d18h
alchemist-s2-nrt-processing-wo-5655f489bd-gb7cp   0/1     CrashLoopBackOff   1270       5d18h
alchemist-s2-nrt-processing-wo-5655f489bd-grlq7   0/1     CrashLoopBackOff   437        45h
alchemist-s2-nrt-processing-wo-5655f489bd-l56pd   0/1     CrashLoopBackOff   1317       6d
alchemist-s2-nrt-processing-wo-5655f489bd-m75cb   0/1     CrashLoopBackOff   1317       6d
alchemist-s2-nrt-processing-wo-5655f489bd-mhrl6   0/1     CrashLoopBackOff   1216       5d12h
alchemist-s2-nrt-processing-wo-5655f489bd-nn8mq   0/1     CrashLoopBackOff   1272       5d18h
alchemist-s2-nrt-processing-wo-5655f489bd-pjsj2   0/1     CrashLoopBackOff   1314       6d
alchemist-s2-nrt-processing-wo-5655f489bd-qhgjl   0/1     CrashLoopBackOff   1272       5d18h
alchemist-s2-nrt-processing-wo-5655f489bd-qp2m5   0/1     CrashLoopBackOff   1315       6d
alchemist-s2-nrt-processing-wo-5655f489bd-rgm65   0/1     CrashLoopBackOff   436        45h
alchemist-s2-nrt-processing-wo-5655f489bd-rhdnk   0/1     CrashLoopBackOff   335        34h
alchemist-s2-nrt-processing-wo-5655f489bd-rq7lv   0/1     CrashLoopBackOff   338        34h
alchemist-s2-nrt-processing-wo-5655f489bd-w8smj   0/1     CrashLoopBackOff   438        45h
alchemist-s2-nrt-processing-wo-5655f489bd-wh5vl   0/1     CrashLoopBackOff   1317       6d

Proposed solution

We need to add exception handling here:

def run_from_queue(config_file, queue, limit, queue_timeout, dryrun, sns_arn):
    """
    Process messages from the given queue
    """
    alchemist = Alchemist(config_file=config_file)
    tasks_and_messages = alchemist.get_tasks_from_queue(queue, limit, queue_timeout)
    errors = 0
    successes = 0
    for task, message in tasks_and_messages:
        ...  # rest of the loop body; an unhandled exception raised here
             # crashes the container, and Kubernetes restarts the pod
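
A minimal sketch of what that exception handling could look like, assuming Alchemist is imported as in the existing module. process_task is a hypothetical stand-in for whatever the real loop body does (the dataset transformation and optional SNS publish), and message.delete() is an assumed SQS-style acknowledgement, not necessarily the actual Alchemist API:

    import logging
    import sys

    _LOG = logging.getLogger(__name__)


    def run_from_queue(config_file, queue, limit, queue_timeout, dryrun, sns_arn):
        """
        Process messages from the given queue, surviving per-task failures.
        """
        alchemist = Alchemist(config_file=config_file)  # imported as in the existing module
        tasks_and_messages = alchemist.get_tasks_from_queue(queue, limit, queue_timeout)
        errors = 0
        successes = 0

        for task, message in tasks_and_messages:
            try:
                # Hypothetical per-task worker standing in for the real loop body.
                process_task(alchemist, task, dryrun=dryrun, sns_arn=sns_arn)
                message.delete()  # assumed SQS-style acknowledgement
                successes += 1
            except Exception:
                # Log and carry on instead of letting one bad task kill the pod.
                _LOG.exception("Failed to process task %s", task)
                errors += 1

        _LOG.info("Finished: %s succeeded, %s failed", successes, errors)

        # An empty queue just means the loop never runs; exiting 0 marks the
        # container as Completed instead of Error.
        sys.exit(1 if errors else 0)

Note that the exit code alone won't stop the restarts shown above: Deployment pods run with restartPolicy: Always, so even a clean exit gets restarted, which is why the comments below point towards Jobs / KEDA configuration changes.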

To me this sounds more like an issue with Kubernetes and KEDA than with Alchemist itself.

If there are no messages in the queue, there shouldn't be more pods started.

If Alchemist is run against a queue that is empty, I think it should exit cleanly (which I think is what it already does). If this is causing issues with the K8s Pods, I think they need their configuration changed.

Yeah, the issue is that it gets restarted and exits cleanly, but then k8s restarts it again.

This was resolved by Kirill in Statistician by switching over to Jobs. It's hopefully just a KEDA configuration change, but I'm not sure. It's not a big issue anyway.