bug: crons with concurrency limits set cause the engine to crash
trisongz opened this issue · comments
I have hatchet self-hosted in a K8s cluster.
Container Images:
- engine: ghcr.io/hatchet-dev/hatchet/hatchet-engine:v0.26.1
- api: ghcr.io/hatchet-dev/hatchet/hatchet-api:v0.26.1
- rabbitmq: docker.io/bitnami/rabbitmq:3.13.2-debian-12-r0
SDK: Python - hatchet-sdk-0.23.0 (0.22.5 prior)
After version 0.23.0, I've consistently run into the following issue when a cron task gets triggered, which then causes a reboot loop on the engine container:
2024-05-10T15:34:53.555Z INF workflow 491b44e5-34ad-4764-847b-fedf8f838362 has concurrency settings service=workflows-controller
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x78 pc=0x14053cf]
goroutine 41 [running]:
github.com/hatchet-dev/hatchet/internal/services/controllers/workflows.(*WorkflowsControllerImpl).scheduleGetGroupAction(0xc00372ef50, {0x1b11638?, 0xc000751320?}, 0x0)
/hatchet/internal/services/controllers/workflows/queue.go:211 +0xcf
github.com/hatchet-dev/hatchet/internal/services/controllers/workflows.(*WorkflowsControllerImpl).handleWorkflowRunQueued(0xc00372ef50, {0x1b11590?, 0x3d2f720?}, 0xc000750900)
/hatchet/internal/services/controllers/workflows/queue.go:72 +0x618
github.com/hatchet-dev/hatchet/internal/services/controllers/workflows.(*WorkflowsControllerImpl).handleTask(0xc0004b3ad0?, {0x1b11590, 0x3d2f720}, 0xc000750900)
/hatchet/internal/services/controllers/workflows/controller.go:199 +0x117
github.com/hatchet-dev/hatchet/internal/services/controllers/workflows.(*WorkflowsControllerImpl).Start.func1(0xc0007ba000?)
/hatchet/internal/services/controllers/workflows/controller.go:161 +0x90
github.com/hatchet-dev/hatchet/internal/msgqueue/rabbitmq.(*MessageQueueImpl).subscribe.func1.2({{0x1b0f940, 0xc00062c7e0}, 0x0, {0x0, 0x0}, {0x0, 0x0}, 0x0, 0x0, {0x0, ...}, ...})
/hatchet/internal/msgqueue/rabbitmq/rabbitmq.go:502 +0x88b
created by github.com/hatchet-dev/hatchet/internal/msgqueue/rabbitmq.(*MessageQueueImpl).subscribe.func1 in goroutine 154
/hatchet/internal/msgqueue/rabbitmq/rabbitmq.go:451 +0x5c6
I've attempted the following to debug:
- Delete all workflows, which deletes the cron schedules and allows the engine container to get back up.
- Recreate rabbitmq, including the persistent data, which doesn't do anything.
I am able to trigger the workflow manually, but whenever the cron schedule triggers the workflow, that issue occurs.
Hey @trisongz, thanks for the report - I'll be taking a look at this today. This looks like an issue with the workflow run not being created properly from cron workflows if you have a concurrency limit setting on the workflow run. This isn't an issue with RabbitMQ, so no need to restart things on that side (the methods are just being triggered by a RabbitMQ message).
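For readers following the trace: the third argument to scheduleGetGroupAction is 0x0, i.e. a nil pointer that gets dereferenced at queue.go:211. The sketch below shows that general failure mode and a defensive guard in Go. It is not the actual Hatchet code or the actual fix; the types WorkflowRun and GetGroupAction, their fields, and the function body are all hypothetical stand-ins chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// GetGroupAction is a hypothetical stand-in for the record that resolves
// a run's concurrency group.
type GetGroupAction struct {
	ID string
}

// WorkflowRun is a hypothetical stand-in for a workflow run. In this sketch,
// GetGroupAction is nil when the run was created without its concurrency
// data, which is the situation assumed for cron-triggered runs.
type WorkflowRun struct {
	ID             string
	GetGroupAction *GetGroupAction
}

var errMissingGroupAction = errors.New("workflow run has no get-group action")

// scheduleGetGroupAction returns an error instead of dereferencing a nil
// pointer, so one malformed run cannot crash the whole process.
func scheduleGetGroupAction(run *WorkflowRun) (string, error) {
	if run == nil || run.GetGroupAction == nil {
		return "", errMissingGroupAction
	}
	return run.GetGroupAction.ID, nil
}

func main() {
	// Run missing its concurrency data (assumed shape of the cron case).
	cronRun := &WorkflowRun{ID: "cron-run"}
	if _, err := scheduleGetGroupAction(cronRun); err != nil {
		fmt.Println("skipping run:", err)
	}

	// Fully populated run (assumed shape of the manual-trigger case).
	manualRun := &WorkflowRun{
		ID:             "manual-run",
		GetGroupAction: &GetGroupAction{ID: "group-1"},
	}
	if id, err := scheduleGetGroupAction(manualRun); err == nil {
		fmt.Println("scheduled group action:", id)
	}
}
```

A guard like this only turns the panic into a handled error. The underlying problem described above, the run not being created properly from the cron trigger, would still need to be fixed where the run is created.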
Thanks for the response. I was able to confirm that the latest version works after removing the concurrency setting from the workflow.
This is fixed in v0.26.2