hatchet-dev / hatchet

A distributed, fault-tolerant task queue

Home Page: https://hatchet.run

bug: crons with concurrency limits set cause the engine to crash

trisongz opened this issue

I have hatchet self-hosted in a K8s cluster.

Container Images:

  • engine: ghcr.io/hatchet-dev/hatchet/hatchet-engine:v0.26.1
  • api: ghcr.io/hatchet-dev/hatchet/hatchet-api:v0.26.1
  • rabbitmq: docker.io/bitnami/rabbitmq:3.13.2-debian-12-r0

SDK: Python - hatchet-sdk 0.23.0 (0.22.5 previously)

After upgrading to SDK version 0.23.0, I've consistently run into the following issue whenever a cron task is triggered, which then puts the engine container into a restart loop:

2024-05-10T15:34:53.555Z INF workflow 491b44e5-34ad-4764-847b-fedf8f838362 has concurrency settings service=workflows-controller
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x78 pc=0x14053cf]

goroutine 41 [running]:
github.com/hatchet-dev/hatchet/internal/services/controllers/workflows.(*WorkflowsControllerImpl).scheduleGetGroupAction(0xc00372ef50, {0x1b11638?, 0xc000751320?}, 0x0)
	/hatchet/internal/services/controllers/workflows/queue.go:211 +0xcf
github.com/hatchet-dev/hatchet/internal/services/controllers/workflows.(*WorkflowsControllerImpl).handleWorkflowRunQueued(0xc00372ef50, {0x1b11590?, 0x3d2f720?}, 0xc000750900)
	/hatchet/internal/services/controllers/workflows/queue.go:72 +0x618
github.com/hatchet-dev/hatchet/internal/services/controllers/workflows.(*WorkflowsControllerImpl).handleTask(0xc0004b3ad0?, {0x1b11590, 0x3d2f720}, 0xc000750900)
	/hatchet/internal/services/controllers/workflows/controller.go:199 +0x117
github.com/hatchet-dev/hatchet/internal/services/controllers/workflows.(*WorkflowsControllerImpl).Start.func1(0xc0007ba000?)
	/hatchet/internal/services/controllers/workflows/controller.go:161 +0x90
github.com/hatchet-dev/hatchet/internal/msgqueue/rabbitmq.(*MessageQueueImpl).subscribe.func1.2({{0x1b0f940, 0xc00062c7e0}, 0x0, {0x0, 0x0}, {0x0, 0x0}, 0x0, 0x0, {0x0, ...}, ...})
	/hatchet/internal/msgqueue/rabbitmq/rabbitmq.go:502 +0x88b
created by github.com/hatchet-dev/hatchet/internal/msgqueue/rabbitmq.(*MessageQueueImpl).subscribe.func1 in goroutine 154
	/hatchet/internal/msgqueue/rabbitmq/rabbitmq.go:451 +0x5c6

I've attempted the following to debug:

  • Deleting all workflows, which removes the cron schedules and allows the engine container to come back up.
  • Recreating rabbitmq, including its persistent data, which has no effect.

I am able to trigger the workflow manually without any problems, but whenever the cron schedule triggers it, the panic occurs. The shape of the workflow is sketched below.
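Roughly, the crashing workflow looks like the following (a minimal sketch with illustrative names, assuming the v0.23-era class-based Python SDK; the workflow, cron expression, and group key are not my actual code, and any cron-triggered workflow with a concurrency group appears to reproduce it):

from hatchet_sdk import Hatchet, ConcurrencyLimitStrategy

hatchet = Hatchet()

# Illustrative workflow: a cron trigger combined with a concurrency limit.
@hatchet.workflow(on_crons=["*/5 * * * *"])
class CronWorkflow:
    # The concurrency group is the setting that appears to trigger the
    # engine panic when the run is started by the cron scheduler.
    @hatchet.concurrency(max_runs=1, limit_strategy=ConcurrencyLimitStrategy.CANCEL_IN_PROGRESS)
    def concurrency(self, context) -> str:
        return "cron-group"

    @hatchet.step()
    def step1(self, context):
        return {"status": "ok"}


worker = hatchet.worker("cron-worker")
worker.register_workflow(CronWorkflow())
worker.start()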

Hey @trisongz, thanks for the report - I'll be taking a look at this today. This looks like an issue with the workflow run not being created properly from a cron trigger when the workflow has a concurrency limit set. This isn't an issue with RabbitMQ, so there's no need to restart anything on that side (the methods are just being triggered by a RabbitMQ message).

Thanks for the response. I can confirm that after removing the concurrency setting from the workflow, the latest version works.
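For anyone else hitting this before the fix lands, the workaround is simply to drop the concurrency group from the cron workflow (the same illustrative sketch as above, minus the concurrency method):

# Interim workaround: cron trigger only, no concurrency group.
@hatchet.workflow(on_crons=["*/5 * * * *"])
class CronWorkflow:
    @hatchet.step()
    def step1(self, context):
        return {"status": "ok"}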

This is fixed in v0.26.2.