Enqueue timeout occurs when 2000 jobs are registered
dkuji opened this issue
dkron version: 3.2.1
OS: RockyLinux 8.7
I have configured a cluster of 3 nodes.
Registering 2000 jobs caused hundreds of enqueue timeouts.
Each job is a test job that simply executes an echo command.
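For reference, below is a minimal sketch of how such a batch of test jobs can be registered through Dkron's REST API (POST /v1/jobs). The host, port, job naming, and schedule are illustrative assumptions, not the exact script used here:

```go
// register_jobs.go: minimal sketch that registers N echo test jobs
// through Dkron's REST API (POST /v1/jobs). The endpoint, job names,
// and schedule below are assumptions for illustration only.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type job struct {
	Name           string            `json:"name"`
	Schedule       string            `json:"schedule"`
	Executor       string            `json:"executor"`
	ExecutorConfig map[string]string `json:"executor_config"`
}

func main() {
	const api = "http://localhost:8080/v1/jobs" // assumed Dkron HTTP endpoint

	for i := 0; i < 2000; i++ {
		j := job{
			Name:     fmt.Sprintf("test-job-performance-%d", i),
			Schedule: "0 * * * * *", // second 0 of every minute (6-field cron)
			Executor: "shell",
			ExecutorConfig: map[string]string{
				"command": "echo hello",
			},
		}

		body, err := json.Marshal(j)
		if err != nil {
			log.Fatalf("encoding %s: %v", j.Name, err)
		}

		resp, err := http.Post(api, "application/json", bytes.NewReader(body))
		if err != nil {
			log.Fatalf("registering %s: %v", j.Name, err)
		}
		resp.Body.Close()
		if resp.StatusCode >= 300 {
			log.Fatalf("registering %s: unexpected status %s", j.Name, resp.Status)
		}
	}
}
```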
Apr 27 12:10:39 host-002 dkron[309144]: time="2023-04-27T12:10:39+09:00" level=error msg="job: Error running job" error="agent: Run error storing job test-job-performance-1249 before running: timed out enqueuing operation" node=host-002
Apr 27 12:10:39 host-002 dkron[309144]: time="2023-04-27T12:10:39+09:00" level=error msg="job: Error running job" error="agent: Run error storing job test-job-performance-389 before running: timed out enqueuing operation" node=host-002
The dkron service then stops.
Apr 27 12:12:04 host-002 dkron[309144]: time="2023-04-27T12:12:04+09:00" level=fatal msg="agent: error applying SetExecutionType" error="timed out enqueuing operation" node=host-002
Apr 27 12:12:05 host-002 systemd[1]: dkron.service: Main process exited, code=exited, status=1/FAILURE
Apr 27 12:12:05 host-002 systemd[1]: dkron.service: Failed with result 'exit-code'.
Apr 27 12:12:05 host-002 systemd[1]: dkron.service: Service RestartSec=100ms expired, scheduling restart.
Apr 27 12:12:05 host-002 systemd[1]: dkron.service: Scheduled restart job, restart counter is at 2.
Apr 27 12:12:05 host-002 systemd[1]: Stopped Dkron Agent.
Apr 27 12:12:15 host-002 systemd[1]: Started Dkron Agent.
Also, in this state, registering or deleting jobs through the API or Web UI takes more than one minute to respond.
Is there any way to avoid this error?
We have confirmed that there is enough CPU and memory available on all nodes.
Something similar is happening in our environment as well. We run a single-node dkron as a StatefulSet in Kubernetes. The pod has enough CPU and memory.
A stream of "timed out enqueuing operation" messages, and then finally:
level=error msg="job: Error running job" error="agent: Run error storing job xxxx before running: timed out enqueuing operation" node=dkron-0
...
...
time="2023-05-06T18:30:39Z" level=fatal msg="agent: error applying SetExecutionType" error="timed out enqueuing operation" node=dkron-0
@vcastellm, @yvanoers, any ideas here?
How are you registering the jobs, @nikunj-badjatya @dkuji?
@vcastellm
I registered the jobs through the API.
Each job executes an echo command at second 0 of every minute.
Yes, the jobs are registered via the API only.
But there seems to be no correlation between when jobs are registered and when the pod restarts.
A restart happened again for us today.
Any updates on this, @vcastellm? We are seeing restarts because of this about once a week on average.
We observed that when the load (number of schedules) is reduced, the restarts stop.
We had more than 200K schedules and are now at just under 30K.
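For tracking that load, a quick way to check how many jobs a cluster currently holds is to list them through the API. A minimal sketch, assuming GET /v1/jobs on the default port returns the complete job list without pagination parameters:

```go
// count_jobs.go: minimal sketch that counts registered jobs via
// Dkron's REST API. Assumes the API listens on localhost:8080 and
// that GET /v1/jobs returns the full list in one response.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://localhost:8080/v1/jobs")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Decode only the job count; field contents are not needed here.
	var jobs []map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&jobs); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("registered jobs: %d\n", len(jobs))
}
```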
@vcastellm, any insights here?
@nikunj-badjatya this means that you've reached the limit of what the system can deliver with that hardware configuration through vertical scaling.
My recommendation at this point is to add more Dkron clusters and share the load between them.
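One way to share the load is to shard jobs deterministically across independent clusters on the client side. The sketch below is only an illustration of that idea; the cluster endpoints and hashing scheme are assumptions, and the split has to be handled by whatever registers the jobs, since it is not a built-in Dkron feature:

```go
// shard_jobs.go: illustrative sketch of client-side sharding across
// several independent Dkron clusters. Endpoints and the hashing
// scheme are assumptions for illustration.
package main

import (
	"fmt"
	"hash/fnv"
)

// clusters lists the API endpoints of the independent Dkron clusters.
var clusters = []string{
	"http://dkron-a:8080/v1/jobs",
	"http://dkron-b:8080/v1/jobs",
	"http://dkron-c:8080/v1/jobs",
}

// endpointFor maps a job name to one cluster deterministically, so the
// same job is always created, updated, and deleted on the same cluster.
func endpointFor(jobName string) string {
	h := fnv.New32a()
	h.Write([]byte(jobName))
	return clusters[int(h.Sum32())%len(clusters)]
}

func main() {
	for _, name := range []string{"test-job-performance-389", "test-job-performance-1249"} {
		fmt.Printf("%s -> %s\n", name, endpointFor(name))
	}
}
```

Because the mapping is deterministic, each cluster only ever sees its own subset of jobs, which keeps the per-cluster Raft log and scheduler load proportionally smaller.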