Enqueue timeout occurs when 2000 jobs are registered
dkuji opened this issue
dkron version: 3.2.1
OS: RockyLinux 8.7
I have configured a cluster of 3 nodes.
Registering 2000 jobs caused hundreds of enqueue timeouts.
Each job is a test job that simply executes an echo command.
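For reference, below is a minimal sketch of how such a batch of test jobs can be registered through Dkron's REST API (POST /v1/jobs). The host, port, job naming, and schedule are illustrative assumptions, not the exact script used here:

```go
// register_jobs.go: minimal sketch that registers N echo test jobs
// through Dkron's REST API (POST /v1/jobs). The endpoint, job names,
// and schedule below are assumptions for illustration only.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type job struct {
	Name           string            `json:"name"`
	Schedule       string            `json:"schedule"`
	Executor       string            `json:"executor"`
	ExecutorConfig map[string]string `json:"executor_config"`
}

func main() {
	const api = "http://localhost:8080/v1/jobs" // assumed Dkron HTTP endpoint

	for i := 0; i < 2000; i++ {
		j := job{
			Name:     fmt.Sprintf("test-job-performance-%d", i),
			Schedule: "0 * * * * *", // second 0 of every minute (6-field cron)
			Executor: "shell",
			ExecutorConfig: map[string]string{
				"command": "echo hello",
			},
		}

		body, err := json.Marshal(j)
		if err != nil {
			log.Fatalf("encoding %s: %v", j.Name, err)
		}

		resp, err := http.Post(api, "application/json", bytes.NewReader(body))
		if err != nil {
			log.Fatalf("registering %s: %v", j.Name, err)
		}
		resp.Body.Close()
		if resp.StatusCode >= 300 {
			log.Fatalf("registering %s: unexpected status %s", j.Name, resp.Status)
		}
	}
}
```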
Apr 27 12:10:39 host-002 dkron[309144]: time="2023-04-27T12:10:39+09:00" level=error msg="job: Error running job" error="agent: Run error storing job test-job-performance-1249 before running: timed out enqueuing operation" node=host-002
Apr 27 12:10:39 host-002 dkron[309144]: time="2023-04-27T12:10:39+09:00" level=error msg="job: Error running job" error="agent: Run error storing job test-job-performance-389 before running: timed out enqueuing operation" node=host-002
The dkron service then stops.
Apr 27 12:12:04 host-002 dkron[309144]: time="2023-04-27T12:12:04+09:00" level=fatal msg="agent: error applying SetExecutionType" error="timed out enqueuing operation" node=host-002
Apr 27 12:12:05 host-002 systemd[1]: dkron.service: Main process exited, code=exited, status=1/FAILURE
Apr 27 12:12:05 host-002 systemd[1]: dkron.service: Failed with result 'exit-code'.
Apr 27 12:12:05 host-002 systemd[1]: dkron.service: Service RestartSec=100ms expired, scheduling restart.
Apr 27 12:12:05 host-002 systemd[1]: dkron.service: Scheduled restart job, restart counter is at 2.
Apr 27 12:12:05 host-002 systemd[1]: Stopped Dkron Agent.
Apr 27 12:12:15 host-002 systemd[1]: Started Dkron Agent.
Also, in this state, registering or deleting jobs through the API or Web UI takes more than one minute to respond.
Is there any way to avoid this error?
We have confirmed that there is enough CPU and memory available on all nodes.
Something similar is happening in our environment as well. We run a single-node dkron as a StatefulSet in Kubernetes. The pod has enough CPU and memory.
A stream of "timed out enqueuing operation" messages, and then finally:
level=error msg="job: Error running job" error="agent: Run error storing job xxxx before running: timed out enqueuing operation" node=dkron-0
...
...
time="2023-05-06T18:30:39Z" level=fatal msg="agent: error applying SetExecutionType" error="timed out enqueuing operation" node=dkron-0
@vcastellm, @yvanoers, any ideas here?
How are you registering the jobs, @nikunj-badjatya @dkuji?
@vcastellm
I registered the jobs through the API.
Each job executes an echo command at second 0 of every minute.
Yes, the jobs are registered via the API only.
But there seems to be no correlation between when jobs are registered and when the pod restarts.
A restart happened again for us today.
Any updates on this, @vcastellm? We are seeing restarts because of this about once a week on average.
We observed that when the load (number of schedules) is reduced, the restarts stop.
We had more than 200K schedules and are now at just under 30K.
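For tracking that load, a quick way to check how many jobs a cluster currently holds is to list them through the API. A minimal sketch, assuming GET /v1/jobs on the default port returns the complete job list without pagination parameters:

```go
// count_jobs.go: minimal sketch that counts registered jobs via
// Dkron's REST API. Assumes the API listens on localhost:8080 and
// that GET /v1/jobs returns the full list in one response.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://localhost:8080/v1/jobs")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Decode only the job count; field contents are not needed here.
	var jobs []map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&jobs); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("registered jobs: %d\n", len(jobs))
}
```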
@vcastellm, any insights here?
@nikunj-badjatya this means that you've reached the limit of what the system can deliver with that hardware configuration through vertical scaling.
My recommendation at this point is to add more Dkron clusters and share the load between them.
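One way to share the load is to shard jobs deterministically across independent clusters on the client side. The sketch below is only an illustration of that idea; the cluster endpoints and hashing scheme are assumptions, and the split has to be handled by whatever registers the jobs, since it is not a built-in Dkron feature:

```go
// shard_jobs.go: illustrative sketch of client-side sharding across
// several independent Dkron clusters. Endpoints and the hashing
// scheme are assumptions for illustration.
package main

import (
	"fmt"
	"hash/fnv"
)

// clusters lists the API endpoints of the independent Dkron clusters.
var clusters = []string{
	"http://dkron-a:8080/v1/jobs",
	"http://dkron-b:8080/v1/jobs",
	"http://dkron-c:8080/v1/jobs",
}

// endpointFor maps a job name to one cluster deterministically, so the
// same job is always created, updated, and deleted on the same cluster.
func endpointFor(jobName string) string {
	h := fnv.New32a()
	h.Write([]byte(jobName))
	return clusters[int(h.Sum32())%len(clusters)]
}

func main() {
	for _, name := range []string{"test-job-performance-389", "test-job-performance-1249"} {
		fmt.Printf("%s -> %s\n", name, endpointFor(name))
	}
}
```

Because the mapping is deterministic, each cluster only ever sees its own subset of jobs, which keeps the per-cluster Raft log and scheduler load proportionally smaller.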