Betterment / delayed

a multi-threaded, SQL-driven ActiveJob backend used at Betterment to process millions of background jobs per day

Difference in error behavior when a job is undefined

daichi5 opened this issue · comments

Hi! We have migrated from DelayedJob to Delayed and are running it.
While observing the worker, we came across an error and noticed a difference in behavior.

Here is the stack trace in the last_error column.

Job failed to load: undefined class/module Sample. Handler: "--- !ruby/object:Delayed::JobWrapper
job_data:
  job_class: Sample
  job_id: 0f9f613a-682b-439c-8169-d45a4984e8eb
  provider_job_id:
  queue_name: v2
  priority:
  arguments: &2 []
  executions: 0
  exception_executions: &1 {}
  locale: en
  timezone: UTC
  enqueued_at: '2023-01-20T11:51:44Z'
job: !ruby/object:Sample
  arguments: []
  job_id: 0f9f613a-682b-439c-8169-d45a4984e8eb
  queue_name: v2
  priority:
  executions: 0
  exception_executions: *1
  timezone: UTC
  provider_job_id:
  serialized_arguments: *2
  locale: en
  enqueued_at: '2023-01-20T11:51:44Z'
"
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/backend/base.rb:84:in `rescue in payload_object'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/backend/base.rb:81:in `payload_object'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/backend/base.rb:133:in `max_run_time'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/worker.rb:194:in `max_run_time'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/worker.rb:137:in `block in run'
/usr/local/lib/ruby/3.0.0/benchmark.rb:308:in `realtime'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/worker.rb:136:in `run'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/worker.rb:210:in `block in run_job'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/lifecycle.rb:61:in `block in initialize'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/lifecycle.rb:66:in `execute'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/lifecycle.rb:40:in `run_callbacks'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/worker.rb:210:in `run_job'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/worker.rb:102:in `block (4 levels) in work_off'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/plugins/connection.rb:7:in `block (3 levels) in <class:Connection>'
/usr/local/bundle/gems/activerecord-6.1.6.1/lib/active_record/connection_adapters/abstract/connection_pool.rb:462:in `with_connection'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/plugins/connection.rb:6:in `block (2 levels) in <class:Connection>'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/lifecycle.rb:79:in `block (2 levels) in add'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/lifecycle.rb:61:in `block in initialize'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/lifecycle.rb:79:in `block in add'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/lifecycle.rb:66:in `execute'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/lifecycle.rb:40:in `run_callbacks'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/worker.rb:121:in `run_thread_callbacks'
/usr/local/bundle/gems/delayed-0.4.0/lib/delayed/worker.rb:101:in `block (3 levels) in work_off'
/usr/local/bundle/gems/concurrent-ruby-1.1.10/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:352:in `run_task'
/usr/local/bundle/gems/concurrent-ruby-1.1.10/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:343:in `block (3 levels) in create_worker'
/usr/local/bundle/gems/concurrent-ruby-1.1.10/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:334:in `loop'
/usr/local/bundle/gems/concurrent-ruby-1.1.10/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:334:in `block (2 levels) in create_worker'
/usr/local/bundle/gems/concurrent-ruby-1.1.10/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:333:in `catch'
/usr/local/bundle/gems/concurrent-ruby-1.1.10/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:333:in `block in create_worker'

This error occurs when a job is enqueued even though the job's class definition does not yet exist in the worker.
(For example, this can occur when adding a new job definition while the webapp and worker Deployments are being synchronized by ArgoCD.)

I noticed that an error occurs when executing the job in both DelayedJob and Delayed, but the difference is that Delayed does not retry the job and instead updates the failed_at column.

Enqueuing a job whose class does not exist is of course not ideal, but we expect the worker to retry and eventually complete the job once the code is up to date.

I assume this difference in behavior comes from how the two gems serialize and deserialize jobs.

In the case of Delayed, a job key is added to the serialized handler, and an error occurs when deserializing it.
The resulting DeserializationError is handled so that the job is not retried.
(In DelayedJob, there is no job key, so no error occurs during deserialization.)

--- !ruby/object:Delayed::JobWrapper
job_data:
 job_class: Sample
 job_id: c5d50b73-c8c9-4df0-8ea3-d2f6ca45b85d
 provider_job_id: 
 queue_name: v2
 priority: 
 arguments: &2 []
 executions: 0
 exception_executions: &1 {}
 locale: en
 timezone: UTC
 enqueued_at: '2023-01-20T04:17:09Z'
job: !ruby/object:Sample
 arguments: []
 job_id: c5d50b73-c8c9-4df0-8ea3-d2f6ca45b85d
 queue_name: v2
 priority: 
 executions: 0
 exception_executions: *1
 timezone: UTC
 provider_job_id: 
 serialized_arguments: *2
 locale: en
 enqueued_at: '2023-01-20T04:17:09Z'
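
For reference, this failure can be reproduced outside of Delayed with plain Psych. The snippet below is only a minimal, standalone illustration (it is not Delayed's own code); it loads a handler fragment containing a !ruby/object tag for a class that is not defined in the process:

require 'yaml'

# A handler fragment like the job: section above, referencing a class
# that is not defined here.
handler = <<~YAML
  --- !ruby/object:Sample
  arguments: []
YAML

begin
  # Psych 4 (Ruby 3.1+) needs unsafe_load to revive arbitrary Ruby objects;
  # on Ruby 3.0 the default YAML.load already behaves this way.
  YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(handler) : YAML.load(handler)
rescue ArgumentError => e
  puts e.message # => "undefined class/module Sample"
end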

I assume the job could still be executed even if this job key were not present.

Is there any intention behind this difference in specifications?

I thought one solution would be to not include the job key in the handler column. What do you think?👀

Hi @daichi5! Thanks for the report, and good catch! You're spot on -- that job: section is redundant and unnecessary, and was not the result of anything intentional. The @job ivar is simply there to support memoization of the instance of the payload class, which we delegate method calls to so that API methods like max_run_time and max_attempts are honored by ActiveJob-executed jobs.
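
To sketch what that delegation looks like (a hypothetical illustration only, not the gem's actual source; the class name and structure here are assumed):

require 'active_job'

# Hypothetical wrapper showing the memoized-delegation pattern described above.
class JobWrapperSketch
  def initialize(job_data)
    @job_data = job_data # the serialized ActiveJob payload (the job_data: section in the YAML)
  end

  # Worker-facing API methods (max_run_time, max_attempts, ...) are forwarded to
  # the underlying ActiveJob instance so per-class overrides are honored.
  def method_missing(name, *args, &block)
    job.respond_to?(name) ? job.public_send(name, *args, &block) : super
  end

  def respond_to_missing?(name, include_private = false)
    job.respond_to?(name, include_private) || super
  end

  private

  # Memoizing into @job is what let the instance leak into the YAML dump as the
  # redundant job: key whenever the wrapper itself was serialized.
  def job
    @job ||= ActiveJob::Base.deserialize(@job_data)
  end
end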

Circling back! This behavior should be changed in the 0.5.0 release now.

@smudge
Hi, I have checked the recent 0.5.0 update.

Although undefined jobs are now retried, rescheduling does not seem to work properly for them: a NameError is raised when job.max_attempts is referenced.

As a result, the attempts value is never incremented and the job is retried indefinitely. Could you please check this?
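
For illustration, here is a standalone sketch of the loop described above (the names are made up and none of this is Delayed's code); it only models the sequence of calls: if reading max_attempts raises before the attempts counter is bumped, the job is rescheduled unchanged and picked up again.

# Hypothetical model of the retry loop; not the gem's implementation.
class FakeJobRecord
  attr_accessor :attempts

  def initialize
    @attempts = 0
  end

  # Stands in for delegation to the payload object: the job class is not
  # defined in the worker, so resolving it raises NameError.
  def max_attempts
    raise NameError, 'uninitialized constant Sample'
  end
end

job = FakeJobRecord.new

3.times do
  job.max_attempts  # raises before any bookkeeping happens
  job.attempts += 1 # never reached, so the counter stays at 0
rescue NameError => e
  puts "poll with attempts=#{job.attempts}: #{e.message}"
end
# Prints attempts=0 every time: the counter never advances, so the job
# keeps being picked up and retried indefinitely.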