Incorrect count for attempts upon ultimate failure?

christopherraa opened this issue 3 years ago

Christopher Rasch-Olsen Raa commented 3 years ago

Minion version: 10.22
Perl version: v5.30.0
Operating system: Ubuntu 21.04

Steps to reproduce the behavior

1. Enqueue a job that should fail after a set number of attempts
2. Wait for it to progress to a failed state, meaning all attempts should be exhausted
3. Check whether attempts has reached 0

The following code demonstrates the issue:

#!/usr/bin/env perl

use Mojo::Base -strict;
use Mojo::Pg;
use Mojolicious::Lite;
use Test::Mojo;
use Test2::V0;

my $t  = Test::Mojo->new;

helper pg     => sub { state $pg = Mojo::Pg->new('postgresql:///') };
plugin Minion => { Pg => app->pg };

my $attempts = 0;
app->minion->add_task(dies => sub { $attempts++; shift->fail(['ouch']) });
app->minion->backoff(sub { 0 });
my $id = app->minion->enqueue(dies => [] => { attempts => 10});


my $task     = app->pg->db->select('minion_jobs', '*', { id => $id })->expand->hash;
my $expected = { attempts => 10, retries => 0, result => undef, state => 'inactive' };
like $task, $expected, 'task should be inactive and have 10 remaining attempts';

app->minion->perform_jobs_in_foreground();

$task     = app->pg->db->select('minion_jobs', '*', { id => $id })->expand->hash;
$expected = { attempts => 0, retries => 9, result => ['ouch'], state => 'failed' };
like $task, $expected, 'task should have zero attempts remaining and be retried 9 times';
is $attempts, 10, 'ten attempts recorded';

done_testing();

Expected behavior

I would expect attempts on the job to be set to 0 when all retries have been exhausted, and retries to be set to <attempts> - 1, as that would be the actual number of retries. In other words: total number of attempts = the initial attempt + nine retries.

Expected output from the above script:

# Seeded srand with seed '20210809' from local date.
ok 1 - task should be inactive and have 10 remaining attempts
ok 2 - task should have zero attempts remaining and be retried 9 times
ok 3 - ten attempts recorded
1..3

Actual behavior

The task is marked with the correct number of performed retries, but remaining attempts stops at 1 instead of 0.

Actual output from the script:

# Seeded srand with seed '20210809' from local date.
ok 1 - task should be inactive and have 10 remaining attempts
not ok 2 - task should have zero attempts remaining and be retried 9 times
# Failed test 'task should have zero attempts remaining and be retried 9 times'
# at ./retries-test.t line 28.
# +------------+-----+----+-------+
# | PATH       | GOT | OP | CHECK |
# +------------+-----+----+-------+
# | {attempts} | 1   | eq | 0     |
# +------------+-----+----+-------+
ok 3 - ten attempts recorded
1..3
# Looks like you failed 1 test of 3.

Given that attempts steadily decreases as the job is retried, I would assume that attempts hits zero once no attempts remain.

To me it seems like the meaning of attempts changes during the task lifecycle. By that I mean that attempts starts out as "how many times the job is to be attempted before transitioning to a failed state", but after the task has transitioned to failed the meaning is suddenly "indication that the task had one initial attempt", so that if you add attempts and retries together you get the same number as initially passed in the attempts attribute to ->enqueue().

Christopher Rasch-Olsen Raa · Answer 1 · Tue Aug 10 2021 04:21:47 GMT+0800 (China Standard Time)

Thinking about this further it seems like attempts has three distinct meanings:

before performing task: "how many times the job is to be attempted before transitioning to a failed state"
during task execution: "remaining attempts before giving up"
after transitioning to failed: "indication that the task had one initial attempt"
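
To make the three meanings concrete, here is a minimal sketch in plain Perl that reproduces the numbers from the test output above. It is only a model of the observed counters, not Minion's actual implementation; the function name simulate_observed and the hash layout are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical model of the counters seen in the failing test.
# This is NOT Minion code, only bookkeeping that yields the same numbers.
sub simulate_observed {
  my ($configured) = @_;
  my %job  = (attempts => $configured, retries => 0, state => 'inactive');
  my $runs = 0;

  while (1) {
    $job{state} = 'active';
    $runs++;                        # the task runs and fails
    $job{state} = 'failed';
    last if $job{attempts} <= 1;    # final run: the counter is left at 1
    $job{attempts}--;               # otherwise one attempt is used up
    $job{retries}++;                # and the job is queued for a retry
    $job{state} = 'inactive';
  }

  return +{%job, runs => $runs};
}

my $job = simulate_observed(10);
printf "attempts=%d retries=%d runs=%d state=%s\n",
  @{$job}{qw(attempts retries runs state)};
```

In this model the final state always satisfies attempts + retries = the configured value, which matches the third meaning: the leftover 1 stands for the initial attempt.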

Sebastian Riedel · Answer 2 · Tue Aug 10 2021 05:26:10 GMT+0800 (China Standard Time)

Happy to consider a PR.

Christopher Rasch-Olsen Raa · Answer 3 · Tue Aug 10 2021 16:06:09 GMT+0800 (China Standard Time)

I'm a bit unsure what the "correct" behaviour would be, and even what kind of solution could be accepted. The way I see this issue, it is hard to solve without introducing breaking changes in some way.

The simplest solution that I can think of would be to let attempts reach zero when the last attempt is done and make sure that the documentation clearly says that attempts denotes "the number of remaining attempts".

Another solution would be to make a distinction between "the number of attempts each job is configured with" and "the number of remaining attempts". This can be achieved by adding a new property to the job that holds the number of attempts that was initially configured for the job, which would not change during job execution / retries. Naming is hard, but names such as total_attempts, attempt_limit or something of the sort could be considered. attempts would then still signify "the number of remaining attempts" and would at the time of enqueue() be set to the same value as attempt_limit, which would default to 1 to keep things similar to how it is today. In any case I think I would like to see attempts hit zero once all attempts are exhausted. The reason I'd like to see this separation of fields is that with this solution neither of them would change what it signifies during the job lifetime.
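
A rough sketch of that second proposal, again in plain Perl and independent of Minion. The attempt_limit field and the helpers new_job and record_failure are hypothetical names used only to illustrate the bookkeeping; nothing like them exists in Minion today as far as this issue is concerned.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical: attempt_limit is fixed at enqueue time and never changes,
# while attempts always means "remaining attempts" and counts down to 0.
sub new_job {
  my (%args) = @_;
  my $limit = $args{attempt_limit} // 1;    # default of 1, as today
  return {attempt_limit => $limit, attempts => $limit, retries => 0, state => 'inactive'};
}

# Record one failed run and decide whether the job gets another attempt.
sub record_failure {
  my ($job) = @_;
  $job->{attempts}--;                       # one attempt has been used up
  if ($job->{attempts} > 0) {
    $job->{retries}++;
    $job->{state} = 'inactive';             # queued for another attempt
  }
  else {
    $job->{state} = 'failed';               # nothing left, give up
  }
  return $job;
}

my $job = new_job(attempt_limit => 10);
record_failure($job) until $job->{state} eq 'failed';
printf "attempt_limit=%d attempts=%d retries=%d state=%s\n",
  @{$job}{qw(attempt_limit attempts retries state)};
```

With this layout the number of attempts already used is always attempt_limit - attempts, and neither field changes meaning between enqueue and failure.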

Do you have any thoughts or preferences here?