worker leaving defunct processes
srchulo opened this issue
- Minion version: 9.13
- Perl version: 5.30
- Operating system: CentOS Linux release 7.5.1804 (Core)
Steps to reproduce the behavior
Here is the code for the task I am running (I know shelling out work to other servers over SSH isn't the ideal way to use Minion... it's a stepping stone :)). I don't think this code is relevant to the issue, but I'm including it for completeness:
use Net::OpenSSH;

app->minion->add_task(system_command => sub {
    my ($job, $lock_name, $host, $command, @args) = @_;

    my $guard;
    if ($lock_name) {
        return $job->retry({attempts => 100, delay => 10})
            unless $guard = $job->minion->guard($lock_name, 86_400);
    }

    my $ssh = Net::OpenSSH->new($host);
    if ($ssh->error) {
        die qq{Failed to establish ssh connection to host '$host' } . $ssh->error;
    }

    my ($stdout, $stderr) = $ssh->capture2({timeout => 10}, $command, @args);
    my $result = { stdout => $stdout, stderr => $stderr };
    if ($ssh->error) {
        $result->{ssh_error} = $ssh->error;
        return $job->fail($result);
    }

    return $job->finish($result);
});
Here is the command that I'm running to start my worker:
perl minion.pl minion worker -m production -j 4
Expected behavior
I expect that once jobs are completed, their processes should be reaped and disappear from the process list.
Actual behavior
I see many defunct perl processes under the Minion worker: [process list screenshot]
However, I would think that these lines in Minion::Worker (lines 122 to 124 in dcc6146), together with this is_finished check in Minion::Job (lines 36 to 41 in 68ae840), should mean that this does not occur, since the jobs are calling waitpid.
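For reference, here's the reaping pattern I assumed was happening under the hood (a standalone sketch, not Minion's actual code): the parent tracks each forked child and reaps it with a non-blocking waitpid:

use strict;
use warnings;
use POSIX ':sys_wait_h';

my %jobs;
my $pid = fork // die "fork failed: $!";
if ($pid == 0) { sleep 1; exit 0 }    # child: stand-in for a running job
$jobs{$pid} = 'some job';

# Poll with a non-blocking waitpid; once it returns the pid, the child
# has been reaped and will no longer show up as <defunct>
while (%jobs) {
    for my $p (keys %jobs) {
        delete $jobs{$p} if waitpid($p, WNOHANG) == $p;
    }
    select undef, undef, undef, 0.1;  # brief pause between polls
}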
Can you replicate the issue without using 3rd party modules?
@kraih Good point. I will give that a try and get back to you.
So I used perl's built-in system instead of the Net::OpenSSH module, and I still see these defunct processes. There is one part of my code that I took out to simplify things, but it seems like it may be relevant: I didn't want to start any new Minion jobs between 4 and 7 AM, so I delay any jobs that start during that period until 7 AM. I ran Minion all day and didn't have any defunct processes until then (I believe this is also what I saw when using Net::OpenSSH). Here's my updated code, just using system and including the delay-until-7-AM logic:
use Time::Moment;

app->minion->add_task(system_command => sub {
    my ($job, $lock_name, $host, $command, @args) = @_;

    # Delay any job that starts between 4 and 7 AM until 7 AM
    my $now  = Time::Moment->now;
    my $hour = $now->hour;
    if ($hour >= 4 and $hour < 7) {
        my $seven_am = Time::Moment->new(
            year       => $now->year,
            month      => $now->month,
            day        => $now->day_of_month,
            hour       => 7,
            minute     => 0,
            second     => 0,
            nanosecond => 0,
            offset     => $now->offset,
        );
        return $job->retry({attempts => 100, delay => $now->delta_seconds($seven_am)});
    }

    my $guard;
    if ($lock_name) {
        return $job->retry({attempts => 100, delay => 10})
            unless $guard = $job->minion->guard($lock_name, 86_400);
    }

    my $exit = system('ssh', $host, $command, @args);
    my $result = { exit => $exit };
    if ($exit) {
        $result->{ssh_error} = "ssh error: $?";
        return $job->fail($result);
    }

    return $job->finish($result);
});
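As an aside (not part of the bug, just for anyone reading along), the standard way to decode $? after system, per the Perl docs, looks like this; my task above simply treats any nonzero $exit as failure:

# Standard decoding of system()'s return value (from perldoc -f system);
# $host, $command and @args are the same variables as in the task above.
my $exit = system('ssh', $host, $command, @args);
if ($exit == -1) {
    warn "failed to execute: $!";
}
elsif ($? & 127) {
    warn sprintf "child died with signal %d", $? & 127;
}
else {
    warn sprintf "child exited with value %d", $? >> 8;
}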
Here are all of the jobs starting at 7: [process list screenshot]
There needs to be a minimal test case before this can be investigated.
It seems like it should be possible to remove the uses of system() and Time::Moment entirely, and boil it down to a simple script? Does the problem still exist then? If not, perhaps your external processes are the zombies you are seeing, and you can reproduce that without using Minion at all?
@kraih Sorry, I added more details because I thought they might be relevant. I will work on making a more minimal test case.
@karenetheridge I will boil it down to a simple script and see if it still occurs. I wasn't sure if maybe Minion didn't play well with jobs that fork and introduce their own SIG handlers for the process, like the system documentation mentions:
Since system does a fork and wait it may affect a SIGCHLD handler.
But system should wait for the processes that it forks, so that shouldn't be a problem. And the forked Minion::Job is the one whose SIG handlers may be affected, but my understanding is that it should be the parent process (Minion::Worker), not the child, that's waiting on any processes, and its SIG handlers shouldn't be affected.
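To illustrate the kind of interference I was worried about (a standalone example, nothing Minion-specific): on most Unix platforms, ignoring SIGCHLD makes the kernel auto-reap children, which leaves system() with nothing to wait on:

use strict;
use warnings;

local $SIG{CHLD} = 'IGNORE';        # kernel auto-reaps exited children
my $exit = system('true');          # system()'s internal waitpid finds no child
print "system returned $exit\n";    # typically -1 here, with $? also -1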
I'll make a minimum script like you mentioned and maybe that will make things clearer.
Okay, so I think I've figured out what's going on here. Here is a minimal test case:
app->minion->add_task(retry => sub {
    my ($job) = @_;
    return $job->retry({attempts => 100, delay => -1});
});
Then on the command line:
perl minion.pl minion job -e retry
to add one job (it may help to do this a few times). Then start some workers:
perl minion.pl minion worker -j 3
And you should start to see the defunct processes begin to accumulate.
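If it helps, here's a quick way to watch them accumulate without scanning ps output by hand (plain Perl, nothing Minion-specific):

# Count processes that ps marks as <defunct>
perl -e 'print scalar(grep /<defunct>/, `ps aux`), "\n"'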
The reason I believe this happens is these lines in Minion::Worker (lines 130 to 133 in dcc6146): Minion::Worker dequeues a job that it already has and overwrites it in the $jobs hash (line 133 in dcc6146). This means that is_finished is never successfully called on the old job, which means that waitpid is never successfully called on that job's pid, and you get a defunct zombie perl process.
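You can reproduce the resulting zombie without Minion at all; here's a contrived sketch of the same mechanism (once the parent loses track of the pid, nobody ever calls waitpid on it):

use strict;
use warnings;

my %jobs;
my $pid = fork // die "fork failed: $!";
exit 0 if $pid == 0;           # child exits immediately

$jobs{42} = $pid;              # parent tracks the child under job id 42
$jobs{42} = 'overwritten';     # same id reused: the old pid is lost

sleep 5;                       # the child now shows as <defunct> in ps,
                               # because waitpid is never called on it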
This code below is not a good permanent fix, but it verifies that this is the problem. When you replace line 133 above with:
if ($job and exists $jobs->{$job->id}) {
    $job->app->log->info('job ' . $job->id . ' already existed!');
    $job->retry({delay => 1});
}
else {
    $jobs->{$job->id} = $job->start if $job;
}
You will get log messages like:
[2019-11-19 22:40:46.93925] [11701] [info] job 59 already existed!
And no defunct perl processes will show up, since waitpid will be successfully called on the job's existing pid before the same job is dequeued again.
Yes, I see the problem now. But your proposed fix is not good.
@kraih I agree. That "fix" was just meant to demonstrate the problem:
This code below is not a good permanent fix, but it verifies that this is the problem.
That should fix it.
Thank you!!