worker leaving defunct processes
srchulo opened this issue
- Minion version: 9.13
- Perl version: 5.30
- Operating system: CentOS Linux release 7.5.1804 (Core)
Steps to reproduce the behavior
Here is the code for the task I am running (I know shelling out work to other servers over SSH isn't the ideal way to use Minion... it's a stepping stone :)). I don't think this code is relevant to the issue, but I'm including it for completeness:
use Net::OpenSSH;

app->minion->add_task(system_command => sub {
    my ($job, $lock_name, $host, $command, @args) = @_;

    my $guard;
    if ($lock_name) {
        return $job->retry({attempts => 100, delay => 10})
            unless $guard = $job->minion->guard($lock_name, 86_400);
    }

    my $ssh = Net::OpenSSH->new($host);
    if ($ssh->error) {
        die qq{Failed to establish ssh connection to host '$host' } . $ssh->error;
    }

    my ($stdout, $stderr) = $ssh->capture2({timeout => 10}, $command, @args);
    my $result = { stdout => $stdout, stderr => $stderr };
    if ($ssh->error) {
        $result->{ssh_error} = $ssh->error;
        return $job->fail($result);
    }

    return $job->finish($result);
});
Here is the command that I'm running to start my worker:
perl minion.pl minion worker -m production -j 4
Expected behavior
I expect that once jobs are completed, their processes should be reaped and disappear from the process list.
Actual behavior
I see many defunct perl processes under the Minion worker: [process list screenshot]
However, I would think that these lines in Minion::Worker (lines 122 to 124 in dcc6146), together with this is_finished check in Minion::Job (lines 36 to 41 in 68ae840), should mean that this does not occur, since the jobs are calling waitpid.
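For reference, here's the reaping pattern I assumed was happening under the hood (a standalone sketch, not Minion's actual code): the parent tracks each forked child and reaps it with a non-blocking waitpid:

use strict;
use warnings;
use POSIX ':sys_wait_h';

my %jobs;
my $pid = fork // die "fork failed: $!";
if ($pid == 0) { sleep 1; exit 0 }    # child: stand-in for a running job
$jobs{$pid} = 'some job';

# Poll with a non-blocking waitpid; once it returns the pid, the child
# has been reaped and will no longer show up as <defunct>
while (%jobs) {
    for my $p (keys %jobs) {
        delete $jobs{$p} if waitpid($p, WNOHANG) == $p;
    }
    select undef, undef, undef, 0.1;  # brief pause between polls
}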
Can you replicate the issue without using 3rd party modules?
@kraih Good point. I will give that a try and get back to you.
So I used perl's built-in system instead of the Net::OpenSSH module, and I still see these defunct processes. There is one part of my code that I took out to simplify things, but it seems like it may be relevant: I didn't want to start any new Minion jobs between 4 and 7 AM, so I delay any jobs that start during that period until 7 AM. I ran Minion all day and didn't have any defunct processes until then (I believe this is also what I saw when using Net::OpenSSH). Here's my updated code, just using system and including the delay-until-7-AM logic:
use Time::Moment;

app->minion->add_task(system_command => sub {
    my ($job, $lock_name, $host, $command, @args) = @_;

    # Delay any job that starts between 4 and 7 AM until 7 AM
    my $now  = Time::Moment->now;
    my $hour = $now->hour;
    if ($hour >= 4 and $hour < 7) {
        my $seven_am = Time::Moment->new(
            year       => $now->year,
            month      => $now->month,
            day        => $now->day_of_month,
            hour       => 7,
            minute     => 0,
            second     => 0,
            nanosecond => 0,
            offset     => $now->offset,
        );
        return $job->retry({attempts => 100, delay => $now->delta_seconds($seven_am)});
    }

    my $guard;
    if ($lock_name) {
        return $job->retry({attempts => 100, delay => 10})
            unless $guard = $job->minion->guard($lock_name, 86_400);
    }

    my $exit = system('ssh', $host, $command, @args);
    my $result = { exit => $exit };
    if ($exit) {
        $result->{ssh_error} = "ssh error: $?";
        return $job->fail($result);
    }

    return $job->finish($result);
});
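As an aside (not part of the bug, just for anyone reading along), the standard way to decode $? after system, per the Perl docs, looks like this; my task above simply treats any nonzero $exit as failure:

# Standard decoding of system()'s return value (from perldoc -f system);
# $host, $command and @args are the same variables as in the task above.
my $exit = system('ssh', $host, $command, @args);
if ($exit == -1) {
    warn "failed to execute: $!";
}
elsif ($? & 127) {
    warn sprintf "child died with signal %d", $? & 127;
}
else {
    warn sprintf "child exited with value %d", $? >> 8;
}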
Here are all of the jobs starting at 7: [process list screenshot]
There needs to be a minimal test case before this can be investigated.
It seems like it should be possible to remove the uses of system() and Time::Moment entirely, and boil it down to a simple script? Does the problem still exist then? If not, perhaps your external processes are the zombies you are seeing, and you can reproduce that without using Minion at all?
@kraih Sorry, I added more details because I thought they might be relevant. I will work on making a more minimal test case.
@karenetheridge I will boil it down to a simple script and see if it still occurs. I wasn't sure if maybe Minion didn't play well with jobs that fork and introduce their own SIG handlers for the process, like the system documentation mentions:
Since system does a fork and wait it may affect a SIGCHLD handler.
But system should wait for the processes that it forks, so that shouldn't be a problem. And the forked Minion::Job is the one whose SIG handlers may be affected, but my understanding is that it should be the parent process (Minion::Worker), not the child, that's waiting on any processes, and its SIG handlers shouldn't be affected.
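To illustrate the kind of interference I was worried about (a standalone example, nothing Minion-specific): on most Unix platforms, ignoring SIGCHLD makes the kernel auto-reap children, which leaves system() with nothing to wait on:

use strict;
use warnings;

local $SIG{CHLD} = 'IGNORE';        # kernel auto-reaps exited children
my $exit = system('true');          # system()'s internal waitpid finds no child
print "system returned $exit\n";    # typically -1 here, with $? also -1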
I'll make a minimum script like you mentioned and maybe that will make things clearer.
Okay, so I think I've figured out what's going on here. Here is a minimal test case:
app->minion->add_task(retry => sub {
    my ($job) = @_;
    return $job->retry({attempts => 100, delay => -1});
});
Then on the command line:
perl minion.pl minion job -e retry
to add one job (it may help to do this a few times). Then start some workers:
perl minion.pl minion worker -j 3
And you should start to see the defunct processes begin to accumulate.
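If it helps, here's a quick way to watch them accumulate without scanning ps output by hand (plain Perl, nothing Minion-specific):

# Count processes that ps marks as <defunct>
perl -e 'print scalar(grep /<defunct>/, `ps aux`), "\n"'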
The reason I believe this happens is these lines in Minion::Worker (lines 130 to 133 in dcc6146): Minion::Worker dequeues a job that it already has and overwrites it in the $jobs hash (line 133 in dcc6146). This means that is_finished is never successfully called on the old job, which means that waitpid is never successfully called on that job's pid, and you get a defunct zombie perl process.
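You can reproduce the resulting zombie without Minion at all; here's a contrived sketch of the same mechanism (once the parent loses track of the pid, nobody ever calls waitpid on it):

use strict;
use warnings;

my %jobs;
my $pid = fork // die "fork failed: $!";
exit 0 if $pid == 0;           # child exits immediately

$jobs{42} = $pid;              # parent tracks the child under job id 42
$jobs{42} = 'overwritten';     # same id reused: the old pid is lost

sleep 5;                       # the child now shows as <defunct> in ps,
                               # because waitpid is never called on it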
This code below is not a good permanent fix, but it verifies that this is the problem. When you replace line 133 above with:
if ($job and exists $jobs->{$job->id}) {
    $job->app->log->info('job ' . $job->id . ' already existed!');
    $job->retry({delay => 1});
}
else {
    $jobs->{$job->id} = $job->start if $job;
}
You will get log messages like:
[2019-11-19 22:40:46.93925] [11701] [info] job 59 already existed!
And no defunct perl processes will show up, since waitpid will be successfully called on the job's existing pid before the same job is dequeued again.
Yes, I see the problem now. But your proposed fix is not good.
@kraih I agree. That "fix" was just meant to demonstrate the problem:
This code below is not a good permanent fix, but it verifies that this is the problem.
That should fix it.
Thank you!!