rfdonnelly / jobrnr

Jobrnr runs jobs.

Errno::ESRCH on 3rd Ctrl-C

rfdonnelly opened this issue

Ctrl-C handling needs more work.

Stopping job submission. Allowing active jobs to finish.
Ctrl-C again to terminate active jobs gracefully.
^C
Terminating by sending Ctrl-C (SIGINT) to jobs.
Ctrl-C again to send Ctrl-C (SIGINT) again.
^C
Terminating by sending Ctrl-C (SIGINT) to jobs.
Ctrl-C again to send Ctrl-C (SIGINT) again.
Traceback (most recent call last):
        9: from ./jobrnr:6:in `<main>'
        8: from ./lib/jobrnr/application.rb:16:in `run'
        7: from ./lib/jobrnr/application.rb:57:in `run_with_exceptions'
        6: from ./lib/jobrnr/job/dispatch.rb:94:in `run'
        5: from ./lib/jobrnr/job/dispatch.rb:94:in `sleep'
        4: from ./lib/jobrnr/job/dispatch.rb:49:in `block in trap_ctrl_c'
        3: from ./lib/jobrnr/job/pool.rb:36:in `sigint'
        2: from ./lib/jobrnr/job/pool.rb:36:in `each'
        1: from ./lib/jobrnr/job/instance.rb:52:in `sigint'
./lib/jobrnr/job/instance.rb:52:in `kill': No such process (Errno::ESRCH)

On the first Ctrl-C, job submission was stopped and active jobs were allowed to finish. On the second Ctrl-C, a SIGINT was sent to the active jobs. The jobs terminated but, due to a bug, Jobrnr thought they were still active. On the third Ctrl-C, another SIGINT was sent to the "active" jobs. Because those processes no longer existed, the kill call raised the "No such process" (Errno::ESRCH) exception.

This is due to a combination of two issues:

  1. Nil exitstatus for signaled process (root cause)

    When a process is terminated by a signal such as SIGINT, it has no exitstatus (the exitstatus is nil). Checking pass/fail for such a process with exitstatus.zero? therefore called the zero? method on a nil value, which raised an exception (a NoMethodError).

  2. Exceptions in Futures are silent

    Futures are used to execute multiple jobs concurrently. Jobrnr polls for the completion of jobs by looking for Futures in the fulfilled state. When all Futures are fulfilled and the job queue is empty, everything is complete and Jobrnr terminates. However, if an exception occurs in a Future, the exception is local to the Future's thread (i.e. it is not raised in the parent thread) and the Future goes to the 'rejected' state instead. A rejected Future never becomes fulfilled, so Jobrnr kept treating the job as active and hung waiting for it.
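
The first issue can be reproduced outside Jobrnr. The following is a minimal sketch, not Jobrnr's actual code: passed? is a hypothetical helper, and a plain sleep command stands in for a job. It assumes a Unix-like system where SIGINT has its default disposition. It also shows the Errno::ESRCH from the traceback, which appears when a signal is sent to a process that has already been reaped.

```ruby
# Hypothetical helper (not from Jobrnr's source): a job passes only if
# its process exited normally with status 0.
def passed?(status)
  # For a signaled process, exited? is false and exitstatus is nil,
  # so guard before calling zero?.
  status.exited? && status.exitstatus.zero?
end

# Start a stand-in "job", interrupt it, and collect its status.
pid = Process.spawn("sleep", "5")
Process.kill("INT", pid)
_, status = Process.wait2(pid)

p status.exitstatus   # => nil (killed by a signal, no exit code)
p status.signaled?    # => true
p passed?(status)     # => false

# Signaling the same pid again fails because the process is gone.
esrch_raised = false
begin
  Process.kill("INT", pid)
rescue Errno::ESRCH => e
  esrch_raised = true
  warn "second SIGINT failed: #{e.message}"
end
```

Calling status.exitstatus.zero? directly on this status would raise NoMethodError, which is the root cause described in item 1.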

The fix for this should address both the root cause and the hang due to silent exceptions.
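
One way to avoid the hang is to treat a rejected Future as complete and report its exception. The sketch below illustrates the idea using only the Ruby standard library. TinyFuture is a hypothetical stand-in that mimics the pending/fulfilled/rejected states; it is not the Future class Jobrnr uses, and the polling loop is an assumption about the general shape of the fix rather than the actual change.

```ruby
# TinyFuture is a hypothetical stand-in for a real future class. It only
# mimics the pending/fulfilled/rejected states described in this issue.
class TinyFuture
  attr_reader :state, :value, :reason

  def initialize(&block)
    @state = :pending
    @thread = Thread.new do
      begin
        @value = block.call
        @state = :fulfilled
      rescue StandardError => e
        # The exception never reaches the parent thread, so record it
        # here instead of losing it.
        @reason = e
        @state = :rejected
      end
    end
  end

  # Complete means either outcome, not just success.
  def complete?
    @state != :pending
  end
end

futures = [
  TinyFuture.new { 1 + 1 },
  TinyFuture.new { nil.zero? } # raises NoMethodError, like exitstatus.zero? on nil
]

# Polling only for :fulfilled would never finish because of the second future.
sleep 0.01 until futures.all?(&:complete?)

failed = futures.select { |f| f.state == :rejected }
failed.each { |f| warn "job failed: #{f.reason.class}: #{f.reason.message}" }
```

Surfacing the recorded reason in the parent thread makes the failure visible instead of leaving the job stuck in the "active" set.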