shikokuchuo / mirai

mirai - Minimalist Async Evaluation Framework for R

Home Page: https://shikokuchuo.net/mirai/

Allow daemons to exit immediately when the connection terminates?

wlandau opened this issue

If a daemon is running a long task and daemons(n = 0L) is called, the daemon continues to run the task and does not exit until the task is finished. A reprex is below.

library(mirai)
library(ps)
url <- "ws://127.0.0.1:5700"
daemons(n = 1L, url = url)
#> [1] 1
tasks <- replicate(2, mirai(Sys.sleep(100)))
launch_local(url = url)
Sys.sleep(2)
daemons(n = 0L)
#> [1] 0
# Check the daemon PID on e.g. htop.

As you explained in #86, this behavior is desirable in many situations: e.g. if a task has intermediate checkpoints and needs a chance to log them. In targets pipelines, however, the extra execution time is not useful because the metadata cannot be logged if the central R process is interrupted. In addition, I worry about mounting costs on expensive cloud services. So I think it would be useful to be able to configure the daemons to exit as soon as the TCP connections are broken. cf. wlandau/crew#141

the extra execution time is not useful because the metadata cannot be logged if the central R process is interrupted. In addition, I worry about mounting costs on expensive cloud services.

The more I reflect on leftover daemons, the more trouble I anticipate for a typical targets user. A user will often start a pipeline, notice a problem, interrupt the pipeline, and repeat. Any lingering daemons from the first attempt will conflict with newer daemons from subsequent attempts, and all of these daemons may race one another to write their data. The user may end up with the wrong output files: either obsolete data from a canceled pipeline attempt, or the mangled result of multiple concurrent processes writing to the same path.

To ensure they get the correct data, users need to manually clean up dangling daemons before starting new ones. Few people know how to do this, and although I am trying to help via wlandau/crew.aws.batch#2, none of my own workarounds can be fully automated.

It might be possible to allow this through an argument to daemon(). That way crew or targets can set it if desired. I think I'll be constrained by what is allowed in package space though, meaning probably raising SIGINT to signal an interrupt. This will cause R evaluation to stop, but uninterruptible C code may still continue until a user interrupt check is performed. Hence, your other monitoring utilities will still likely be useful in case they need to perform 'hard' shutdowns.
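For intuition, here is a minimal hedged sketch of what raising SIGINT on the daemon's own process looks like at the R level; the real mechanism operates at the C level when the pipe event fires, so this is only an analogy, not mirai's implementation.

# Analogy only: send SIGINT to the current process; R surfaces it as an
# interrupt condition at the next interrupt check, stopping R evaluation.
tools::pskill(Sys.getpid(), tools::SIGINT)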

It might be possible to allow this through an argument to daemon(). That way crew or targets can set it if desired.

That would be such a huge help as far as safety is concerned!

I think I'll be constrained by what is allowed in package space though, meaning probably raising SIGINT to signal an interrupt.

Are you referring to the clause from https://cran.r-project.org/doc/manuals/R-exts.html#Writing-portable-packages?

Under no circumstances should your compiled code ever call abort or exit: these terminate the user’s R process, quite possibly losing all unsaved work.

I think daemon() is a legitimate exception. In this case, losing all unsaved work is the goal. For an automated pipeline, if the dispatcher exits, then all unsaved work from the daemon is moot at best, harmful at worst.

Also, when they say "the user’s R process", I think they envision a single interactive local R session with a backlog of unsaved work from multiple projects. By contrast, a daemon is external and limited in scope.

This will cause R evaluation to stop, but uninterruptible C code may still continue until a user interrupt check is performed.

Yes, I suppose not every block of C code calls R_CheckUserInterrupt().

callr deliberately uses SIGKILL on its own child processes: r-lib/callr#56 (comment), r-lib/callr#56 (comment). SIGKILL can terminate even C code, and I consider any precedent from callr to be trustworthy, reputable, and CRAN-safe.
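For reference, a minimal sketch of the callr pattern being cited (this assumes the callr::r_bg() interface; as I understand it, the process object's kill() method sends SIGKILL on Unix, which terminates the child even while it is inside C code):

library(callr)
# Start a disposable background R process running a long task.
rp <- r_bg(function() Sys.sleep(100))
rp$is_alive()  # TRUE while the task runs
rp$kill()      # forceful termination (SIGKILL on Unix)
rp$is_alive()  # FALSE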

Thanks for the link, that’s really quite interesting! It led me to do more thinking:

First, I think you are familiar with how pipe events work, but just to recap: we are not actually sending a signal to tell the daemons to exit (that would not be universally reliable for every type of remote connection); rather, each daemon is set up to raise the signal on itself when its connection is broken.

With this in mind, callr isn’t really equivalent as it is narrower in scope – it only launches background R processes locally. It would probably be safe in this case for the child processes to be killed.

With the open architecture design of mirai, however, I can't expect daemon() to always run in such a 'disposable' background process. Even crew calls daemon() through its own function, and it is reasonable to expect that similar setups may install an interrupt handler (in the simplest form a tryCatch(…, interrupt = ), for example), perhaps for logging, perhaps for other reasons (re-spawning being an obvious one).
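As a hedged illustration of that kind of wrapper (hypothetical code, not crew's implementation), an interrupt handler around daemon() might log the event and re-spawn:

# Hypothetical wrapper: catch the interrupt raised on disconnect, log it,
# and re-spawn the daemon a limited number of times.
run_daemon <- function(url, attempts = 3L) {
  for (i in seq_len(attempts)) {
    tryCatch(
      mirai::daemon(url),
      interrupt = function(cnd) {
        message(sprintf("daemon interrupted (attempt %d); re-spawning", i))
      }
    )
  }
}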

At the end of the day, we are only talking about the duration of a single .Call() that a signal may not reach immediately (if the interrupt happens to arrive within that call), and it has always been recommended practice for long-running C functions to call R_CheckUserInterrupt() at intervals in any case.

So for something like targets I think it’s going to be more than enough – at least for the intended purpose. Thinking through the consequences, if it were the case that such a non-interrupted call caused some data to be wrongly overwritten, then forcefully killing that process mid-write could result in more serious harm including data corruption. SIGINT is much safer in this regard in at least allowing a graceful exit.

Thinking through the consequences, if it were the case that such a non-interrupted call caused some data to be wrongly overwritten, then forcefully killing that process mid-write could result in more serious harm including data corruption. SIGINT is much safer in this regard in at least allowing a graceful exit.

I see what you mean. SIGINT allows cleanup to happen and ensures that output storage is not left in a corrupted or unusable state. In that sense, SIGINT is a more graceful and robust solution.

On the other hand, I do not think an abrupt SIGKILL is really such a disaster. If a data file is corrupted, all one needs to do is rerun the original code that produces it. (And targets will probably skip everything else, in which case fixing the data will consume almost no resources.)

In contrast, runaway processes are truly catastrophic. On the cloud, they could rapidly burn through tens of thousands of dollars and tank the cloud budgets of entire departments (mine included). The probability of disaster is not small: poorly-written C/C++ code is prevalent in statistical modeling packages, and most of my colleagues who use targets are not skilled in computing. And if a disaster does happen, the cleanup is long and arduous.

I work with hundreds of clinical statisticians, and for almost everyone, their skill and awareness do not extend beyond the local R interpreter and RStudio environment. Most of them use targets to run large, complex, custom-coded simulations to design and optimize clinical trials. Each targets pipeline uses anywhere from 100 to 200 persistent workers concurrently. At the moment, we use an on-prem SGE cluster. In this environment, it is trivially easy to view cluster jobs with qstat and terminate them all with a single call to qdel. However, almost none of my colleagues understand HPC enough to realize they should do this, and almost none of them know how to directly engage with the cluster (or the Linux terminal for that matter).

On top of that, we are trying to move to a cloud platform with workers on AWS Batch. Last Friday when I implemented job management utilities in crew.aws.batch, I learned that there is no wholesale way to terminate a group of Batch jobs. To manually terminate a pipeline of 100-200 workers, one would need to go through each one individually and send a separate TerminateJob HTTP API request. Even for someone who knows what they are doing, it could easily take the better part of an hour to put out a fire.
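To give a sense of the per-job cleanup involved, here is a hedged sketch; it assumes the paws.compute Batch client is available and uses hypothetical job IDs, and it is not crew.aws.batch's implementation:

# Assumed interface: paws.compute's Batch client; job_ids is a hypothetical
# character vector of AWS Batch job IDs, one per worker.
batch <- paws.compute::batch()
job_ids <- c("job-id-1", "job-id-2")
for (id in job_ids) {
  batch$terminate_job(jobId = id, reason = "terminating runaway crew worker")
}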

I saw your work at shikokuchuo/nanonext#25, and I am super grateful. SIGINT does get us most of the way there. But given the differences in our use cases and priorities, I wonder if it would be possible to let the user choose the signal type and make SIGINT the default.

Note that in my last reply I said "for the intended purpose".

So for something like targets I think it’s going to be more than enough – at least for the intended purpose.

To make it absolutely clear, the intended purpose here is to stop mirai evaluation, to address your initial ask:

A user will often start a pipeline, notice a problem, interrupt the pipeline, and repeat.

If you are concerned about expenses on cloud servers or HPC resources not being released, then the primary tool for addressing failure must be an external monitoring solution, using a method that is officially sanctioned via the cluster manager or API call etc.

Anything implemented here may help to minimise the chances of the situations you identify, but it cannot and should not be thought of as a failsafe, as there are many other reasons that processes may hang.

Having said that, I think your last idea is a good one.

SIGINT isn't the only legitimate signal that can be sent. A sophisticated consumer may install custom handlers to be able to differentiate pipe additions or removals from a user interrupt (to pause actions until a reconnect for example). In such a case, it would be up to the user to supply a signal to send and there would not be a default.

I need time to think on the implementation, but leave this with me.

This capability has been implemented in nanonext, and will be exposed by allowing a signal to be supplied to the 'autoexit' argument of daemon().
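Once released, usage might look something like the following (a minimal sketch based on the description above; the exact values accepted by 'autoexit' are whatever the released daemon() documents):

# Sketch: a daemon that raises SIGINT on itself when its connection to the
# host is broken, so a running task is interrupted promptly.
mirai::daemon("ws://127.0.0.1:5700", autoexit = tools::SIGINT)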

@shikokuchuo, I can't thank you enough! Using only the tools available on the host platform, it is almost impossible to automatically control runaway processes in the general case, especially the cloud. Every bit of help from mirai and nanonext goes such a long way.

I tested this functionality and built it into the development version of crew. Everything works super well.

You're welcome!

An update: as I was testing just now, I found out that tools::SIGKILL is actually NA on Windows. That led me to do some more digging on signals, and I learned that sending SIGKILL to a parent process can leave zombie child processes behind. Zombies would defeat the whole purpose of my original preference for SIGKILL, so I am moving away from it. Instead, crew now prefers SIGTERM because it expresses the intention to kill the process more explicitly than SIGINT does. If that signal is not defined on the user's platform, crew falls back to SIGQUIT, and if SIGQUIT is not defined either, it uses SIGINT as a last resort. I don't think I will expose the choice of signal to crew users.
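The fallback order described above could be expressed roughly as follows (a sketch of the stated preference, not necessarily crew's exact code):

# Pick the first signal that is defined (non-NA) on this platform:
# SIGTERM, then SIGQUIT, then SIGINT as a last resort.
candidates <- c(tools::SIGTERM, tools::SIGQUIT, tools::SIGINT)
signal <- candidates[!is.na(candidates)][1L]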

It appears SIGTERM can terminate nanonext::until(nanonext::cv(), 600 * 1e3) even though SIGINT cannot, which is a relief.