Semaphores and cancellation

Question


ljwo opened this issue 5 months ago · comments

Łukasz Wojakowski commented 5 months ago

Hi, many thanks for the work on this library that you share. This is neither a bug report nor a feature request, I just need to ask you a question.

We've been trying to (ab)use the library for slightly different purposes than designed: for parallel process control on a hardware device. And we have two scenarios that must work:

1. We need to be able to abort the execution at any point (i.e. not execute any more tasks).
2. We need to provide mutual exclusion for parts of the parallel processes (built of many tasks in taskflow's sense) that operate on the same hardware subresource.

Also, we are using a builder to build a single static taskflow to be executed, so usually bigger logical flows are composed of smaller ones wrapped as module tasks.

We first tried to handle the mutual exclusion with external semaphores, acquired and released from inside our tasks executed within taskflow, but that leads to occasional deadlocks. The scenario is this: a module task is started and the task that acquires our semaphore is executed; then, before the module task has finished, the same thread steals a second task that acquires the same semaphore. For the release task to run, the module task has to finish, but it cannot, because the thread on which it has to finish is blocked in the second acquire.
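To make the interleaving concrete, here is a minimal sketch that replays it on a single thread using only the standard library. `CountingSemaphore` and `second_acquire_would_block` are hypothetical names standing in for our external semaphore and our tasks, not Taskflow types, and `try_acquire` is used only so that the sketch reports the problem instead of hanging.

```cpp
#include <mutex>

// Hypothetical stand-in for our external semaphore (not a Taskflow type).
class CountingSemaphore {
 public:
  explicit CountingSemaphore(int count) : count_(count) {}

  // Non-blocking acquire: returns false where a blocking acquire would wait.
  bool try_acquire() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (count_ == 0) return false;
    --count_;
    return true;
  }

  void release() {
    std::lock_guard<std::mutex> lock(mutex_);
    ++count_;
  }

 private:
  std::mutex mutex_;
  int count_;
};

// Replays the problematic order of events on one worker thread.
// Returns true if the second acquire would have had to block.
bool second_acquire_would_block() {
  CountingSemaphore hardware(1);

  // The acquire task inside module A runs on this worker.
  bool first = hardware.try_acquire();

  // Module A has not finished yet, but the same worker now picks up
  // the acquire task of module B, which wants the same semaphore.
  bool second = hardware.try_acquire();

  // With a blocking acquire the worker would wait here forever: the
  // release for module A can only run after this worker returns.
  if (first) hardware.release();  // cleanup so the sketch ends cleanly
  return first && !second;
}
```

In our real setup the second acquire is a blocking call, so the worker never gets back to the point where the first module could finish and release.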

So now we are trying to use taskflow's own semaphores, in a very limited way. We only limit concurrency on module tasks, and such module tasks always do both acquire and release. And it seems to work fine, even with taskflow cancellation done as a .cancel() call on the tf::Future, but then comes my actual concern regarding this quote from the handbook:

"Cancelling a taskflow with tasks acquiring and/or releasing tf::Semaphore results is currently not supported."

I can see why it does not work in general, especially with tasks that do separate acquire or release, as the nodes that are waiting in the semaphores' waiting lists are not rescheduled and so the topology join counter does not go down to zero, which leads to deadlocks.

Our use case is much more limited, though: a single static taskflow where semaphores are both acquired and released by the same module tasks. Do you know of any scenarios that may lead to problems in such a case? Or is this fine then?

(We also tried to achieve cancellation without calling .cancel() on the tf::Future: our own task wrapper disables actual task execution and the taskflow runs to its natural end. That leads to other kinds of deadlocks on graphs with loops, because the loop exit conditions are no longer updated. We could probably try harder in this direction if you tell us that .cancel() will not work reliably even in our simple case.)
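For reference, the wrapper idea from the parenthetical above could be sketched roughly as follows, using only the standard library. All names (`AbortFlag`, `guarded`, `guarded_condition`, `run_loop`) are hypothetical and are not part of Taskflow. The sketch assumes that every loop condition has a known exit branch, so that an aborted run can be steered out of the loop instead of waiting for a condition that is no longer updated. `run_loop` only imitates how a condition task would drive a loop (branch 0 loops back, branch 1 exits).

```cpp
#include <atomic>
#include <functional>
#include <utility>

// Hypothetical abort flag shared by all wrapped tasks (ours, not Taskflow's).
struct AbortFlag {
  std::atomic<bool> requested{false};
};

// Wraps an ordinary task body: once abort is requested, the body is skipped.
inline std::function<void()> guarded(AbortFlag& flag,
                                     std::function<void()> body) {
  return [&flag, body = std::move(body)] {
    if (!flag.requested.load(std::memory_order_acquire)) body();
  };
}

// Wraps a loop condition that returns the index of the successor to run.
// Once abort is requested, it returns exit_branch without evaluating cond.
inline std::function<int()> guarded_condition(AbortFlag& flag,
                                              std::function<int()> cond,
                                              int exit_branch) {
  return [&flag, cond = std::move(cond), exit_branch] {
    if (flag.requested.load(std::memory_order_acquire)) return exit_branch;
    return cond();
  };
}

// Imitates a loop driven by a condition task: 0 loops back, 1 exits.
// The abort is requested just before iteration number abort_after
// (pass a negative value to never abort). Returns the iterations executed.
inline int run_loop(AbortFlag& flag, int abort_after, int limit) {
  int iterations = 0;
  int work_done = 0;
  auto body = guarded(flag, [&] { ++work_done; });
  auto cond = guarded_condition(
      flag, [&] { return work_done < limit ? 0 : 1; }, /*exit_branch=*/1);

  int branch = 0;
  while (branch == 0) {
    if (iterations == abort_after) {
      flag.requested.store(true, std::memory_order_release);
    }
    body();
    ++iterations;
    branch = cond();
  }
  return iterations;
}
```

Without the forced exit branch, the skipped body would never advance `work_done` and the loop would spin forever, which is the kind of hang we ran into. Whether forcing the exit branch is acceptable for every loop in a real graph is an open question for us.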