pharo-project / pharo-vm

Hi

ce69c3e introduced a call to interruptAIOPoll in signalSemaphoreWithIndex. The problem is that it can deadlock when signalling semaphores from multiple threads.

pharo-vm/extracted/vm/src/common/sqExternalSemaphores.c

Line 195 in ce69c3e

interruptAIOPoll();

signalSemaphoreWithIndex should be safe to use from multiple threads and even signal the same semaphore:

/* Signal the external semaphore with the given index.  Answer non-zero on
 * success, zero otherwise.  This function is (should be) thread-safe;
 * multiple threads may attempt to signal the same semaphore without error.
 * An index of zero should be and is silently ignored.
 */

I couldn't find an issue for the original change. What was the intention for the change?

The implementation in opensmalltalk-vm doesn't call interruptAIOPoll. This is because interruptAIOPoll is not thread safe and shouldn't be used in signalSemaphoreWithIndex.

Cheers

@tesonep @guillep @Ducasse

Would it be possible to give us any feedback?

In one of our sub projects we hit this problem consistently.
We have a fix but we do not understand the reasoning of the original change.

Hi, we will have a look and come back to you, now that the week with only 2 working days is over.

Thx.

Hi, the reason for having this call is to allow really long idle periods.
Idle periods are interrupted by a socket / file operation or by the signalling of a semaphore.
In which case are you having issues with the implementation of interruptAIOPoll? write is thread-safe (it does not guarantee order, but we don't care; we just want to put some data in the pipe so that the poll / select is interrupted).
It was added to support a long-idle VM. What scenario are you running into, and how do you arrive at it? We are using it with multiple threads signalling and are not having the issue, e.g., FFI callbacks in an idle VM (a VM that is waiting for a really long time in the relinquishProcessor primitive).
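
To illustrate the mechanism described above, here is a minimal sketch of waking a thread blocked in poll by writing one byte to a pipe from another thread. It is not the VM's code: `wake_fds`, `interrupt_poll`, `waker_thread` and `wait_until_woken` are hypothetical names chosen for this example, and the 50 ms delay only makes it likely that the poll is already blocking when the write happens.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <pthread.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-in for the VM's interrupt pipe:
 * [0] is the read end that the idle thread polls, [1] is the write end. */
static int wake_fds[2];

/* Any thread may call this. The byte's value and ordering do not matter;
 * it only has to make the read end readable so that poll() returns. */
static void interrupt_poll(void) {
    char byte = 1;
    if (write(wake_fds[1], &byte, 1) != 1) {
        /* A real implementation would handle EAGAIN / EINTR here. */
    }
}

static void *waker_thread(void *arg) {
    struct timespec delay = {0, 50 * 1000 * 1000}; /* 50 ms */
    (void)arg;
    nanosleep(&delay, NULL); /* give the other thread time to block in poll() */
    interrupt_poll();
    return NULL;
}

/* Returns 1 if poll() was woken through the pipe, 0 on timeout, -1 on error. */
int wait_until_woken(int timeout_ms) {
    pthread_t waker;
    struct pollfd pfd;
    char drain[16];
    int ready;

    if (pipe(wake_fds) != 0) return -1;
    if (pthread_create(&waker, NULL, waker_thread, NULL) != 0) return -1;

    pfd.fd = wake_fds[0];
    pfd.events = POLLIN;
    pfd.revents = 0;
    ready = poll(&pfd, 1, timeout_ms); /* the long "idle" wait */

    pthread_join(waker, NULL);
    if (ready > 0 && read(wake_fds[0], drain, sizeof drain) < 0) ready = -1;
    close(wake_fds[0]);
    close(wake_fds[1]);
    return ready > 0 ? 1 : ready;
}
```

Calling `wait_until_woken(5000)` should return well before the 5 second timeout, because the write from the second thread makes the read end readable.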

Have you checked that the problem is not a timing issue in your code that signals the semaphore? This implementation resumes the execution of the VM thread faster than the older one. If you remove this mechanism, the VM thread will not resume immediately; it will resume only after the relinquishProcessor time has elapsed.

Again, if you can provide us with an example we can look into the issue, but this change has been in place since 2019 and we have not seen problems using it in multithreaded applications.

Hi @tesonep
Thanks for the answer. interruptAIOPoll on osx has the following line:

interruptFIFOMutex->wait(interruptFIFOMutex)

signalSemaphoreWithIndex will never return if interruptFIFOMutex is not signalled.
When signalSemaphoreWithIndex is called in parallel from both VM and another thread it deadlocks on interruptFIFOMutex:

I don't see any code that can be interrupted before the semaphore is signalled.
All the exclusion zones use only variables.
Do you have changes to the event handling mechanism of the VM?

Hi @tesonep
It can be reproduced in any Pharo since 2019. I have created a repo with a minimal reproducible example:
https://github.com/syrel/pharo-vm-804

There are just a few steps (see the Readme.md) and you are good to go.
Pharo 10, 11, 12, 13, all deadlock in a similar way.


Hi, thanks for the example. I have reproduced it and I understand what is happening.
The problem is in the interaction with signals.

As the signal handler signals a Pharo Semaphore, it passes through the code that is protected by the mutex.
The event handling also uses the same mutex.
As the signal handler is executed in the current thread (on OSX) or in the main thread (on Linux), it might interrupt the event handling process.
If the interruption happens inside the mutually exclusive zone, the semaphore is not signalled. As the implementation of semaphores is not reentrant (it does not allow waiting again in the same thread), it will block forever.
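
The re-entry described above can be sketched in isolation. This is only an illustration, not the VM's code: `fifo_lock` is a hypothetical stand-in for interruptFIFOMutex, and `signal_path` is called directly instead of from a real signal handler, because pthread_mutex_lock is not async-signal-safe. An error-checking mutex is used so that the second lock attempt reports the problem instead of hanging; a plain non-reentrant lock would block forever at that point.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the mutex shared by the event handling code
 * and the semaphore signalling path. */
static pthread_mutex_t fifo_lock;

/* Stand-in for the path a signal handler takes when it signals a semaphore:
 * it needs the same lock as the event handling code. */
static int signal_path(void) {
    int rc = pthread_mutex_lock(&fifo_lock);
    if (rc == 0) pthread_mutex_unlock(&fifo_lock);
    return rc;
}

/* Returns the result of trying to take the lock a second time on the
 * thread that already holds it. */
int reentry_result(void) {
    pthread_mutexattr_t attr;
    int rc;

    pthread_mutexattr_init(&attr);
    /* ERRORCHECK makes a same-thread relock fail with EDEADLK instead of
     * blocking, which keeps this demonstration from hanging. */
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&fifo_lock, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&fifo_lock);   /* event handling enters the exclusive zone */
    rc = signal_path();               /* the "signal" arrives on the same thread */
    pthread_mutex_unlock(&fifo_lock);

    pthread_mutex_destroy(&fifo_lock);
    return rc;
}
```

Here `reentry_result()` returns `EDEADLK`, which is the condition that shows up as a permanent block when the lock has no such check.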

I would recommend changing the way signals are used to communicate with Pharo external semaphores or callbacks.
Making it work requires that signals are handled at safe points of the VM; the OSUnixProcess Plugin is not intended for that and might fail when the signal is used frequently.
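
One common way to handle signals at safe points, sketched here under assumptions and not taken from the VM or the plugin, is to let the handler only record that a signal arrived and to do the real work later from a point where no lock is held. The names `pending_signal`, `on_signal`, `check_pending_at_safe_point` and `run_safe_point_demo` are hypothetical, and the counter stands in for signalling an external semaphore.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler, consumed at a safe point. sig_atomic_t is the only
 * object type the C standard guarantees a handler may write safely. */
static volatile sig_atomic_t pending_signal = 0;

/* Stand-in for "signal the external semaphore". */
static int semaphores_signalled = 0;

/* The handler takes no locks and calls nothing that could block. */
static void on_signal(int signo) {
    (void)signo;
    pending_signal = 1;
}

/* Called from the main loop at a point where no lock is held. */
static void check_pending_at_safe_point(void) {
    if (pending_signal) {
        pending_signal = 0;
        semaphores_signalled++;
    }
}

/* Returns how many times the deferred work ran, or -1 on setup failure. */
int run_safe_point_demo(void) {
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0) return -1;

    semaphores_signalled = 0;
    raise(SIGUSR1);                 /* the handler only sets the flag */
    check_pending_at_safe_point();  /* the real work happens here */
    return semaphores_signalled;
}
```

With this split, the handler can interrupt the exclusive zone without ever trying to take the lock itself.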

Sadly, this issue is not a priority for us, so I will not provide a solution in the short term.
If this is urgent, please feel free to submit a PR or contact us to use time through the support of the Consortium (members' engineering time, or a custom contract if that is not enough).

Hi @tesonep
Thank you for the explanation 👍 I think we can close this

Thanks for the example. It helped in understanding the problem.

Deadlock when signalling external semaphores