yarpmanager gets (almost) stuck in case the connection to yarpserver is delayed
S-Dafarra opened this issue · comments
Describe the bug
We have been largely exploiting yarpmanager
to run demos involving several applications running on different PCs using different OS. So far, everything is great. Some difficulty arises when yarpserver
is placed in a very delayed network. In this case, clicking the wrong button might render the system unusable. To give an example, if we press the refresh all button, it is not possible to do anything else in the application until all the applications, connections, and resources have been checked. The problem is that in case of a delayed network, this operation might require minutes. The problem gets worst if you try to close the application that got stuck as the entire yarpmanager
gets completely stuck
To Reproduce
One possible way to reproduce it is to set yarp conf
to an IP not reachable, and then open yarpmanager
without running yarpserver
. Then, open the application with the largest number of applications and press on the refresh
all button. Then, try to close the blocked application
Expected behavior
When attempting to run another action while another one is running, yarpmanager
should simply stop the previous one. At the same time, if the user wants to close the open application, it should close without blocking the entire yarpmanager
.
Screenshots
In the following I just tried to open an application and then I try to close it immediately after. In this specific case, the issue is that yarpserver
is not running, but the same happens when yarpserver
is in a delayed network
YARP.module.manager.2024-04-16.18-26-33.mp4
Configuration (please complete the following information):
- OS: Ubuntu/Windows
- yarp version: 3.9.0
- compiler:
Additional context
Add any other context about the problem here.
I dug a bit into the code. When pressing the "refresh all" button the following code is executed
yarp/src/yarpmanager/src-manager/mainwindow.cpp
Lines 759 to 775 in e4762b0
which in turn calls
yarp/src/yarpmanager/src-manager/applicationviewwidget.cpp
Lines 1710 to 1717 in e4762b0
and then
yarp/src/yarpmanager/src-manager/applicationviewwidget.cpp
Lines 1353 to 1406 in e4762b0
The juicy bit is
yarp/src/yarpmanager/src-manager/applicationviewwidget.cpp
Lines 1398 to 1400 in e4762b0
that spawns a new thread
yarp/src/yarpmanager/src-manager/safe_manager.cpp
Lines 467 to 484 in e4762b0
After this, safeManger
appears busy
until the action is done. This means that in all subsequent actions, the lines
yarp/src/yarpmanager/src-manager/applicationviewwidget.cpp
Lines 1355 to 1357 in e4762b0
will always return
false
. This appears the reason why it is not possible to do anything else on the application until the previous action is done.
Instead, when closing the tab, the following code is executed:
yarp/src/yarpmanager/src-manager/mainwindow.cpp
Lines 795 to 864 in e4762b0
which calls
yarp/src/yarpmanager/src-manager/applicationviewwidget.cpp
Lines 2393 to 2395 in e4762b0
and then it calls
yarp/src/yarpmanager/src-manager/safe_manager.cpp
Lines 24 to 29 in e4762b0
This gets stuck in
since it is blocking until the separate thread finishes the
run
method.Since this is happening in the GUI thread, the whole application gets stuck and the OS tags it as "Not Responding"
A possible idea
In the end, the main issue is that the run
method is blocking until everything has been done
yarp/src/yarpmanager/src-manager/safe_manager.cpp
Lines 95 to 381 in e4762b0
One possibility would be to use a flag (an atomic boolean for example) to interrupt prematurely the execution of the different for loops to make sure that no action causes the entire yarpmanager
to get stuck.
This is a well known behaviour but it is exremily difficult to improve. Checking if a connection is active or not, involves the establishment of multiple (4?) back and forth communications with the server ( the same of 'yarp ping' command') each of them requiring a ping time. So, if the network ping time is 100ms, each connection to verify might require 4x100ms. If you have 10 connections, these are 4 seconds of delay. No escape. The obvious solution is to keep your ping time to yarp server in the 10ms range or run the yarpserver elsewhere. Ps: fixing the network delay is your preferred option because a responsive yarp server with delayed communication is very bad too (and not equally detactable).
Checking if a connection is active or not, involves the establishment of multiple (4?) back and forth communications with the server ( the same of 'yarp ping' command') each of them requiring a ping time.
Is it possible to prematurely interrupt a ping without waiting for the timeout time?
Regarding the possibility of interrupting an ongoing operation, instead, I think that this is certainly doable (between two subsequent checks).
Regarding the possibility of interrupting an ongoing operation, instead, I think that this is certainly doable (between two subsequent checks).
Yeah exactly, that's what I had in mind. In principle, also in case of connection checks, the maximum waiting time should be 2 seconds, so it might still be a reasonable time to wait.
Checking if a connection is active or not, involves the establishment of multiple (4?) back and forth communications with the server ( the same of 'yarp ping' command') each of them requiring a ping time.
Is it possible to prematurely interrupt a ping without waiting for the timeout time?
No, this is not possible.
Checking if a connection is active or not, involves the establishment of multiple (4?) back and forth communications with the server ( the same of 'yarp ping' command') each of them requiring a ping time.
Is it possible to prematurely interrupt a ping without waiting for the timeout time?
No, this is not possible.
I had a further look on this. The timeout value seems to be used here
yarp/src/libYARP_os/src/yarp/os/Network.cpp
Lines 753 to 766 in e4762b0
which sets an internal value that is get in
yarp/src/libYARP_os/src/yarp/os/impl/SocketTwoWayStream.cpp
Lines 48 to 51 in e4762b0
So in the end, the timeout value is given directly to ACE. By reading the documentation of connect
, I noticed that it returns a different error when it cannot perform a connection given the timeout (namely EWOULDBLOCK
). At the same time, it is possible to call the complete
method, that is exactly meant to continue a connection attempt that timed out.
This means that in principle it may be possible to use a fraction of the TIMEOUT value when calling connect
and then call complete
several times until the user-specified timeout value is reached. In principle, this could give us control to stop this process at an earlier stage.