parsl hangs when interchange fails to start
benclifford opened this issue
Describe the bug
When the interchange fails to start, parsl will hang.
In the desc development branch, this particular error arises because of address="localhost" in my local config, which became invalid as of PR #2828 (see trace below), but the hang applies to any interchange startup failure.
2023-07-20 08:31:59.529 interchange:128 HTEX-Interchange(53526) MainThread __init__ [DEBUG] Initializing Interchange process
2023-07-20 08:31:59.529 interchange:134 HTEX-Interchange(53526) MainThread __init__ [INFO] Attempting connection to client at 127.0.0.1 on ports: 55785,55909,55898
2023-07-20 08:31:59.530 interchange:147 HTEX-Interchange(53526) MainThread __init__ [INFO] Connected to client
2023-07-20 08:31:59.530 interchange:31 HTEX-Interchange(53526) MainThread wrapped [ERROR] Exceptional ending for starter on thread MainThread
Traceback (most recent call last):
File "/home/benc/parsl/src/parsl/parsl/process_loggers.py", line 27, in wrapped
r = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/benc/parsl/src/parsl/parsl/executors/high_throughput/interchange.py", line 638, in starter
ic = Interchange(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/benc/parsl/src/parsl/parsl/executors/high_throughput/interchange.py", line 171, in __init__
self.worker_task_port = self.task_outgoing.bind_to_random_port(f"tcp://{self.interchange_address}",
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/benc/parsl/virtualenv-3.11/lib/python3.11/site-packages/zmq/sugar/socket.py", line 498, in bind_to_random_port
self.bind(f'{addr}:{port}')
File "/home/benc/parsl/virtualenv-3.11/lib/python3.11/site-packages/zmq/sugar/socket.py", line 302, in bind
super().bind(addr)
File "zmq/backend/cython/socket.pyx", line 564, in zmq.backend.cython.socket.Socket.bind
File "zmq/backend/cython/checkrc.pxd", line 28, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: No such device (addr='tcp://localhost:54518')
To Reproduce
parsl master c39700b plus this patch, then run pytest.
--- a/parsl/executors/high_throughput/interchange.py
+++ b/parsl/executors/high_throughput/interchange.py
@@ -614,6 +614,7 @@ def starter(comm_q, *args, **kwargs):
     The executor is expected to call this function. The args, kwargs match that of the Interchange.__init__
     """
+    raise RuntimeError("Deliberate hang")
     setproctitle("parsl: HTEX interchange")
     # logger = multiprocessing.get_logger()
Expected behavior
If the interchange doesn't start properly (or if it dies any time later on during execution), this should be a halting terminal error for htex, not a hang.
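To illustrate the expected behavior, here is a minimal sketch of a parent that polls a startup queue while also checking whether the child process is still alive, so a dead child becomes an exception rather than a hang. This is not parsl's actual code or a proposed patch: `wait_for_startup`, `failing_starter`, `healthy_starter`, the port numbers and the timeout values are all hypothetical names and values chosen for the example, and it assumes a POSIX system because it uses the "fork" start method.

```python
import multiprocessing
import queue
import time


def failing_starter(comm_q):
    """Hypothetical stand-in for a child that dies before reporting its ports."""
    raise RuntimeError("Deliberate failure")


def healthy_starter(comm_q):
    """Hypothetical stand-in for a child that starts and reports two ports."""
    comm_q.put((54000, 54001))


def wait_for_startup(target, poll_interval=0.1, timeout=10.0):
    """Run `target` in a child process and wait for its startup message.

    Raises RuntimeError if the child exits without sending a message, and
    TimeoutError if it stays alive but silent for longer than `timeout`.
    """
    # Assumption: "fork" is available (POSIX only).
    ctx = multiprocessing.get_context("fork")
    comm_q = ctx.Queue()
    proc = ctx.Process(target=target, args=(comm_q,), daemon=True)
    proc.start()

    deadline = time.monotonic() + timeout
    while True:
        try:
            # Block only briefly so liveness can be re-checked regularly.
            return comm_q.get(timeout=poll_interval)
        except queue.Empty:
            pass

        if not proc.is_alive():
            # The child may have sent its message just before exiting,
            # so look once more before declaring failure.
            try:
                return comm_q.get(timeout=poll_interval)
            except queue.Empty:
                raise RuntimeError(
                    f"child exited with code {proc.exitcode} "
                    "before reporting startup"
                ) from None

        if time.monotonic() >= deadline:
            proc.terminate()
            raise TimeoutError("child did not report startup in time")
```

With this shape, `wait_for_startup(failing_starter)` raises `RuntimeError` shortly after the child dies, while `wait_for_startup(healthy_starter)` returns the reported ports. The same liveness check could in principle be repeated later during execution to catch an interchange that dies after startup.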
Environment
my laptop
Hi, I got the same issue when using v2023.8.7. Reverting back to v1.2.0 solved my problem.