Parsl / parsl

Parsl - a Python parallel scripting library

Home Page: http://parsl-project.org

parsl hangs when interchange fails to start

benclifford opened this issue

Describe the bug

When the interchange fails to start, parsl will hang.

On the desc development branch, this particular error arises because my local config sets address="localhost", which became invalid as of PR #2828 (see the trace below). The hang itself, however, applies to any interchange startup failure.

2023-07-20 08:31:59.529 interchange:128 HTEX-Interchange(53526) MainThread __init__ [DEBUG]  Initializing Interchange process
2023-07-20 08:31:59.529 interchange:134 HTEX-Interchange(53526) MainThread __init__ [INFO]  Attempting connection to client at 127.0.0.1 on ports: 55785,55909,55898
2023-07-20 08:31:59.530 interchange:147 HTEX-Interchange(53526) MainThread __init__ [INFO]  Connected to client
2023-07-20 08:31:59.530 interchange:31 HTEX-Interchange(53526) MainThread wrapped [ERROR]  Exceptional ending for starter on thread MainThread
Traceback (most recent call last):
  File "/home/benc/parsl/src/parsl/parsl/process_loggers.py", line 27, in wrapped
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/benc/parsl/src/parsl/parsl/executors/high_throughput/interchange.py", line 638, in starter
    ic = Interchange(*args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benc/parsl/src/parsl/parsl/executors/high_throughput/interchange.py", line 171, in __init__
    self.worker_task_port = self.task_outgoing.bind_to_random_port(f"tcp://{self.interchange_address}",
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benc/parsl/virtualenv-3.11/lib/python3.11/site-packages/zmq/sugar/socket.py", line 498, in bind_to_random_port
    self.bind(f'{addr}:{port}')
  File "/home/benc/parsl/virtualenv-3.11/lib/python3.11/site-packages/zmq/sugar/socket.py", line 302, in bind
    super().bind(addr)
  File "zmq/backend/cython/socket.pyx", line 564, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 28, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: No such device (addr='tcp://localhost:54518')

To Reproduce

Apply this patch to parsl master (c39700b), then run pytest.

--- a/parsl/executors/high_throughput/interchange.py
+++ b/parsl/executors/high_throughput/interchange.py
@@ -614,6 +614,7 @@ def starter(comm_q, *args, **kwargs):
 
     The executor is expected to call this function. The args, kwargs match that of the Interchange.__init__
     """
+    raise RuntimeError("Deliberate hang")
     setproctitle("parsl: HTEX interchange")
     # logger = multiprocessing.get_logger()

Expected behavior
If the interchange doesn't start properly (or if it dies any time later on during execution), this should be a halting terminal error for htex, not a hang.
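
The expected behavior can be illustrated with a minimal sketch, which is not Parsl's actual implementation. It assumes the parent waits on a queue for the child process to report its ports, and shows how checking child liveness while waiting turns a startup failure into an error instead of a hang. The names `failing_starter`, `healthy_starter`, and `wait_for_startup`, the port numbers, and the timeout values are all hypothetical.

```python
import multiprocessing
import queue

# Assumption: the "fork" start method (POSIX only) is used so the sketch
# does not depend on the target functions being picklable.
_ctx = multiprocessing.get_context("fork")


def failing_starter(comm_q):
    # Hypothetical stand-in for a starter that dies before reporting anything.
    raise RuntimeError("Deliberate startup failure")


def healthy_starter(comm_q):
    # Hypothetical stand-in that reports two made-up port numbers.
    comm_q.put((54000, 54001))


def wait_for_startup(target, poll_interval=0.1, max_wait=10.0):
    """Run `target` in a child process and return what it puts on the queue.

    Raises RuntimeError if the child exits without reporting, or if nothing
    arrives within roughly `max_wait` seconds, rather than blocking forever.
    """
    comm_q = _ctx.Queue()
    proc = _ctx.Process(target=target, args=(comm_q,), daemon=True)
    proc.start()

    waited = 0.0
    while waited < max_wait:
        try:
            return comm_q.get(timeout=poll_interval)
        except queue.Empty:
            waited += poll_interval
        if not proc.is_alive():
            # The child has exited; make one last attempt in case its
            # message arrived just after the previous get() timed out.
            try:
                return comm_q.get(timeout=poll_interval)
            except queue.Empty:
                break

    if proc.is_alive():
        proc.terminate()
        raise RuntimeError(f"child did not report within {max_wait} seconds")
    raise RuntimeError(f"child exited with code {proc.exitcode} before reporting")
```

With this pattern, `wait_for_startup(failing_starter)` raises a `RuntimeError` shortly after the child dies, whereas an unconditional blocking `get()` on the queue would wait indefinitely. The same liveness check could in principle be repeated periodically during execution to cover the case where the child dies later on.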

Environment
my laptop

Hi, I got the same issue when using v2023.8.7. Reverting back to v1.2.0 solved my problem.