TCP communication error: CancelledError
mens-artis opened this issue
mens-artis commented
I ran the basics example 7.
It seems that there is a TCP error at the end (which is expected) when the receiving end of the connection hangs up. This may have been introduced by the fix for `[WARNING][dask_runner.py:135] No workers are available. This could mean workers crashed.`, i.e.:
try:
    self._client.wait_for_workers(n_workers=1, timeout=1200)
except asyncio.exceptions.TimeoutError as error:
    logger.debug(
        f"No worker could be scheduled in time after {self._worker_timeout}s on the cluster. "
        "Try increasing `worker_timeout`."
    )
    raise error
The problem is that this handling may obscure real problems.
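For illustration, here is a minimal sketch (not the actual SMAC code) of how the handler could report only the worker-startup timeout while letting connection failures such as OSError or CommClosedError propagate unchanged; the function name wait_for_first_worker and the warning-level log are assumptions of this sketch:

import asyncio
import logging

from dask.distributed import Client

logger = logging.getLogger(__name__)


def wait_for_first_worker(client: Client, worker_timeout: float = 1200) -> None:
    try:
        client.wait_for_workers(n_workers=1, timeout=worker_timeout)
    except asyncio.exceptions.TimeoutError:
        # Only the "no worker showed up in time" case is reported here.
        logger.warning(
            f"No worker could be scheduled in time after {worker_timeout}s on the cluster. "
            "Try increasing `worker_timeout`."
        )
        raise
    # Anything else, e.g. an OSError from an unreachable scheduler, is deliberately
    # not caught, so real connection problems stay visible to the caller.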
Traceback (most recent call last):
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 491, in connect
stream = await self.client.connect(
File "/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/tcpclient.py", line 282, in connect
af, addr, stream = await connector.start(connect_timeout=timeout)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/username/.python3.10.6/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/core.py", line 343, in connect
comm = await wait_for(
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/home/username/.python3.10.6/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/username/SMAC3_wrapped/example7.py", line 119, in <module>
incumbent = smac.optimize()
File "/home/username/.python3.10.6/lib/python3.10/site-packages/smac/facade/abstract_facade.py", line 319, in optimize
incumbents = self._optimizer.optimize(data_to_scatter=data_to_scatter)
File "/home/username/.python3.10.6/lib/python3.10/site-packages/smac/main/smbo.py", line 304, in optimize
self._runner.submit_trial(trial_info=trial_info, **dask_data_to_scatter)
File "/home/username/.python3.10.6/lib/python3.10/site-packages/smac/runner/dask_runner.py", line 135, in submit_trial
if self.count_available_workers() <= 0:
File "/home/username/.python3.10.6/lib/python3.10/site-packages/smac/runner/dask_runner.py", line 182, in count_available_workers
return sum(self._client.nthreads().values()) - len(self._pending_trials)
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/client.py", line 3919, in nthreads
return self.sync(self.scheduler.ncores, workers=workers, **kwargs)
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/utils.py", line 359, in sync
return sync(
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/utils.py", line 426, in sync
raise exc.with_traceback(tb)
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/utils.py", line 399, in f
result = yield future
File "/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/gen.py", line 767, in run
value = future.result()
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/core.py", line 1344, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/core.py", line 1626, in connect
return connect_attempt.result()
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/core.py", line 1516, in _connect
comm = await connect(
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/core.py", line 369, in connect
raise OSError(
OSError: Timed out trying to connect to tcp://....166.214:34417 after 30 s
msg: {'status': 'OK', 'time': 1694708544.0040095, 'heartbeat-interval': 0.5}
msg: {'status': 'OK', 'time': 1694708550.0110245, 'heartbeat-interval': 0.5}
With or without the sleep(6), there is a disconnect.
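As a side note, here is a hedged sketch of how the worker-count query that fails in the traceback above could fail more loudly when the scheduler is gone; count_available_workers is a standalone illustration here (not SMAC's dask_runner.py), and wrapping the error in a RuntimeError is an assumption:

from dask.distributed import Client


def count_available_workers(client: Client, n_pending_trials: int) -> int:
    # client.nthreads() queries the scheduler over TCP; if the scheduler is already
    # gone (e.g. the SLURM job was cancelled), distributed raises OSError after its
    # connect timeout, as in the traceback above.
    try:
        nthreads = client.nthreads()
    except OSError as exc:
        raise RuntimeError(
            "Could not reach the Dask scheduler; the cluster may have shut down."
        ) from exc
    return sum(nthreads.values()) - n_pending_trials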
err:
cat tmp/smac_dask_slurm/dask-worker-9277642.err
cpu-bind=MASK - dlcgpu26, task 0 0 [25985]: mask 0x8000000080 set
2023-09-14 18:53:42,082 - distributed.nanny - INFO - Start Nanny at: 'tcp://....166.186:34089'
2023-09-14 18:53:42,662 - distributed.worker - INFO - Start worker at: tcp://....166.186:38039
2023-09-14 18:53:42,662 - distributed.worker - INFO - Listening to: tcp://....166.186:38039
2023-09-14 18:53:42,662 - distributed.worker - INFO - Worker name: SLURMCluster-3
2023-09-14 18:53:42,662 - distributed.worker - INFO - dashboard at: ....166.186:46501
2023-09-14 18:53:42,662 - distributed.worker - INFO - Waiting to connect to: tcp://....166.214:33319
2023-09-14 18:53:42,662 - distributed.worker - INFO - -------------------------------------------------
2023-09-14 18:53:42,662 - distributed.worker - INFO - Threads: 1
2023-09-14 18:53:42,662 - distributed.worker - INFO - Memory: 0.93 GiB
2023-09-14 18:53:42,662 - distributed.worker - INFO - Local Directory: /tmp/dask-scratch-space/worker-ls7s1wou
2023-09-14 18:53:42,662 - distributed.worker - INFO - -------------------------------------------------
2023-09-14 18:53:42,669 - distributed.worker - INFO - Starting Worker plugin shuffle
2023-09-14 18:53:42,670 - distributed.worker - INFO - Registered to: tcp://....166.214:33319
2023-09-14 18:53:42,670 - distributed.worker - INFO - -------------------------------------------------
2023-09-14 18:53:42,670 - distributed.core - INFO - Starting established connection to tcp://....166.214:33319
slurmstepd-dlcgpu26: error: *** JOB 9277642 ON dlcgpu26 CANCELLED AT 2023-09-14T18:54:25 ***
2023-09-14 18:54:25,422 - distributed.worker - INFO - Stopping worker at tcp://....166.186:38039. Reason: scheduler-close
2023-09-14 18:54:25,423 - distributed.batched - INFO - Batched Comm Closed <TCP (closed) Worker->Scheduler local=tcp://....166.186:54362 remote=tcp://....166.214:33319>
Traceback (most recent call last):
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 317, in write
raise StreamClosedError()
tornado.iostream.StreamClosedError: Stream is closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
nbytes = yield coro
File "/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/gen.py", line 767, in run
value = future.result()
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 328, in write
convert_stream_closed_error(self, e)
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 143, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Worker->Scheduler local=tcp://....166.186:54362 remote=tcp://....166.214:33319>: Stream is closed
2023-09-14 18:54:25,431 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://....166.186:34089'. Reason: scheduler-close
2023-09-14 18:54:25,442 - distributed.core - INFO - Received 'close-stream' from tcp://....166.214:33319; closing.
2023-09-14 18:54:25,443 - distributed.nanny - INFO - Worker closed
out:
(venv3106) username@node1:$ cat tmp/smac_dask_slurm/dask-worker-9277642.out
/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/core.py : trying for another 29.999988079071045 seconds
/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/tcpclient.py-- timeout = None
msg: {'compression': None, 'python': (3, 10, 6), 'pickle-protocol': 5}
msg: {'op': 'register_nanny', 'address': 'tcp://....166.186:34089'}
mens-artis commented
I wonder if it was always based on Dask.
Alexander Tornede commented
Once again, thanks for reporting this. We will look into it.