Confusing arm_worker error during DAG failure in a UCX-enabled Dask-CUDA cluster
randerzander opened this issue
Randy Gelhausen commented
On a single-node Dask-CUDA cluster configured to use UCX (NVLink enabled, IB disabled), when a Dask DAG fails (likely due to OOM), the error message I receive is:
sys:1: RuntimeWarning: coroutine 'BlockingMode._arm_worker' was never awaited
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Task was destroyed but it is pending!
task: <Task cancelling name='Task-412494' coro=<BlockingMode._arm_worker() running at /home/rgelhausen/conda/envs/dsql-1-20/lib/python3.8/site-packages/ucp/continuous_ucx_progress.py:88>>
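For context, the "Task was destroyed but it is pending!" line is a generic asyncio report rather than anything UCX-specific: asyncio emits it when a Task object is destroyed while it is still pending. When the coroutine never got to run, Python normally also warns that it was never awaited. The snippet below is a minimal sketch that triggers the same report with the standard library only. `arm_worker_stand_in` and `destroy_pending_task` are hypothetical names made up for illustration; this is not UCX-Py code and not a diagnosis of what happens inside `BlockingMode._arm_worker`.

```python
import asyncio
import gc


async def arm_worker_stand_in():
    # Hypothetical stand-in for a background progress coroutine; not UCX-Py code.
    await asyncio.sleep(3600)


def destroy_pending_task():
    """Schedule a task, close the loop before it runs, and return asyncio's reports."""
    messages = []
    loop = asyncio.new_event_loop()
    # Collect asyncio's reports instead of letting the default handler log them.
    loop.set_exception_handler(
        lambda _loop, context: messages.append(context.get("message"))
    )

    task = loop.create_task(arm_worker_stand_in())  # scheduled, but never run
    loop.close()  # closing drops the scheduled callbacks, so the task never starts

    del task      # last reference gone: asyncio reports the pending task
    gc.collect()  # in case the task is only reachable through a reference cycle
    return messages


if __name__ == "__main__":
    for message in destroy_pending_task():
        print(message)
```

If the same mechanism is at play here, the message would be a side effect of the event loop being torn down after the failure, which is why it says nothing about the underlying cause such as the suspected OOM.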
Env details:
(dsql-1-20) rgelhausen@rl-dgx2-r13-u7-rapids-dgx201:~/shared/gpu-bdb/gpu_bdb/cluster_configuration$ conda list | grep ucx
ucx 1.12.0+gd367332 cuda11.2_0 rapidsai-nightly
ucx-proc 1.0.0 gpu rapidsai-nightly
ucx-py 0.24.0a220120 py38_gd367332_26 rapidsai-nightly
(dsql-1-20) rgelhausen@rl-dgx2-r13-u7-rapids-dgx201:~/shared/gpu-bdb/gpu_bdb/cluster_configuration$ conda list | grep dask
dask 2022.1.0+10.gc1c88f06 pypi_0 pypi
dask-cudf 22.2.0a0+300.g12a0f596e5 pypi_0 pypi
dask-glm 0.2.0 pypi_0 pypi
dask-labextension 5.2.0 pyhd8ed1ab_0 conda-forge
dask-ml 2021.11.31.dev2+g1e811ce4 pypi_0 pypi
dask-sql 2021.12.1.dev34+g736f264 pypi_0 pypi