rapidsai / ucx-py

Python bindings for UCX

Home Page: https://ucx-py.readthedocs.io/en/latest/

Confusing arm_worker error during DAG failure in a UCX enabled Dask-CUDA cluster

randerzander opened this issue

With a single-node dask-cuda cluster configured to use UCX (NVLink enabled, InfiniBand disabled), when a Dask DAG fails (likely due to an out-of-memory error), the error message I receive is:

sys:1: RuntimeWarning: coroutine 'BlockingMode._arm_worker' was never awaited
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Task was destroyed but it is pending!
task: <Task cancelling name='Task-412494' coro=<BlockingMode._arm_worker() running at /home/rgelhausen/conda/envs/dsql-1-20/lib/python3.8/site-packages/ucp/continuous_ucx_progress.py:88>>
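
For context, here is a minimal sketch of how both messages above can be produced in plain asyncio, with no UCX-Py or Dask involved. The `BlockingMode` class and its `_arm_worker` body are hypothetical stand-ins that only borrow the names from the log; they are not the UCX-Py implementation. The sketch assumes the task was scheduled and cancelled but the event loop went away before the task ever ran, which may or may not be what happens in this cluster.

```python
import asyncio
import gc
import warnings


class BlockingMode:
    """Hypothetical stand-in that only borrows the names from the log above."""

    async def _arm_worker(self):
        # Placeholder body; the real UCX-Py coroutine does different work.
        await asyncio.sleep(0)


def reproduce_shutdown_messages():
    """Return (handler_messages, warning_messages) from an unclean shutdown."""
    handler_messages = []
    loop = asyncio.new_event_loop()
    # Record what asyncio would normally log through its exception handler.
    loop.set_exception_handler(
        lambda _loop, context: handler_messages.append(context["message"])
    )

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # Schedule the coroutine as a task, but never let the loop run it.
        task = loop.create_task(BlockingMode()._arm_worker())
        task.cancel()  # the task is now "cancelling", yet still pending
        loop.close()   # drops the scheduled first step of the task
        del task       # last reference gone: the pending task is destroyed
        gc.collect()   # make collection explicit on non-refcounting runtimes

    warning_messages = [str(w.message) for w in caught]
    return handler_messages, warning_messages


if __name__ == "__main__":
    handler_messages, warning_messages = reproduce_shutdown_messages()
    print(handler_messages)
    print(warning_messages)
```

If this assumption holds, both messages are side effects of an event loop being torn down while the task was still pending. They would not be the root cause of the DAG failure, which would explain why they are confusing to see as the only output.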

Env details:

(dsql-1-20) rgelhausen@rl-dgx2-r13-u7-rapids-dgx201:~/shared/gpu-bdb/gpu_bdb/cluster_configuration$ conda list | grep ucx
ucx                       1.12.0+gd367332      cuda11.2_0    rapidsai-nightly
ucx-proc                  1.0.0                       gpu    rapidsai-nightly
ucx-py                    0.24.0a220120   py38_gd367332_26    rapidsai-nightly
(dsql-1-20) rgelhausen@rl-dgx2-r13-u7-rapids-dgx201:~/shared/gpu-bdb/gpu_bdb/cluster_configuration$ conda list | grep dask
dask                      2022.1.0+10.gc1c88f06          pypi_0    pypi
dask-cudf                 22.2.0a0+300.g12a0f596e5          pypi_0    pypi
dask-glm                  0.2.0                    pypi_0    pypi
dask-labextension         5.2.0              pyhd8ed1ab_0    conda-forge
dask-ml                   2021.11.31.dev2+g1e811ce4          pypi_0    pypi
dask-sql                  2021.12.1.dev34+g736f264          pypi_0    pypi