dask / dask-docker

Docker images for dask

Home Page: https://hub.docker.com/u/daskdev

OSError: Timed out trying to connect to 'tcp://scheduler:8786' after 10 s

hcorrada opened this issue

Testing a docker-compose setup with one scheduler, one worker, and one client

docker-compose.yml:

version: "3.1"

services:
  scheduler:
    image: daskdev/dask
    hostname: dask-scheduler
    ports:
      - "8786:8786"
      - "8787:8787"
    command: ["dask-scheduler"]

  worker:
    image: daskdev/dask
    hostname: dask-worker
    command: ["dask-worker", "tcp://scheduler:8786"]

  client:
    build: client
    environment:
      - DASK_SCHEDULER_ADDRESS=scheduler:8786
    command: ["python", "script.py"]

client Dockerfile

FROM python:3.8-slim

ENV VIRTUAL_ENV=/opt/env
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

WORKDIR /app

COPY ./requirements.txt /app/requirements.txt
# Install gcc, then remove the apt package lists to keep the layer small
RUN apt-get update \
    && apt-get install -y gcc \
    && rm -rf /var/lib/apt/lists/*

# --no-cache-dir keeps pip's download cache out of the image
RUN pip install --no-cache-dir -r /app/requirements.txt

COPY . /app/

client script

import os
from dask.distributed import Client

dask_scheduler = os.getenv("DASK_SCHEDULER_ADDRESS")
cl = Client(dask_scheduler)
print(cl)

repo available here: https://github.com/hcorrada/test-dask

What happened:
Client could not connect to scheduler:

Traceback (most recent call last):
  File "/opt/env/lib/python3.8/site-packages/distributed/comm/core.py", line 313, in connect
    _raise(error)
  File "/opt/env/lib/python3.8/site-packages/distributed/comm/core.py", line 266, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://scheduler:8786' after 10 s: connect() didn't finish in time

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "script.py", line 22, in <module>
    cl = Client(dask_scheduler)
  File "/opt/env/lib/python3.8/site-packages/distributed/client.py", line 744, in __init__
    self.start(timeout=timeout)
  File "/opt/env/lib/python3.8/site-packages/distributed/client.py", line 949, in start
    sync(self.loop, self._start, **kwargs)
  File "/opt/env/lib/python3.8/site-packages/distributed/utils.py", line 339, in sync
    raise exc.with_traceback(tb)
  File "/opt/env/lib/python3.8/site-packages/distributed/utils.py", line 323, in f
    result[0] = yield future
  File "/opt/env/lib/python3.8/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/env/lib/python3.8/site-packages/distributed/client.py", line 1046, in _start
    await self._ensure_connected(timeout=timeout)
  File "/opt/env/lib/python3.8/site-packages/distributed/client.py", line 1103, in _ensure_connected
    comm = await connect(
  File "/opt/env/lib/python3.8/site-packages/distributed/comm/core.py", line 325, in connect
    _raise(error)
  File "/opt/env/lib/python3.8/site-packages/distributed/comm/core.py", line 266, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://scheduler:8786' after 10 s: Timed out trying to connect to 'tcp://scheduler:8786' after 10 s: connect() didn't finish in time

What you expected to happen:

Client to connect

Minimal Complete Verifiable Example:

above

Environment:

  • Docker version: 19.03.12, build 48a66213fe
  • Operating System: macOS, Darwin Kernel Version 19.5.0
  • Install method (conda, pip, source): pip (via docker)

I updated the repo to show that:
(a) netcat on the client instance can open a connection
(b) the Python socket module in the client script can also open a connection

https://github.com/hcorrada/test-dask
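The socket check in (b) could look roughly like the sketch below. It is not the code from the linked repo; `can_connect` is a hypothetical helper that uses only the Python standard library. If it succeeds while `Client` still times out, the problem is likely above the TCP layer rather than in Docker networking.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a plain TCP connection to host:port opens within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures
        return False
```

From inside the client container, `can_connect("scheduler", 8786)` would test the same address that the Dask client is trying to reach.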

This seems relevant, but I could not figure out a solution from it:
dask/distributed#2504

Thanks for raising this @hcorrada.

My first guess would be that you've set the address to scheduler:8786 but configured the hostname of the scheduler to be dask-scheduler, so the correct address would be dask-scheduler:8786.
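One way to test that guess from inside the client container is to check which names actually resolve. Compose makes the service name (`scheduler`) resolvable on its default network; whether the `hostname:` value (`dask-scheduler`) also resolves from other containers is worth checking rather than assuming. A minimal sketch using only the standard library, where `resolvable_names` is a hypothetical helper:

```python
import socket


def resolvable_names(candidates):
    """Map each candidate hostname to the sorted IPs it resolves to ([] if none)."""
    results = {}
    for name in candidates:
        try:
            infos = socket.getaddrinfo(name, None)
            # Each entry's sockaddr tuple starts with the IP address
            results[name] = sorted({info[4][0] for info in infos})
        except OSError:
            # socket.gaierror (a subclass of OSError) signals a failed lookup
            results[name] = []
    return results
```

Calling `resolvable_names(["scheduler", "dask-scheduler"])` in the client container shows which of the two names is usable in the scheduler address.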

Some other points:

  • The client, scheduler and workers should have the same conda environment with the same package versions (this matters less for the scheduler, but is still worth doing for debugging). So instead of basing your client image on python:3.8-slim, I would recommend using daskdev/dask.
  • Dask loads config from environment variables, so setting DASK_SCHEDULER_ADDRESS is enough. You do not need to use os.getenv to grab the value and pass it to the Client.
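To confirm what the second point describes, the sketch below looks up the address Dask has picked up from the environment. `resolved_scheduler_address` is a hypothetical helper; it assumes that `DASK_SCHEDULER_ADDRESS` maps to the `scheduler-address` config key, and the `ImportError` branch exists only so the sketch still runs where Dask is not installed.

```python
import os


def resolved_scheduler_address():
    """Return the scheduler address Dask would pick up from its configuration."""
    try:
        import dask
    except ImportError:
        # Fallback so the sketch runs without Dask: read the variable directly
        return os.environ.get("DASK_SCHEDULER_ADDRESS")
    # Re-read the environment, then look up the key that Client() consults
    # when it is created without an explicit address
    dask.config.refresh()
    return dask.config.get("scheduler-address", None)
```

If this returns `scheduler:8786` in the client container, the script can create the client with a bare `Client()` and drop the `os.getenv` call.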

Thanks @jacobtomlinson. I followed the docker-compose.yml file on the repo:

command: ["dask-worker", "tcp://scheduler:8786"]

so the name mismatch would be a problem there as well.

Our setup works if we build dask-docker/base locally, so perhaps it is an issue with the Docker Hub images?

We generally only use that docker-compose.yml file for building images in CI. We run docker-compose build on the repo, which builds the base image and then the higher-level images in one go.

I must confess I didn't actually realise you had linked to our own file when you said you were using docker compose to start a cluster. We should definitely fix that. Do you have any interest in testing the change and raising a PR to correct the names?

@hcorrada Were you able to get this to work? I'm having the same issue. My worker container can't even connect to the scheduler. I tried changing the URL to match the hostname, but that didn't work.

The Dask versions in the client (built from our Dockerfile) and the scheduler (the pulled image) were not the same. Once we forced them to match, it worked.
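A version mismatch like this can be spotted by collecting package versions in each container and comparing them. The sketch below is one possible way to do that with the standard library only. `installed_versions` and `find_version_mismatches` are hypothetical helpers, and the version numbers in the usage note are illustrative, not the ones from this issue.

```python
from importlib import metadata


def installed_versions(packages=("dask", "distributed")):
    """Return {package: version or None} for the current Python environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_version_mismatches(versions_by_role):
    """Return {package: {role: version}} for packages whose versions differ."""
    packages = set()
    for pkgs in versions_by_role.values():
        packages.update(pkgs)
    mismatches = {}
    for package in sorted(packages):
        seen = {role: pkgs.get(package) for role, pkgs in versions_by_role.items()}
        # More than one distinct version across roles means a mismatch
        if len(set(seen.values())) > 1:
            mismatches[package] = seen
    return mismatches
```

Running `installed_versions()` in each container and passing the results in as, for example, `{"client": {"dask": "2.20.0"}, "scheduler": {"dask": "2.21.0"}}` would report `dask` as mismatched. Once a client can connect, `Client.get_versions(check=True)` in distributed performs a comparable check across the client, scheduler and workers.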