replicate / cog-stable-diffusion

Diffusers Stable Diffusion as a Cog model

Home Page: https://replicate.com/stability-ai/stable-diffusion

Cog doesn't know if CUDA is compatible with PyTorch / Docker is missing required device driver

lukestanley opened this issue

Cog warns up front that it doesn't know whether the CUDA and PyTorch versions are compatible. Then, after a lot of downloads, the run fails with: "Docker is missing required device driver".
I figured this is an issue since Cog pitches itself as:
"- 📦 Docker containers without the pain.
- 🤬️ No more CUDA hell. Cog knows which CUDA/cuDNN/PyTorch/Tensorflow/Python combos are compatible and will set it all up correctly for you."

This is my log:
cog-stable-diffusion$ sudo cog run script/download-weights hf_******************************
⚠ Cog doesn't know if CUDA 11.6.2 is compatible with PyTorch 1.12.1 --extra-index-url=https://download.pytorch.org/whl/cu116. This might cause CUDA problems.
Building Docker image from environment in cog.yaml...
[+] Building 2.0s (16/16) FINISHED
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 1.67kB 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> resolve image config for docker.io/docker/dockerfile:1.2 0.9s
=> CACHED docker-image://docker.io/docker/dockerfile:1.2@sha256:e2a8561e419ab1ba6b2fe6cbdf49fd92b95912df1cf7d313c3e2230a333fdbcc 0.0s
=> [internal] load metadata for docker.io/nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04 0.6s
=> [stage-0 1/8] FROM docker.io/nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04@sha256:55211df43bf393d3393559d5ab53283d4ebc3943d802b04 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 31.63kB 0.0s
=> CACHED [stage-0 2/8] RUN rm -f /etc/apt/sources.list.d/cuda.list && rm -f /etc/apt/sources.list.d/nvidia-ml.list && apt 0.0s
=> CACHED [stage-0 3/8] RUN --mount=type=cache,target=/var/cache/apt apt-get update -qq && apt-get install -qqy --no-install-recom 0.0s
=> CACHED [stage-0 4/8] RUN curl -s -S -L https://raw.githubusercontent.com/pyenv/pyenv-installer/master/bin/pyenv-installer | bas 0.0s
=> CACHED [stage-0 5/8] COPY .cog/tmp/build1496174735/cog-0.0.1.dev-py3-none-any.whl /tmp/cog-0.0.1.dev-py3-none-any.whl 0.0s
=> CACHED [stage-0 6/8] RUN --mount=type=cache,target=/root/.cache/pip pip install /tmp/cog-0.0.1.dev-py3-none-any.whl 0.0s
=> CACHED [stage-0 7/8] RUN --mount=type=cache,target=/root/.cache/pip pip install diffusers==0.2.4 torch==1.12.1 --extra-index- 0.0s
=> CACHED [stage-0 8/8] WORKDIR /src 0.0s
=> exporting to image 0.1s
=> => exporting layers 0.0s
=> => writing image sha256:1c81aeabd3aa4357e1eda8a0c8ea7add1172a525b025079f2361d745f88beb33 0.0s
=> => naming to docker.io/library/cog-cog-stable-diffusion-base 0.0s
=> exporting cache 0.0s
=> => preparing build cache for export 0.0s

Running 'script/download-weights hf_******************************' in Docker with the current directory mounted as a volume...
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ⅹ Docker is missing required device driver
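
The warning at the top of this log names two versions that can be compared by hand: the base image ships CUDA 11.6.2, and the PyTorch wheel index URL ends in cu116. Here is a minimal Python sketch of that comparison. It assumes the wheel tag follows a `cu<major><minor>` pattern with a single-digit minor version; `wheel_cuda_version` and `matches_image` are hypothetical helper names and are not part of Cog.

```python
import re


def wheel_cuda_version(index_url):
    """Return 'major.minor' from a wheel index URL ending in a cuXYZ tag, else None.

    Assumption: the last digit of the tag is the minor version (cu116 -> 11.6).
    """
    match = re.search(r"/cu(\d{2,3})/?$", index_url)
    if match is None:
        return None
    digits = match.group(1)
    return f"{digits[:-1]}.{digits[-1]}"


def matches_image(index_url, image_cuda):
    """Compare the wheel's CUDA major.minor with the base image's CUDA version."""
    wheel = wheel_cuda_version(index_url)
    if wheel is None:
        return False
    image_major_minor = ".".join(image_cuda.split(".")[:2])
    return wheel == image_major_minor


if __name__ == "__main__":
    # Values copied from the warning in the log above.
    url = "https://download.pytorch.org/whl/cu116"
    print(wheel_cuda_version(url), matches_image(url, "11.6.2"))
```

For the versions in this log the two agree on 11.6, which suggests the warning means Cog has no entry for this combination, not that it found a conflict.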

nvidia-smi
Wed Aug 31 14:08:41 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 33%   35C    P8     1W /  38W |      5MiB /  2002MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3517      G   /usr/lib/xorg/Xorg                  2MiB |
+-----------------------------------------------------------------------------+

docker -v
Docker version 20.10.12, build 20.10.12-0ubuntu2~20.04.1

cat /etc/issue
Ubuntu 20.04.5 LTS
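
One detail in the output above is easy to miss: the "CUDA Version" that nvidia-smi prints is the newest CUDA release the installed driver supports (11.4 here), while the image Cog built is based on CUDA 11.6.2. The following Python sketch pulls both numbers out and compares them. The helper names are hypothetical, and the comparison is only a rough signal: NVIDIA's compatibility rules are not modeled, and a gap by itself does not explain the Docker device-driver error.

```python
import re


def parse_nvidia_smi_header(text):
    """Extract (driver_version, cuda_version) from nvidia-smi's header text."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    return (
        driver.group(1) if driver else None,
        cuda.group(1) if cuda else None,
    )


def version_tuple(version):
    """Turn '11.6.2' into (11, 6, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def image_newer_than_driver(image_cuda, driver_cuda):
    """True if the image's CUDA major.minor is newer than what the driver reports."""
    return version_tuple(image_cuda)[:2] > version_tuple(driver_cuda)[:2]


if __name__ == "__main__":
    # Header line copied from the nvidia-smi output in this issue.
    header = "| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4 |"
    driver_version, driver_cuda = parse_nvidia_smi_header(header)
    print(driver_version, driver_cuda, image_newer_than_driver("11.6.2", driver_cuda))
```

If the comparison comes back true, checking for a newer driver is a reasonable next step before changing anything in cog.yaml.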

I would suggest updating your NVIDIA drivers to a newer version. I had the same issue and fiddled a bit with the CUDA/PyTorch versions, without success. Then I updated the drivers (using NVIDIA's own repo; I followed this guide for Debian) and it worked.

I'm on Ubuntu 22.04
CUDA toolkit 11.8 + cudnn-local-repo-ubuntu2204-8.7.0.84_1.0-1_amd64.deb
GPU: 3090
NVIDIA-SMI 525.60.11 Driver Version: 525.60.11 CUDA Version: 12.0
All working.

UPDATE
Now running this update:
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
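
After a reinstall like this, a quick way to see whether PyTorch actually picks up the GPU is to ask it directly. This is a small sketch, not something from the thread: `torch_cuda_report` is a hypothetical helper name, and it only uses `torch.__version__`, `torch.version.cuda` and `torch.cuda.is_available()`.

```python
def torch_cuda_report():
    """Summarize the installed PyTorch build and whether it can see a GPU."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"installed": False}
    return {
        "installed": True,
        "torch": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(torch_cuda_report())
```

If `cuda_available` is false while `built_for_cuda` is set, the build expects a GPU but cannot reach one, which points back at the driver or container setup and away from the PyTorch install.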