helfrichmichael / prusaslicer-novnc

Simple Docker container that serves Prusaslicer via noVNC in your web browser.

[FR] Container GPU passthrough

fakezeta opened this issue · comments

As per our conversation on Prusa3D Forum enable 3D acceleration with GPU passthrough.

Sorry for the mega delay. Meant to look at this over the weekend, but was swamped with other stuff.

I think https://github.com/linuxserver/docker-kasm/blob/master/Dockerfile has a lot of really useful NVidia Docker bits that I plan to learn from and adapt the current Dockerfiles for my containers.

Basically I think I missed including the NVidia Toolkit package https://github.com/NVIDIA/nvidia-container-toolkit and its dependencies.

Hoping I can poke this later after work.

This is most promising. https://hub.docker.com/r/damanikjosh/virtualgl-turbovnc

Though on this topic, I am beginning to investigate alternative noVNC solutions since the one I use has been deprecated. I might also just fork it and maintain it, but we'll see.

Okay, got it running.
Using the default image and just doing an apt install for prusa-slicer (v2.4).

https://hub.docker.com/r/damanikjosh/virtualgl-turbovnc

image

But the caveat is that to use VirtualGL, you need to run a minimal X server on your headless machine and set up virtualgl_server. Still, it looks awesome now. I think the next bit of work will be to trim this down, similar to what you have done in your Dockerfile. Edit: figured this out; we no longer need this.

Peek.2024-04-02.15-02.mp4

> This is most promising. https://hub.docker.com/r/damanikjosh/virtualgl-turbovnc
>
> Though on this topic, I am beginning to investigate alternative noVNC solutions since the one I use has been deprecated. I might also just fork it and maintain it, but we'll see.

TurboVNC / TigerVNC seems able to fit the bill.

Peek.2024-04-02.15-02.mp4

Thanks a ton for the work on this so far! It's looking super smooth for the slicing view now. Feel free to send a pull request if you'd like and I'm happy to review and merge 😄 .

> This is most promising. https://hub.docker.com/r/damanikjosh/virtualgl-turbovnc
>
> Though on this topic, I am beginning to investigate alternative noVNC solutions since the one I use has been deprecated. I might also just fork it and maintain it, but we'll see.
>
> TurboVNC / TigerVNC seems able to fit the bill.

Neither of those provides a web browser package though, correct? Ideally that's something we'd probably like to retain for the repos.

Thanks again for the work on this!

Not sure if I can do a PR against your repo; it will be all new, I think.

https://github.com/damanikjosh/virtualgl-turbovnc-docker/blob/main/Dockerfile uses a base

ARG UBUNTU_VERSION=22.04

FROM nvidia/opengl:1.2-glvnd-runtime-ubuntu${UBUNTU_VERSION}

So it's very bloated, being based on Ubuntu. However, it has all the same bits; instead of Openbox it uses another desktop environment (Lubuntu), but it has VNC and noVNC (in my video you can see it's all in a browser). I will fork this and see what I can do, but this will only work with NVIDIA GPUs, obviously.

https://gist.github.com/vajonam/d1e713bcfd47e03f27549258ef53690e <- WIP, but works for the most part; I have added some of your code. I think I should be able to submit a PR. Standby. Not too different after all; Ubuntu/Debian is still a bit bloated.

Still need to add back supervisord; will work on that next.

Okay, I have a working version with supervisor etc.; some fine tuning is needed for passing environment variables. Look for a PR shortly. This should work regardless of NVIDIA, but worst case you might have two Dockerfiles: one for NVIDIA GPU and one for CPU.

Added #15 to address this.

Thanks for the work so far. I just pulled the latest commit(s) and I am unable to run this via CLI (for unraid and similar I am making sure the templates match up and trying to figure out the migration path for this set of changes).

My guess is this is due to the supervisord.conf changes

2024-04-03 17:25:13 Error: Format string '/opt/TurboVNC/bin/vncserver %(ENV_DISPLAY)s -fg  %(ENV_VNC_SEC)s -depth 24 -geometry %(ENV_VNC_RESOLUTION)s' for 'program:vnc.command' contains names ('ENV_VNC_RESOLUTION') which cannot be expanded. Available names: ENV_DEBIAN_FRONTEND, ENV_DISPLAY, ENV_HOME, ENV_HOSTNAME, ENV_LC_CTYPE, ENV_LD_LIBRARY_PATH, ENV_LOCALFBPORT, ENV_NOVNC_PORT, ENV_NVIDIA_DRIVER_CAPABILITIES, ENV_NVIDIA_VISIBLE_DEVICES, ENV_PATH, ENV_PWD, ENV_SHLVL, ENV_SSL_CERT_FILE, ENV_SUPD_LOGLEVEL, ENV_VGLRUN, ENV_VGL_DISPLAY, ENV_VNC_PORT, ENV_VNC_SEC, group_name, here, host_node_name, numprocs, process_num, program_name in section 'program:vnc' (file: '/etc/supervisord.conf')
2024-04-03 17:25:13 For help, use /usr/bin/supervisord -h
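
For reference, supervisord resolves `%(ENV_x)s` format strings against its environment when it parses the config, so every referenced variable must exist before supervisord starts. A sketch of the failing section, reconstructed from the error message above (the actual file contents may differ):

```ini
; supervisord expands %(ENV_NAME)s when it parses the config file, so if
; VNC_RESOLUTION is unset at startup, parsing fails with the error above.
[program:vnc]
command=/opt/TurboVNC/bin/vncserver %(ENV_DISPLAY)s -fg %(ENV_VNC_SEC)s -depth 24 -geometry %(ENV_VNC_RESOLUTION)s
```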

Command I am running FWIW:

docker run --detach --volume=prusaslicer-novnc-data:/configs/ --volume=prusaslicer-novnc-prints:/prints/ -p 8080:8080 -e SSL_CERT_FILE="/etc/ssl/certs/ca-certificates.crt" --gpus all --name=prusaslicer-novnc prusaslicer-novnc

Playing with this a bit more on my end, but once it's ready for review, let me know and I can take a pass :).

this is the environment variables I am passing

    prusaslicer:
      # image: mikeah/prusaslicer-novnc
      image: cr.localdomain.com/prusa-new
      container_name: prusaslicer
      environment:
        - SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt
        - NVIDIA_VISIBLE_DEVICES=1
        - NVIDIA_DRIVER_CAPABILITIES=all
        - VGL_DISPLAY=egl
        - SUPD_LOGLEVEL=INFO # TRACE
        - VNC_RESOLUTION=1900x1200
      volumes:
        - /opt/docker/configs/prusaslicer/config:/configs
        - /opt/docker/configs/prusaslicer/prints:/prints
      restart: unless-stopped

I think you were missing VNC_RESOLUTION. It should default if not set; not sure why that is not happening, will have a look. Be sure to add all the environment variables; you should be able to pass them as -e FOO=BAR:

      environment:
        - SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt
        - NVIDIA_VISIBLE_DEVICES=1
        - NVIDIA_DRIVER_CAPABILITIES=all
        - VGL_DISPLAY=egl
        - SUPD_LOGLEVEL=INFO # TRACE
        - VNC_RESOLUTION=1900x1200

Just added an export to make sure it's defaulted if not set.
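
A minimal sketch of that kind of defaulting (the 1280x800 fallback and the entrypoint shape are assumptions, not the actual values in the PR):

```shell
#!/bin/sh
# Default VNC_RESOLUTION before supervisord starts, so that the
# %(ENV_VNC_RESOLUTION)s expansion in supervisord.conf always has a value.
# The fallback value here is a placeholder, not the real default.
export VNC_RESOLUTION="${VNC_RESOLUTION:-1280x800}"
echo "VNC_RESOLUTION=${VNC_RESOLUTION}"
# exec /usr/bin/supervisord -c /etc/supervisord.conf   # then hand off to supervisord
```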

I am assuming that you have an NVIDIA GPU in your installation. I haven't tested this without one, but the image is an NVIDIA image and needs nvidia-docker2, from what I understand.

@helfrichmichael did you get it running after the export of the param? Not sure why I forgot that. Anyhow, I had a couple of questions/suggestions.

  1. Move to GTK3? Performance seems quite good with EGL/VirtualGL acceleration; any reason you chose to stick with GTK2?
  2. We can include SuperSlicer in here too; I know you have a branch. I was thinking which slicer to launch could be selected by a runtime env variable.

> @helfrichmichael did you get it running after the export of the param? Not sure why I forgot that. Anyhow, I had a couple of questions/suggestions.
>
>   1. Move to GTK3? Performance seems quite good with EGL/VirtualGL acceleration; any reason you chose to stick with GTK2?
>   2. We can include SuperSlicer in here too; I know you have a branch. I was thinking which slicer to launch could be selected by a runtime env variable.

Yep, once the param was exported, it worked just fine (I had also tried passing it as a command line env prior to this, FWIW).

For SuperSlicer and the other slicers, I am happy to replicate this over to those once I've reviewed and merged the code, unless you have capacity to update those. No pressure either way, but this work should be a great base for GPU passthrough in these apps.

Ideally, for now, I think keeping them separate would be best, just to avoid having to provide migration paths for those on existing unraid templates etc. (I find template updates a bit nuanced to do, TBH).

The only other thing I am curious about is finding a way to allow automatic VNC resizing, as this has been immensely useful for me when I go from device to device (I have a Mimo Vue touchscreen on my desktop in the garage for the printers that is fairly low-res for easy presses). I haven't looked into how noVNC accomplishes this, but if we can solve for either autoresizing or a static size, that would be amazing.

Thanks again @vajonam , really appreciate your help and dedication on this effort.

> For SuperSlicer and the other slicers, I am happy to replicate this over to those once I've reviewed and merged the code, unless you have capacity to update those. No pressure either way, but this work should be a great base for GPU passthrough in these apps.

Excellent.

> Ideally, for now, I think keeping them separate would be best, just to avoid having to provide migration paths for those on existing unraid templates etc. (I find template updates a bit nuanced to do, TBH).

Agreed.

> The only other thing I am curious about is finding a way to allow automatic VNC resizing, as this has been immensely useful for me when I go from device to device (I have a Mimo Vue touchscreen on my desktop in the garage for the printers that is fairly low-res for easy presses). I haven't looked into how noVNC accomplishes this, but if we can solve for either autoresizing or a static size, that would be amazing.

I am not sure I understand, but it looks like it auto-resized the window. Sadly, the right panel in PrusaSlicer isn't resizable; we might have to move to a modern view.

> Thanks again @vajonam, really appreciate your help and dedication on this effort.

Yeah, no problem, you're welcome. For the most part this was driven by need: I had some complex files, a few MB in size, that the software renderer just couldn't handle when it came to 3D. This makes it awesome! The previous solution was good for the simple stuff.

To disable VirtualGL, run with VGLRUN= (empty) and you should see it switch back to the Mesa software renderer and the older performance. I will maybe change the name of the param to something like ENABLEHWGPU=true to make it more user friendly. I have been using this for the past few days to do some slicing and printing; it works really well!
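
A rough sketch of how such a toggle can work in the launcher (the variable handling and paths here are assumptions based on this thread, not the actual script):

```shell
#!/bin/sh
# If VGLRUN is unset, default to "vglrun" (VirtualGL hardware acceleration).
# If the user sets VGLRUN= (empty), the ${VAR-default} form keeps it empty,
# so the slicer starts without vglrun and falls back to Mesa software rendering.
VGLRUN="${VGLRUN-vglrun}"
SLICER="/slic3r/slic3r-dist/bin/prusa-slicer"
echo "launching: ${VGLRUN} ${SLICER}"
# exec ${VGLRUN} "${SLICER}"
```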

Oh wait. I'm just opening the wrong VNC file I think (we should adjust the default file for the HTTP server probably if we can).

http://localhost:8080/vnc_lite.html?resize=true seemed to render it flawlessly! I am having an issue opening the vnc.html file so I need to look at that.

I am going to try to review this after work so I can give this a stamp of approval.

This is awesome to see so far along!

> Oh wait. I'm just opening the wrong VNC file I think (we should adjust the default file for the HTTP server probably if we can).
>
> http://localhost:8080/vnc_lite.html?resize=true seemed to render it flawlessly! I am having an issue opening the vnc.html file so I need to look at that.
>
> I am going to try to review this after work so I can give this a stamp of approval.
>
> This is awesome to see so far along!

To account for this, I will likely make the following PR:

Dockerfile:

# Add a default file to resize, etc for noVNC.
ADD vncresize.html /usr/share/novnc/index.html

vncresize.html:

<html>
    <head>
        <script>
            window.location.replace("./vnc.html?autoconnect=true&resize=remote&reconnect=true&show_dot=true");
        </script>
    </head>
</html>

@vajonam just pushed to Docker. Successfully set it up on my unraid server with an RTX 3070. It's not picking up the GPU, though, it seems, so I need to dive into this a bit more.

Variables I set:

NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=all
ENABLEHWGPU=true

I'll keep poking at this when I have some time.

You might be missing some permissions on the host system around device access. There is a tool called vglserver_config that can help you set that up; it's part of the virtualgl package.

Does nvidia-smi -l show this line on the host?

|    1   N/A  N/A   2955483      G   /slic3r/slic3r-dist/bin/prusa-slicer         92MiB |

Just pulled your latest image, and it works nicely in my environment.

> Does nvidia-smi -l show this line on the host?
>
> |    1   N/A  N/A   2955483      G   /slic3r/slic3r-dist/bin/prusa-slicer         92MiB |

Sadly, no. I see "No running processes found" for all of the entities. In binhex-plexpass, for example, I see the GPU passthrough just fine. I can try to poke at this more after work.

This is VirtualGL passthrough, not regular GPU passthrough, which is a bit different. Let me know what you find.

For full context, here are my unraid variables surrounding GPU acceleration:
image

Additionally I tried running the container as privileged to no avail.

This is what I am using in my docker compose; maybe you need to pass VGL_DISPLAY=egl:

        - SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt
        - NVIDIA_VISIBLE_DEVICES=1
        - NVIDIA_DRIVER_CAPABILITIES=all
        - VGL_DISPLAY=egl
        - ENABLEHWGPU=true
        - SUPD_LOGLEVEL=INFO
        - VNC_RESOLUTION=1900x1200

These are important.

        - VGL_DISPLAY=egl
        - ENABLEHWGPU=true

> These are important.
>
>         - VGL_DISPLAY=egl
>         - ENABLEHWGPU=true

Confirmed VGL_DISPLAY=egl doesn't change the behavior on my end for the nvidia-smi output or the docker container.

Regarding vglserver_config are you saying I need to set this up on the host (not the docker container)?

Yes, you need it on the host to ensure the devices have the right permissions for accessing the card. All it does in this case is set up permissions on the cards and make sure the user the docker daemon runs as can access them.

> These are important.
>
>         - VGL_DISPLAY=egl
>         - ENABLEHWGPU=true
>
> Confirmed VGL_DISPLAY=egl doesn't change the behavior on my end for the nvidia-smi output or the docker container.
>
> Regarding vglserver_config, are you saying I need to set this up on the host (not the docker container)?

Assuming you set ENABLEHWGPU to true as well?

> Yes, you need it on the host to ensure the devices have the right permissions for accessing the card. All it does in this case is set up permissions on the cards and make sure the user the docker daemon runs as can access them.

Hmmm, that might add complexity for unraid, since I can't find that as a supported approach, and I believe it spins up an X server, if I'm not mistaken? I'll have to look into this after work.

> These are important.
>
>         - VGL_DISPLAY=egl
>         - ENABLEHWGPU=true
>
> Confirmed VGL_DISPLAY=egl doesn't change the behavior on my end for the nvidia-smi output or the docker container.
>
> Regarding vglserver_config, are you saying I need to set this up on the host (not the docker container)?
>
> Assuming you set ENABLEHWGPU to true as well?

Correct, I have set both of those on my template.

There is no need for an X server on the host; it just uses EGL (VirtualGL) to use the card and render into the VNC-based X server.

I got some time just now to play with this a bit more and the solution to my problems wasn't enabling anything further with VirtualGL/vglserver.

In fact, it was just adding --runtime=nvidia under "Extra Parameters", and it's working flawlessly now.
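
For anyone landing here later, the working invocation is essentially the earlier docker run command plus --runtime=nvidia and the GPU-related variables discussed in this thread (volume names, port, and image tag are from my setup; adjust them for yours):

```shell
# Earlier docker run command with --runtime=nvidia added, plus the
# GPU-related environment variables from this thread.
docker run --detach \
  --runtime=nvidia --gpus all \
  --volume=prusaslicer-novnc-data:/configs/ \
  --volume=prusaslicer-novnc-prints:/prints/ \
  -p 8080:8080 \
  -e SSL_CERT_FILE="/etc/ssl/certs/ca-certificates.crt" \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e VGL_DISPLAY=egl \
  -e ENABLEHWGPU=true \
  --name=prusaslicer-novnc prusaslicer-novnc
```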

Amazing work @vajonam!

image
image

Feel free to re-open this if anyone is experiencing issues, but I believe this is good to go 🥳 .