Unable to build a Jupyter Image with GPU support

Question

altruistcoder opened this issue 2 years ago · comments

Hello,

I have been trying to build a GPU-enabled Jupyter image with TensorFlow support, following this blog post:

https://cschranz.medium.com/set-up-your-own-gpu-based-jupyterlab-e0d45fcacf43

For your reference, I have also attached the Dockerfile I am using to build the image:
Dockerfile.txt

However, I get the error below when I try to build the image:

I also tried to build a similar PyTorch image, but that got stuck at the same point.
I have tried several things to get rid of this error but haven't been successful yet.

Can you please try to help me resolve this error and build the image successfully?

Michael Pilosov · Answer 1 · Thu Nov 17 2022 04:42:50 GMT+0800 (China Standard Time)

oh well that's fun... a conda env error it seems. some of the most fun to resolve.

I haven't built the image in about a month or two but before I dive into debugging, can you please provide a little more information?

are you able to build the unmodified version of the image?
can you please paste the error logs you're seeing from your modified image? the screenshot gives me a hint but I'd need to go up the stacktrace to really verify the above hypothesis.

Michael Pilosov · Answer 2 · Thu Nov 17 2022 04:44:37 GMT+0800 (China Standard Time)

it would also help to paste the output of diff <original dockerfile> <your dockerfile>

a note from personal experience:
I generally build my custom images by pulling from a working base image (such as the ones we publish) and then making modifications on top of that (i.e. a brand new Dockerfile suited for your project), and would suggest trying that as a debugging step to see if it's something that changed with the package environments in the original docker image or if your modifications are what's causing the error.

Rishabh Aggarwal · Answer 3 · Thu Nov 17 2022 21:56:55 GMT+0800 (China Standard Time)

Hello @mathematicalmichael

First of all, thanks a lot for the quick response.

Yes, I am able to build an image using the unmodified version of the Dockerfile.

I have attached the logs of the complete error I am getting while building my modified image below:
logs.txt

The output of the command diff <original dockerfile> <your dockerfile> is also given below:

I did consider your suggestion of adding my custom changes on top of the image built from the default Dockerfile that your shell script generates. The problem is that I need to create separate images for PyTorch and TensorFlow, each with a specific version.

The default Dockerfile generated by your shell script uses the base image nvidia/cuda:11.6.2-cudnn8-runtime-ubuntu20.04, which ships CUDA 11.6. However, I need to build images for the latest 3 versions of TensorFlow and PyTorch, and according to their docs each version requires specific versions of CUDA and cuDNN, as is clear from the following table provided by TensorFlow here:

This is the reason I cannot build on top of the image created using default Dockerfile and I need to update the base image right in the original Dockerfile itself.
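
As an illustration of that approach, the base image can be exposed as a build argument so that one Dockerfile covers several CUDA and TensorFlow combinations. This is only a minimal sketch, not the Dockerfile generated by the project's script: `BASE_IMAGE` and `TF_VERSION` are hypothetical argument names, the default version is simply one that appears elsewhere in this thread, and the image installs plain TensorFlow without the Jupyter stack.

```dockerfile
# Hypothetical build argument: the CUDA/cuDNN base image to build on.
# An ARG declared before FROM is only visible to the FROM instruction.
ARG BASE_IMAGE=nvidia/cuda:11.6.2-cudnn8-runtime-ubuntu20.04
FROM ${BASE_IMAGE}

# Hypothetical build argument: the TensorFlow version to pin.
# It is declared after FROM so that later instructions can read it.
ARG TF_VERSION=2.8.2

# The CUDA base images do not ship pip, so install it from the distribution.
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Install exactly the requested TensorFlow version.
RUN python3 -m pip install --no-cache-dir "tensorflow==${TF_VERSION}"
```

A different pairing could then be selected at build time, for example with `docker build --build-arg BASE_IMAGE=nvidia/cuda:11.7.1-cudnn8-devel-ubuntu22.04 --build-arg TF_VERSION=<version> .`, where the matching version has to be taken from the compatibility table.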

Christoph · Answer 4 · Thu Nov 17 2022 22:19:52 GMT+0800 (China Standard Time)

@altruistcoder
Actually, Tensorflow was working with CUDA 11.6 and officially supported this version. It is very annoying that Tensorflow is removing the compatibility. Have you already tested TF with the original Dockerfile?

The Conda installation is also part of the sub-repository (https://github.com/jupyter/docker-stacks). You could test older versions of docker-stacks that may be better suited to the base image. To do this, use the --commit argument to specify a particular docker-stacks git commit.
There are also multiple pre-built images for CUDA 11.2, which you can find here: https://hub.docker.com/r/cschranz/gpu-jupyter/tags
Based on them, you can install your specific versions, or put everything in your own Dockerfile, like:

FROM cschranz/gpu-jupyter:v1.4_cuda-11.0_ubuntu-18.04_python-only
RUN pip install --upgrade pip && \
    pip install --no-cache-dir "tensorflow==2.8.2"

Rishabh Aggarwal · Answer 5 · Thu Nov 17 2022 23:13:24 GMT+0800 (China Standard Time)

Hello @ChristophSchranz,

Thanks for providing support to solve this issue.

First of all, I would like your opinion on whether this error is primarily caused by the base image I am using or by the extra Python dependencies that I am installing.

I am asking because I had a similar Dockerfile in which I was trying to install PyTorch 1.13 along with these extra dependencies (without TensorFlow) using the base image nvidia/cuda:11.7.1-cudnn8-devel-ubuntu22.04, and there I got the same error.
On top of that, I tried to build two images without GPU support, using ubuntu:20.04 and ubuntu:22.04 as base images respectively, and they still gave the same error.

Do all the latest versions of TensorFlow work with CUDA 11.6?

Also, can you please tell me where I have to specify the --commit flag and the git commit for the docker-stacks repository?

I could definitely use your pre-built image for CUDA 11.2, but I believe it has both TensorFlow and PyTorch installed, which is not what I need right now. I could of course uninstall one of them, but I was trying to find a setup that also works for all future versions.

Michael Pilosov · Answer 6 · Fri Nov 18 2022 03:49:40 GMT+0800 (China Standard Time)

quick input re: your question after parsing the log you added.

I think it's most likely because of the environment you're trying to set up. there are some conflicts it's unhappy about.
one thing to try is potentially switching conda out for micromamba, which generally resolves environments more ... gracefully... I'd just test it in a minimal docker image with your environment to see what's going on with resolution (could just be that you need to downgrade to a different python for example).

Using an older base image should also result in an older python, so you'd get similar information from trying what Christoph suggested.
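
A minimal test image along those lines might look like the sketch below. It assumes the publicly available `mambaorg/micromamba` image and a file named `environment.yaml` next to the Dockerfile; that file name is hypothetical and stands for whatever dependency list you want to check.

```dockerfile
# Minimal image used only to check whether the environment can be resolved.
FROM mambaorg/micromamba:latest

# environment.yaml is a hypothetical file holding your own dependency list.
# MAMBA_USER is the non-root user defined by the base image.
COPY --chown=$MAMBA_USER:$MAMBA_USER environment.yaml /tmp/environment.yaml

# Resolve and install into the base environment; if the solver finds
# conflicts, the build should stop here and print them.
RUN micromamba install -y -n base -f /tmp/environment.yaml && \
    micromamba clean --all --yes
```

Because the image contains nothing but the environment, a failure at the `micromamba install` step points at the dependency set rather than at the CUDA base image or the Jupyter layers.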

Michael Pilosov · Answer 7 · Fri Nov 18 2022 03:51:29 GMT+0800 (China Standard Time)

also, regarding your use case: you say you don't need torch and tflow installed in the same image. I dunno how you're running your setup but the docker-stacks are particularly helpful when using a docker-based jupyterhub (e.g. https://github.com/ml-starter-packs/jupyterhub-deploy-docker), where you can have multiple servers running simultaneously in isolated workspaces and have a dropdown menu of images to select from when spawning each one.
So if you're looking for something browser-based to support a multitude of projects each with their own dependency set, I would highly suggest checking out jupyterhub in lieu of spawning the images yourself from command-line.

Rishabh Aggarwal · Answer 8 · Tue Nov 29 2022 16:48:31 GMT+0800 (China Standard Time)

Hello All,

Just to provide an update on this issue: I have been able to find the root cause.
The error was coming primarily from the extra Python dependencies that I install while creating the notebook image. Specifically, installing the seldon-core module was causing it.
I don't know why this particular package causes the issue, because it used to work just fine. Maybe an updated version of the package is responsible.
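
One way to narrow down a failure like this is to give each suspect package its own `RUN` step on top of a working image, so the build log shows exactly which install breaks. The sketch below is only illustrative: it reuses the pre-built image tag and TensorFlow pin mentioned earlier in the thread, and it assumes seldon-core is installed with pip.

```dockerfile
# Start from an image that is known to build and run.
FROM cschranz/gpu-jupyter:v1.4_cuda-11.0_ubuntu-18.04_python-only

# Install the framework first, in its own layer.
RUN pip install --no-cache-dir "tensorflow==2.8.2"

# Install the suspect package alone (assumption: it is installed via pip).
# If the build fails at this step, this package or one of its
# dependencies is the source of the conflict.
RUN pip install --no-cache-dir seldon-core

# If a newer release is the culprit, pinning an older one is a possible fix.
# The version below is a placeholder, not a tested recommendation:
# RUN pip install --no-cache-dir "seldon-core==<known-good-version>"
```

Keeping the steps separate also lets Docker cache the layers that succeed, so only the failing install is repeated on each attempt.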

Also @mathematicalmichael, I am aware of JupyterHub and have used it for quite some time, but my current requirement is different, which is why I am trying to build individual images for TF and PyTorch with GPU support.

Nevertheless, thanks a lot for the quick support and all the help, @ChristophSchranz @mathematicalmichael