singularityware / singularityware.github.io

base documentation site for Singularity software

Home Page: https://singularityware.github.io


Trying to create image from Nvidia registry with token

chrisreidy opened this issue · comments

Nvidia provides Docker images of ML code. My recipe is:

Bootstrap: docker
From: nvcr.io/nvidia/tensorflow:18.01-py3
Registry: nvcr.io
IncludeCmd: yes
Username: $oauthtoken
Password: APIkeyinsertedhere
mkdir /extra
mkdir /xdisk

And I get this:
singularity build ngc.tensorflow.18-01-py2.img ngctest
Using container recipe deffile: ngctest
Sanitizing environment
Adding base Singularity environment to container
/cm/shared/uaapps/singularity/2.4/libexec/singularity/functions: line 87: [: DEBUG: integer expression expected
ERROR Unrecognized authentication challenge, exiting.
Cleaning up...
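For context, the "Unrecognized authentication challenge" error most likely means the client could not parse the WWW-Authenticate header that the registry sends back with its 401 response. As a rough sketch, this is the shape of header a docker-style registry returns and what a client extracts from it; the realm URL below is made up for illustration, not nvcr.io's actual endpoint:

```shell
# Hypothetical Bearer challenge, in the shape a docker-style registry returns
challenge='Bearer realm="https://nvcr.io/proxy_auth",service="registry",scope="repository:nvidia/tensorflow:pull"'

# Pull out the realm and service the way a registry client would,
# before requesting a token from the realm URL
realm=$(printf '%s' "$challenge" | sed -n 's/.*realm="\([^"]*\)".*/\1/p')
service=$(printf '%s' "$challenge" | sed -n 's/.*service="\([^"]*\)".*/\1/p')
echo "$realm $service"
```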

Any advice on fixing the error, and on a better-designed recipe, would be appreciated, as I plan to repeat this process many times.
Thanks in advance
Chris

First, export the credentials to the environment:

export SINGULARITY_DOCKER_USERNAME="\$oauthtoken"
export SINGULARITY_DOCKER_PASSWORD=xxxxxxxxxxxx
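Note that the NGC username is the literal string $oauthtoken, which is why the backslash is there: inside double quotes the shell would otherwise expand it as a (likely empty) variable. A quick sketch of the two quoting styles that keep it literal:

```shell
# The NGC username is the literal string $oauthtoken; both assignments keep it literal
u1="\$oauthtoken"   # escaped dollar inside double quotes
u2='$oauthtoken'    # single quotes need no escaping
echo "$u1"
```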

Many times the issue is a missing tag; for example, a reference without one defaults to latest (and minimally we should check that it exists). Then, to start even more simply, let's just get a pull working. The branch that you have to use (not merged) to reach the nvidia cloud is this one:

apptainer/singularity#1184

And then just the pull should work like:

singularity pull docker://nvcr.io/nvidia/tensorflow:18.01-py3
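On the tag-defaulting point above: when a docker-style reference has no explicit tag, clients fall back to latest. A sketch of that parsing rule, as an illustration rather than Singularity's actual code:

```shell
# Print the tag of a docker-style image reference, defaulting to latest
tag_of() {
  case "${1##*/}" in      # look only at the last path component
    *:*) printf '%s\n' "${1##*:}" ;;
    *)   printf 'latest\n' ;;
  esac
}
tag_of nvcr.io/nvidia/tensorflow            # -> latest
tag_of nvcr.io/nvidia/tensorflow:18.01-py3  # -> 18.01-py3
```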

A quick thing to try is sregistry:

pip install sregistry
sregistry pull nvidia://tensorflow:18.01-py3

And again you would need to define variables in the environment, see here:

https://singularityhub.github.io/sregistry-cli/client-nvidia

Thanks for the super prompt reply, Vanessa. I will try that.

I tried going through #1184 without much luck.
Using sregistry was the answer:
log in as user
module load python/2
module load singularity
export SREGISTRY_NVIDIA_TOKEN="apitokenhere"
sregistry pull nvidia://pytorch:18.01-py3
...
Success /home/...nvidia-pytorch:18.01-py3.simg

Thank you

Great! Note that I don't think the environment (e.g., changes to the path) is working yet; I can't get my token working to figure out the right call to get the manifest. Hopefully they will have better docs for their API soon (they are pretty... missing... lol). As a workaround you can make a build recipe with those variables defined, or just define them at runtime.

@vsoch Do you have a sample recipe for one of their Docker images, by chance? Would I need to rebuild Singularity with the jtriley:nvcr-io-registry-fixes branch? I had trouble with that.

Yep, you would need to install Singularity from that branch, meaning cloning it and then doing the whole routine. I tried to do that to peek at how the headers were being handled, but I never got past my token being denied. I don't have an example recipe, but I'd be glad to help you if you have an image in mind. The biggest issue is that there isn't a nice place where all this is shared.

Ok - that makes sense. I will first play with what I created using sregistry and see how that works. Enjoy your weekend

@vsoch I seem to have got to the point where running the container is the issue. I have created singularity images two ways:

  1. sregistry pull nvidia://tensorflow:18.01-py3
    and run it with:
    singularity exec nvidia-tensorflow:18.01-py3.simg python ~/.singularity/shub/TFlow_example.py

  2. Build singularity from jtriley with the nvcr fixes.
    then build container:
    singularity build ngc.tensorflow.18-01-py2.img ngctest
    and then run it with
    singularity exec ngc.tensorflow.18-01-py2.img python

import tensorflow as tf

In both cases I get:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
even though that library is in my path. Apparently it is not in the container's path
Progress I guess
Sigh

Did you try with the --nv flag?

so like

singularity exec --nv $(sregistry get nvidia/tensorflow:18.01-py3) python -c "import tensorflow as tf"

I don't have any of those drivers, so I can't test on my local machine; I get the same error!
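One way to tell the two situations apart is to check whether the host exposes the driver library at all before blaming the container; --nv can only bind in what the host actually has. A sketch (the check is the point; ldconfig output and library paths vary by distro):

```shell
# Does this host expose the NVIDIA driver library that --nv would bind in?
if ldconfig -p 2>/dev/null | grep -q 'libcuda\.so\.1'; then
  msg="host has libcuda.so.1; --nv can bind it"
else
  msg="no libcuda.so.1 on this host; run on a GPU node"
fi
echo "$msg"
```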

Don't give up @chrisreidy !! These are just stupid containers. With enough eyes.... all bugs are cute and squishy and you come to wave when you see them again!

And sometimes edible if you are so inclined!
Inserting --nv was the trick. Thanks
So the container pulled with sregistry was not created from a recipe, so I do not have any custom bind points and have to use a default file path.
The container built with the singularity I made from "jtriley/singularity" has to be run with that same singularity, so I cannot use it unless I install that version on the compute nodes in place of the standard one. That would be possible, since we use modules, but it is not desirable.
So I am not quite there yet

Here is the result of running the sregistry version on a GPU compute node
singularity exec --nv nvidia-tensorflow:18.01-py3.simg python /home/u13/chrisreidy/.singularity/shub/TFlow_example.py
which: no nvidia-smi in (/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin)
WARNING: Could not find the Nvidia SMI binary to bind into container
WARNING: Non existent bind point (directory) in container: '/extra'
WARNING: Non existent bind point (directory) in container: '/xdisk'
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/util/tf_should_use.py:107: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use tf.global_variables_initializer instead.
2018-02-21 16:47:19.233527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:0b:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-02-21 16:47:19.233573: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:0b:00.0, compute capability: 6.0)
0 [ 0.76190269] [-0.09601036]
20 [ 0.29777145] [ 0.19185154]
40 [ 0.15783976] [ 0.26837116]
60 [ 0.11691568] [ 0.29074991]
80 [ 0.10494711] [ 0.29729477]
100 [ 0.10144684] [ 0.29920882]
120 [ 0.10042314] [ 0.29976863]
140 [ 0.10012376] [ 0.29993233]
160 [ 0.1000362] [ 0.29998022]
180 [ 0.1000106] [ 0.29999423]
200 [ 0.10000309] [ 0.29999831]

So why not bootstrap that nvidia container (the base for your jtriley) and make the bind points, then just pull it?

Intriguing. Could you provide a little more detail please? I am not a singularity wizard.

Just make the nvidia container your base when you build. This is in your Singularity recipe file:

Bootstrap: docker
From: nvcr.io/nvidia/tensorflow:18.01-py3

%post
    mkdir -p /my/special/bind

You can add / customize whatever about the nvidia container that you need. If you want to test editing the container, pull a writable one (on your local machine).

sudo singularity build --writable tf-ext3.img docker://nvcr.io/nvidia/tensorflow:18.01-py3
sudo singularity build --sandbox tensorflow/ docker://nvcr.io/nvidia/tensorflow:18.01-py3

The top builds ext3, the bottom is a sandbox (folder). Then you can shell in with --writable and test actually running and making changes. Add the commands you like to your recipe, then build it "for reals" and give it a go on your (read only) cluster.
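For reference, a slightly fuller version of such a recipe might look like this; the %environment and %runscript sections are purely illustrative additions, not something the nvidia image requires:

```
Bootstrap: docker
From: nvcr.io/nvidia/tensorflow:18.01-py3

%post
    # create the cluster bind points inside the image
    mkdir -p /extra /xdisk

%environment
    # exported when the container runs (illustrative)
    export LC_ALL=C

%runscript
    # what "singularity run" executes (illustrative)
    exec python "$@"
```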

Thank you. I will try that out

@vsoch The container I built with the jtriley singularity was made with a recipe similar to the one above, and it works fine, but only when run under the jtriley version of singularity.
The one I would like to modify the binding on is the one I obtained with "sregistry pull nvidia://tensorflow:18.01-py3".
Alternatively, we could update our main singularity module once the nvcr-io-registry-fixes branch is included.
(sorry for slow replies - I get distracted by other work thingies)

Oh yes, you can do that too! Instead of a docker bootstrap, use localimage and provide the path of the image file in the recipe.

I have never tried that - it makes sense conceptually. I will try it after my upcoming meeting
I keep bugging you - still soft and squishy hopefully (see earlier post :-) )
Thanks

Here are the sparse docs for it; it should work, I hope! http://singularity.lbl.gov/build-localimage

@vsoch I made a new image but get the same result using this recipe:
Bootstrap: localimage
From: /home/u13/chrisreidy/.singularity/shub/nvidia-tensorflow:18.01-py3.simg

mkdir /extra
mkdir /xdisk

The build itself worked on the container that I created with sregistry:
singularity build nvidia-tensorflow.simg localtest
...
Building Singularity image...
Singularity container built: nvidia-tensorflow.simg
Cleaning up...

When I run it, it immediately complains about bind points:
singularity exec --nv nvidia-tensorflow.simg python
which: no nvidia-smi in (/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin)
WARNING: Could not find the Nvidia SMI binary to bind into container
WARNING: Non existent bind point (directory) in container: '/extra'
WARNING: Non existent bind point (directory) in container: '/xdisk'
Python 3.5.2 (default, Nov 23 2017, 16:37:01)

So when I run it against my test input in the /extra path, it does not work; the same file in /home does work.
I tried to do a localimage build on the jtriley image, but it complained about a version mismatch, the same mismatch that happens when I run a container made with the jtriley singularity under the main singularity module.

So I am no closer currently, but I feel I am close to success (he says optimistically)

Can you show me the full recipe?

Um well, this is it:
Bootstrap: localimage
From: /home/u13/chrisreidy/.singularity/shub/nvidia-tensorflow:18.01-py3.simg

mkdir /extra
mkdir /xdisk

I modified this one which seemed to work:
Bootstrap: docker
From: nvcr.io/nvidia/tensorflow:18.01-py3
Registry: nvcr.io
IncludeCmd: yes

mkdir /extra
mkdir /xdisk

ah! Try this:

Bootstrap: localimage
From: /home/u13/chrisreidy/.singularity/shub/nvidia-tensorflow:18.01-py3.simg

%post
    mkdir /extra 
    mkdir /xdisk

Oh. I got lazy. Check this out:

singularity exec --nv nvidia-tensorflow.simg python TFlow_example.py
which: no nvidia-smi in (/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin)
WARNING: Could not find the Nvidia SMI binary to bind into container
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/util/tf_should_use.py:107: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use tf.global_variables_initializer instead.
2018-02-22 04:18:29.091637: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:0b:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-02-22 04:18:29.091682: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:0b:00.0, compute capability: 6.0)
0 [-0.26802459] [ 0.74230778]
20 [-0.00335506] [ 0.35877749]
40 [ 0.0771822] [ 0.31297639]
60 [ 0.09496249] [ 0.30286482]
80 [ 0.09888787] [ 0.30063248]
100 [ 0.09975447] [ 0.30013964]
120 [ 0.09994578] [ 0.30003086]
140 [ 0.09998804] [ 0.30000681]
160 [ 0.09999739] [ 0.3000015]
180 [ 0.09999942] [ 0.30000034]
200 [ 0.09999987] [ 0.30000007]

So my flow will be to pull the nvidia images from the hub using
export SREGISTRY_NVIDIA_TOKEN
Then run it through a localimage build to add the bind points, then make it available for users. Yay!
Thank you