singularityware / singularityware.github.io

base documentation site for Singularity software

Home Page: https://singularityware.github.io


Trying to create image from Nvidia registry with token

chrisreidy opened this issue · comments

Nvidia provides Docker images of ML code. My recipe is:

Bootstrap: docker
From: nvcr.io/nvidia/tensorflow:18.01-py3
Registry: nvcr.io
IncludeCmd: yes
Username: $oauthtoken
Password: APIkeyinsertedhere
mkdir /extra
mkdir /xdisk

And I get this:
singularity build ngc.tensorflow.18-01-py2.img ngctest
Using container recipe deffile: ngctest
Sanitizing environment
Adding base Singularity environment to container
/cm/shared/uaapps/singularity/2.4/libexec/singularity/functions: line 87: [: DEBUG: integer expression expected
ERROR Unrecognized authentication challenge, exiting.
Cleaning up...
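For context, the "Unrecognized authentication challenge" error most likely means the client could not parse the WWW-Authenticate header that the registry sends back with its 401 response. As a rough sketch, this is the shape of header a docker-style registry returns and what a client extracts from it; the realm URL below is made up for illustration, not nvcr.io's actual endpoint:

```shell
# Hypothetical Bearer challenge, in the shape a docker-style registry returns
challenge='Bearer realm="https://nvcr.io/proxy_auth",service="registry",scope="repository:nvidia/tensorflow:pull"'

# Pull out the realm and service the way a registry client would,
# before requesting a token from the realm URL
realm=$(printf '%s' "$challenge" | sed -n 's/.*realm="\([^"]*\)".*/\1/p')
service=$(printf '%s' "$challenge" | sed -n 's/.*service="\([^"]*\)".*/\1/p')
echo "$realm $service"
```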

Any advice on fixing the error, and on a better-designed recipe, would be appreciated, as I plan to repeat this process many times.
Thanks in advance
Chris

First, export the credentials to the environment:

export SINGULARITY_DOCKER_USERNAME="\$oauthtoken"
export SINGULARITY_DOCKER_PASSWORD=xxxxxxxxxxxx
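Note that the NGC username is the literal string $oauthtoken, which is why the backslash is there: inside double quotes the shell would otherwise expand it as a (likely empty) variable. A quick sketch of the two quoting styles that keep it literal:

```shell
# The NGC username is the literal string $oauthtoken; both assignments keep it literal
u1="\$oauthtoken"   # escaped dollar inside double quotes
u2='$oauthtoken'    # single quotes need no escaping
echo "$u1"
```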

Many times the issue is a missing tag; for example, a reference without one defaults to latest (and minimally we should check that it exists). Then, to start even more simply, let's just get a pull working. The branch that you have to use (not merged) to reach the nvidia cloud is this one:

apptainer/singularity#1184

And then just the pull should work like:

singularity pull docker://nvcr.io/nvidia/tensorflow:18.01-py3
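On the tag-defaulting point above: when a docker-style reference has no explicit tag, clients fall back to latest. A sketch of that parsing rule, as an illustration rather than Singularity's actual code:

```shell
# Print the tag of a docker-style image reference, defaulting to latest
tag_of() {
  case "${1##*/}" in      # look only at the last path component
    *:*) printf '%s\n' "${1##*:}" ;;
    *)   printf 'latest\n' ;;
  esac
}
tag_of nvcr.io/nvidia/tensorflow            # -> latest
tag_of nvcr.io/nvidia/tensorflow:18.01-py3  # -> 18.01-py3
```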

A quick thing to try is sregistry:

pip install sregistry
sregistry pull nvidia://tensorflow:18.01-py3

And again you would need to define variables in the environment, see here:

https://singularityhub.github.io/sregistry-cli/client-nvidia

Thanks for the super prompt reply, Vanessa. I will try that.

I tried going through #1184 without much luck.
Using sregistry was the answer:
log in as user
module load python/2
module load singularity
export SREGISTRY_NVIDIA_TOKEN="apitokenhere"
sregistry pull nvidia://pytorch:18.01-py3
...
Success /home/...nvidia-pytorch:18.01-py3.simg

Thank you

Great! Note that I don't think the environment (e.g., changes to the path) is working yet; I can't get my token working to figure out the right call to get the manifest. Hopefully they will have better docs for their API soon (they are pretty... missing... lol). As a workaround you can make a build recipe with those variables defined, or just define them at runtime.

@vsoch Do you have a sample recipe for one of their Docker images, by chance? Would I need to rebuild Singularity with the jtriley:nvcr-io-registry-fixes branch? I had trouble with that.

Yep, you would need to install Singularity from that branch, meaning cloning it and then doing the whole routine. I tried to do that to peek at how the headers were being handled, but I never got past my token being denied. I don't have an example recipe, but I'd be glad to help you if you have an image in mind. The biggest issue is that there isn't a nice place where all this is shared.

Ok - that makes sense. I will first play with what I created using sregistry and see how that works. Enjoy your weekend

@vsoch I seem to have got to the point where running the container is the issue. I have created singularity images two ways:

  1. sregistry pull nvidia://tensorflow:18.01-py3
    and run it with:
    singularity exec nvidia-tensorflow:18.01-py3.simg python ~/.singularity/shub/TFlow_example.py

  2. Build singularity from jtriley with the nvcr fixes.
    then build container:
    singularity build ngc.tensorflow.18-01-py2.img ngctest
    and then run it with
    singularity exec ngc.tensorflow.18-01-py2.img python

import tensorflow as tf

In both cases I get:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
even though that library is in my path. Apparently it is not in the container's path
Progress I guess
Sigh

Did you try with the --nv flag?

so like

singularity exec --nv $(sregistry get nvidia/tensorflow:18.01-py3) python -c "import tensorflow as tf"

I don't have any of those drivers, so I can't test on my local machine; I get the same error!
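One way to tell the two situations apart is to check whether the host exposes the driver library at all before blaming the container; --nv can only bind in what the host actually has. A sketch (the check is the point; ldconfig output and library paths vary by distro):

```shell
# Does this host expose the NVIDIA driver library that --nv would bind in?
if ldconfig -p 2>/dev/null | grep -q 'libcuda\.so\.1'; then
  msg="host has libcuda.so.1; --nv can bind it"
else
  msg="no libcuda.so.1 on this host; run on a GPU node"
fi
echo "$msg"
```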

Don't give up @chrisreidy !! These are just stupid containers. With enough eyes.... all bugs are cute and squishy and you come to wave when you see them again!

And sometimes edible if you are so inclined!
Inserting --nv was the trick. Thanks
So the container pulled with sregistry was not created from a recipe, so I do not have any custom bind points and have to use a default file path.
The container built with the singularity I made from "jtriley/singularity" has to be run with that same singularity, so I cannot use it unless I install that version on the compute nodes in place of the standard one. That would be possible, since we use modules, but it is not desirable.
So I am not quite there yet

Here is the result of running the sregistry version on a GPU compute node
singularity exec --nv nvidia-tensorflow:18.01-py3.simg python /home/u13/chrisreidy/.singularity/shub/TFlow_example.py
which: no nvidia-smi in (/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin)
WARNING: Could not find the Nvidia SMI binary to bind into container
WARNING: Non existent bind point (directory) in container: '/extra'
WARNING: Non existent bind point (directory) in container: '/xdisk'
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/util/tf_should_use.py:107: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use tf.global_variables_initializer instead.
2018-02-21 16:47:19.233527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:0b:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-02-21 16:47:19.233573: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:0b:00.0, compute capability: 6.0)
0 [ 0.76190269] [-0.09601036]
20 [ 0.29777145] [ 0.19185154]
40 [ 0.15783976] [ 0.26837116]
60 [ 0.11691568] [ 0.29074991]
80 [ 0.10494711] [ 0.29729477]
100 [ 0.10144684] [ 0.29920882]
120 [ 0.10042314] [ 0.29976863]
140 [ 0.10012376] [ 0.29993233]
160 [ 0.1000362] [ 0.29998022]
180 [ 0.1000106] [ 0.29999423]
200 [ 0.10000309] [ 0.29999831]

So why not bootstrap that nvidia container (the base for your jtriley) and make the bind points, then just pull it?

Intriguing. Could you provide a little more detail please? I am not a singularity wizard.

Just make the nvidia container your base when you build. This is in your Singularity recipe file:

Bootstrap: docker
From: nvcr.io/nvidia/tensorflow:18.01-py3

%post
    mkdir -p /my/special/bind

You can add / customize whatever about the nvidia container that you need. If you want to test editing the container, pull a writable one (on your local machine).

sudo singularity build --writable tf-ext3.img docker://nvcr.io/nvidia/tensorflow:18.01-py3
sudo singularity build --sandbox tensorflow/ docker://nvcr.io/nvidia/tensorflow:18.01-py3

The top builds ext3, the bottom is a sandbox (folder). Then you can shell in with --writable and test actually running and making changes. Add the commands you like to your recipe, then build it "for reals" and give it a go on your (read only) cluster.
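For reference, a slightly fuller version of such a recipe might look like this; the %environment and %runscript sections are purely illustrative additions, not something the nvidia image requires:

```
Bootstrap: docker
From: nvcr.io/nvidia/tensorflow:18.01-py3

%post
    # create the cluster bind points inside the image
    mkdir -p /extra /xdisk

%environment
    # exported when the container runs (illustrative)
    export LC_ALL=C

%runscript
    # what "singularity run" executes (illustrative)
    exec python "$@"
```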

Thank you. I will try that out

@vsoch The container I built with the jtriley singularity was made with a recipe similar to the one above, and it works fine, but only when run under the jtriley version of singularity.
The one I would like to modify the binding on is the one I obtained with "sregistry pull nvidia://tensorflow:18.01-py3".
Alternatively, we could update our main singularity module once the nvcr-io-registry-fixes branch is included.
(sorry for slow replies - I get distracted by other work thingies)

Oh yes, you can do that too! Instead of a docker bootstrap, use localimage and provide the path of the image file in the recipe.

I have never tried that - it makes sense conceptually. I will try it after my upcoming meeting
I keep bugging you - still soft and squishy hopefully (see earlier post :-) )
Thanks

Here are the sparse docs for it; it should work, I hope! http://singularity.lbl.gov/build-localimage

@vsoch I made a new image but get the same result using this recipe:
Bootstrap: localimage
From: /home/u13/chrisreidy/.singularity/shub/nvidia-tensorflow:18.01-py3.simg

mkdir /extra
mkdir /xdisk

The build itself worked on the container that I created with sregistry:
singularity build nvidia-tensorflow.simg localtest
...
Building Singularity image...
Singularity container built: nvidia-tensorflow.simg
Cleaning up...

When I run it, it immediately complains about bind points:
singularity exec --nv nvidia-tensorflow.simg python
which: no nvidia-smi in (/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin)
WARNING: Could not find the Nvidia SMI binary to bind into container
WARNING: Non existent bind point (directory) in container: '/extra'
WARNING: Non existent bind point (directory) in container: '/xdisk'
Python 3.5.2 (default, Nov 23 2017, 16:37:01)

So when I run it against my test input in the /extra path, it does not work; the same file in /home does work.
I tried to do a localimage build on the jtriley image, but it complained about a version mismatch, the same mismatch that happens when I run a container made with the jtriley singularity under the main singularity module.

So I am no closer currently, but I feel I am close to success (he says optimistically)

Can you show me the full recipe?

Um well, this is it:
Bootstrap: localimage
From: /home/u13/chrisreidy/.singularity/shub/nvidia-tensorflow:18.01-py3.simg

mkdir /extra
mkdir /xdisk

I modified this one which seemed to work:
Bootstrap: docker
From: nvcr.io/nvidia/tensorflow:18.01-py3
Registry: nvcr.io
IncludeCmd: yes

mkdir /extra
mkdir /xdisk

ah! Try this:

Bootstrap: localimage
From: /home/u13/chrisreidy/.singularity/shub/nvidia-tensorflow:18.01-py3.simg

%post
    mkdir /extra 
    mkdir /xdisk

Oh. I got lazy. Check this out:

singularity exec --nv nvidia-tensorflow.simg python TFlow_example.py
which: no nvidia-smi in (/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin)
WARNING: Could not find the Nvidia SMI binary to bind into container
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/util/tf_should_use.py:107: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use tf.global_variables_initializer instead.
2018-02-22 04:18:29.091637: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:0b:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-02-22 04:18:29.091682: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:0b:00.0, compute capability: 6.0)
0 [-0.26802459] [ 0.74230778]
20 [-0.00335506] [ 0.35877749]
40 [ 0.0771822] [ 0.31297639]
60 [ 0.09496249] [ 0.30286482]
80 [ 0.09888787] [ 0.30063248]
100 [ 0.09975447] [ 0.30013964]
120 [ 0.09994578] [ 0.30003086]
140 [ 0.09998804] [ 0.30000681]
160 [ 0.09999739] [ 0.3000015]
180 [ 0.09999942] [ 0.30000034]
200 [ 0.09999987] [ 0.30000007]

So my flow will be to pull the nvidia images from the hub using
export SREGISTRY_NVIDIA_TOKEN
Then run it through a localimage build to add the bind points, then make it available for users. Yay!
Thank you