Cotainr fails to build container when /tmp runs out of space
Chroxvi opened this issue
If you try to build a container using cotainr, e.g. `cotainr build lumi_pytorch_rocm_demo.sif --system=lumi-g --conda-env py311_rocm542_pytorch.yml`, on a system that does not have sufficient space on /tmp to store the entire container, you will encounter an error like:
```
...
INFO: Creating sandbox directory...
INFO: Build complete: /tmp/tmpobd2znt4/singularity_sandbox
Pip subprocess error:
ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device
...
```
Cotainr does not provide a CLI option, environment variable, or similar for changing the location of the temporary sandbox directory created during the build phase to a location other than /tmp. This can be a problem if:
- You are building a very large container that does not fit in /tmp on your system. In that case you can't use cotainr at all.
- You are building containers on a shared system, e.g. a login node on a shared HPC system, where other users may also be filling up /tmp. In that case you may experience occasional build failures.
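Since there is currently no way to redirect the sandbox away from /tmp, one pragmatic mitigation on a shared system is to check the free space in /tmp before starting a build. A minimal sketch (the 20 GiB threshold is an arbitrary example, not a cotainr requirement):

```shell
# Check free space in /tmp before running "cotainr build".
# required_kib is a hypothetical estimate of the final container size.
required_kib=$((20 * 1024 * 1024))  # 20 GiB expressed in KiB

# GNU df can print just the available column; strip whitespace for the test.
avail_kib=$(df --output=avail /tmp | tail -1 | tr -d ' ')

if [ "$avail_kib" -lt "$required_kib" ]; then
    echo "Only $((avail_kib / 1024 / 1024)) GiB free in /tmp; the build may fail" >&2
else
    echo "Sufficient space in /tmp"
fi
```

This only reduces the chance of a mid-build failure; another process can still fill /tmp while the build is running.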
If this happens on a login node on the LUMI supercomputer, you can, as a workaround, try building your container on another node - either another login node or a compute node. To use a LUMI-C compute node, you can submit a job using srun, e.g. something along the lines of: `srun --account=project_<your_project_id> --time=00:15:00 --mem=64G --cpus-per-task=32 --partition=small cotainr build lumi_pytorch_rocm_demo.sif --system=lumi-g --conda-env py311_rocm542_pytorch.yml`. Note that you have to request enough memory, since /tmp is mounted as an in-memory filesystem on LUMI compute nodes. Also note that building on a compute node consumes CPUh/GPUh resources from your LUMI project.
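The same srun workaround can also be written as a batch script submitted with sbatch, which may be more convenient for long builds. A sketch, assuming the resource requests from the srun example above (the project ID is a placeholder):

```shell
#!/bin/bash
# Hypothetical sbatch variant of the srun workaround above.
#SBATCH --account=project_<your_project_id>
#SBATCH --time=00:15:00
#SBATCH --mem=64G
#SBATCH --cpus-per-task=32
#SBATCH --partition=small

# --mem must cover both the build processes and the sandbox itself,
# because /tmp on LUMI compute nodes is an in-memory filesystem.
cotainr build lumi_pytorch_rocm_demo.sif --system=lumi-g \
    --conda-env py311_rocm542_pytorch.yml
```

Submit it with `sbatch <script>.sh`; the resulting .sif file is written to the submission directory.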