Cotainr fails to build container when /tmp runs out of space
Chroxvi opened this issue
If you try to build a container using cotainr, e.g. `cotainr build lumi_pytorch_rocm_demo.sif --system=lumi-g --conda-env py311_rocm542_pytorch.yml`, on a system that does not have sufficient space on /tmp to store the entire container, you will encounter an error like:
```
...
INFO: Creating sandbox directory...
INFO: Build complete: /tmp/tmpobd2znt4/singularity_sandbox
Pip subprocess error:
ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device
...
```
Cotainr does not provide a CLI option, environment variable, or similar for changing the location of the temporary sandbox directory created during the build phase to a location other than /tmp. This can be a problem if:
- You are building a very large container that does not fit in /tmp on your system. In that case you can't use cotainr at all.
- You are building containers on a shared system, e.g. a login node on a shared HPC system, where other users may also be filling up /tmp. In that case you may experience occasional build failures.
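Since there is currently no way to redirect the sandbox away from /tmp, one pragmatic mitigation on a shared system is to check the free space in /tmp before starting a build. A minimal sketch (the 20 GiB threshold is an arbitrary example, not a cotainr requirement):

```shell
# Check free space in /tmp before running "cotainr build".
# required_kib is a hypothetical estimate of the final container size.
required_kib=$((20 * 1024 * 1024))  # 20 GiB expressed in KiB

# GNU df can print just the available column; strip whitespace for the test.
avail_kib=$(df --output=avail /tmp | tail -1 | tr -d ' ')

if [ "$avail_kib" -lt "$required_kib" ]; then
    echo "Only $((avail_kib / 1024 / 1024)) GiB free in /tmp; the build may fail" >&2
else
    echo "Sufficient space in /tmp"
fi
```

This only reduces the chance of a mid-build failure; another process can still fill /tmp while the build is running.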
If this happens on a login node on the LUMI supercomputer, you can, as a workaround, try building your container on another node - either another login node or a compute node. To use a LUMI-C compute node, you can submit a job using srun, e.g. something along the lines of: `srun --account=project_<your_project_id> --time=00:15:00 --mem=64G --cpus-per-task=32 --partition=small cotainr build lumi_pytorch_rocm_demo.sif --system=lumi-g --conda-env py311_rocm542_pytorch.yml`. Note that you have to request enough memory, since /tmp is mounted as an in-memory filesystem on LUMI compute nodes. Also note that building on a compute node consumes CPUh/GPUh resources from your LUMI project.
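The same srun workaround can also be written as a batch script submitted with sbatch, which may be more convenient for long builds. A sketch, assuming the resource requests from the srun example above (the project ID is a placeholder):

```shell
#!/bin/bash
# Hypothetical sbatch variant of the srun workaround above.
#SBATCH --account=project_<your_project_id>
#SBATCH --time=00:15:00
#SBATCH --mem=64G
#SBATCH --cpus-per-task=32
#SBATCH --partition=small

# --mem must cover both the build processes and the sandbox itself,
# because /tmp on LUMI compute nodes is an in-memory filesystem.
cotainr build lumi_pytorch_rocm_demo.sif --system=lumi-g \
    --conda-env py311_rocm542_pytorch.yml
```

Submit it with `sbatch <script>.sh`; the resulting .sif file is written to the submission directory.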