NVIDIA / libnvidia-container

NVIDIA container runtime library

Debian 11 repo is broken

MKrupauskas opened this issue · comments

The commit 2dff280 restructured the Debian 11 repo in a breaking way.

Previously, the setup below worked; now it fails:

root@host:/# cat /etc/apt/sources.list | grep nvidia
deb [arch=amd64] http://aptarchive.uber.internal/libnvidia-container/debian11/amd64 /

root@host:/# apt update
...
E: The repository 'http://repo.internal/libnvidia-container/debian11/amd64  Release' does not have a Release file.

This is because /debian11 used to symlink to /stable/debian11, which in turn symlinked to /stable/debian10, which contained the amd64 directory with the .deb builds. https://github.com/NVIDIA/libnvidia-container/tree/9ce31ae4f042508cd8aabfad6168114c1cde30f0

/debian10, /debian11, and /stable/debian11 should all have amd64 symlinks ultimately pointing to /stable/debian10/amd64.
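
For illustration, in a local checkout of the gh-pages branch the expected layout could be recreated roughly like this (a sketch only; the paths are illustrative, not the exact repository contents):

# Illustrative only: each distribution directory gets an amd64 symlink that
# ultimately resolves to stable/debian10/amd64, the directory with the .deb builds.
cd gh-pages
mkdir -p debian10 debian11 stable/debian11
ln -s ../stable/debian10/amd64 debian10/amd64
ln -s ../stable/debian10/amd64 debian11/amd64
ln -s ../debian10/amd64 stable/debian11/amd64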

@MKrupauskas would switching to /libnvidia-container/debian10/amd64 as the source of truth for the package be a solution on your end?

Our intent with the official documentation was to make downloading the repository list file work across different distributions, with the .list files locally referring to the lowest compatible distribution for a given package flavor.

In the Debian case, this is debian10. The motivation for the changes that are causing the breakages are called out in NVIDIA/nvidia-container-toolkit#89 (comment)
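
As an illustration (hypothetical file contents, not the exact published file), the .list downloaded on a Debian 11 host would then contain something along these lines:

# Hypothetical contents of nvidia-container-toolkit.list on a Debian 11 host:
# the entry points at debian10, the lowest compatible distribution for this flavor.
deb https://nvidia.github.io/libnvidia-container/stable/debian10/$(ARCH) /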

All these user complaints would be solved with a symlink 😉


@jonathanjsimon it's not quite as simple as that. A symlink duplicates the contents of the target folder at the link location when publishing these repos through GitHub Pages. The reason this optimisation was performed was that the resultant artifact is already too large, causing the Pages deployment to fail, meaning that new packages are not available.

We are aware that there may be ways to increase the timeout using custom pages deployments. If you have experience in how to do this, suggestions are welcome.

While we did work around the issue by pointing our source list to Debian 10, the solution isn't ideal. If the only issue is the artifact size and build timeouts, I think we should address that for the sake of having a Debian repo that matches the repo standard and user expectations.

Could you share some logs on what exactly times out if we correctly symlink the distribution directories? Looking at the GitHub Actions docs, the steps themselves shouldn't time out for 360 minutes unless the default is overridden: https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepstimeout-minutes
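
If it is a step- or job-level timeout, something like the following in the deployment workflow might already be enough (an illustrative fragment only; I don't know how the actual workflow is laid out):

# Illustrative fragment only; the real deployment workflow and step names may differ.
jobs:
  deploy-pages:
    runs-on: ubuntu-latest
    timeout-minutes: 360        # explicit job-level limit
    steps:
      - uses: actions/deploy-pages@v4
        timeout-minutes: 120    # per-step limit, if a single step is what times out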

@MKrupauskas I have made the symlink changes to my personal mirror elezar@98ee43d.

The GitHub Actions workflow deploying this is here:

A previous action shows the archive size warning:

The following is an example of a deployment that failed due to a timeout, although this was using the "Deploy from branch" pages deployment and not an explicit workflow as we are using now.

We have updated our repository structure and installation instructions to make use of generic Debian packages. The distribution name no longer affects the instructions.

Please see https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html and reopen this issue if there are still problems.

Hi there,
when using tools like apt-mirror or apt-mirror2, the Packages file is always empty after being downloaded from https://nvidia.github.io/libnvidia-container/stable/deb/amd64/Packages, although it works in a browser. Do you have any idea where to look for a solution?

@HenriWahl I don't know what apt-mirror expects. This is the file tree as deployed to GitHub pages: https://github.com/NVIDIA/libnvidia-container/tree/gh-pages/stable/deb/amd64

If there is additional metadata required by the tooling we could consider adding it.

@elezar I am not sure what is missing; it looks good to me.
The only hint I have is that it works with https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64, so maybe there is some difference.

Edit: yes, there are some differences:

Edit 2: I found this being an older problem: NVIDIA/nvidia-docker#730

Those are useful pointers. I will spend some time investigating this.

I have just tried the following in a clean ubuntu container:

  1. Installed apt-mirror
  2. Edit /etc/apt/mirror.list to only reference:
     deb https://nvidia.github.io/libnvidia-container/experimental/deb/amd64 /
  3. When running apt-mirror I then see:
     Processing indexes: [Psh: 1: xz: not found
     ]
  4. I then installed xz-utils:
     apt-get install -y xz-utils
  5. When I now ran apt-mirror the repo is mirrored:
     $ ls /var/spool/apt-mirror/mirror/
     nvidia.github.io
  6. And in the folders themselves:
     ls /var/spool/apt-mirror/mirror/nvidia.github.io/libnvidia-container/experimental/deb/amd64/
Packages                                           libnvidia-container-tools_1.15.0~rc.3-1_amd64.deb  nvidia-container-toolkit-base_1.14.0~rc.2-1_amd64.deb
Packages.xz                                        libnvidia-container1-dbg_1.14.0~rc.2-1_amd64.deb   nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb
libnvidia-container-dev_1.14.0~rc.2-1_amd64.deb    libnvidia-container1-dbg_1.15.0~rc.1-1_amd64.deb   nvidia-container-toolkit-base_1.15.0~rc.2-1_amd64.deb
libnvidia-container-dev_1.15.0~rc.1-1_amd64.deb    libnvidia-container1-dbg_1.15.0~rc.2-1_amd64.deb   nvidia-container-toolkit-base_1.15.0~rc.3-1_amd64.deb
libnvidia-container-dev_1.15.0~rc.2-1_amd64.deb    libnvidia-container1-dbg_1.15.0~rc.3-1_amd64.deb   nvidia-container-toolkit_1.14.0~rc.2-1_amd64.deb
libnvidia-container-dev_1.15.0~rc.3-1_amd64.deb    libnvidia-container1_1.14.0~rc.2-1_amd64.deb       nvidia-container-toolkit_1.15.0~rc.1-1_amd64.deb
libnvidia-container-tools_1.14.0~rc.2-1_amd64.deb  libnvidia-container1_1.15.0~rc.1-1_amd64.deb       nvidia-container-toolkit_1.15.0~rc.2-1_amd64.deb
libnvidia-container-tools_1.15.0~rc.1-1_amd64.deb  libnvidia-container1_1.15.0~rc.2-1_amd64.deb       nvidia-container-toolkit_1.15.0~rc.3-1_amd64.deb
libnvidia-container-tools_1.15.0~rc.2-1_amd64.deb  libnvidia-container1_1.15.0~rc.3-1_amd64.deb

Could you confirm that xz-utils is installed on your system?
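
For example, either of the following should show whether it is present (any equivalent check works):

# Check whether xz-utils is installed on a Debian/Ubuntu system.
dpkg -s xz-utils
command -v xz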

Hi @elezar - thanks for your investigations!

I can confirm that my apt-mirror image did NOT have the package xz-utils installed, but now it works WITH it!

Great job! 👍

@elezar one thing is left: now the apt command on a client complains that there is no Release file.

I see it is even missing at https://github.com/NVIDIA/libnvidia-container/tree/gh-pages/stable/deb/amd64.

From the following documentation: https://wiki.debian.org/DebianRepository/Format#Flat_Repository_Format it is unclear whether a Release file is actually required. It seems that either InRelease or Release must be specified.
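
If it turns out one is needed, generating a Release file for a flat repo is fairly mechanical; a minimal sketch (assuming dpkg-dev and apt-utils are available, and leaving the result unsigned) would be:

# Minimal sketch: regenerate flat-repo metadata including an (unsigned) Release file.
# Assumes dpkg-dev (dpkg-scanpackages) and apt-utils (apt-ftparchive) are installed.
cd stable/deb/amd64
dpkg-scanpackages --multiversion . > Packages
xz -kf Packages
apt-ftparchive release . > Release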

Can you give more information on what apt commands you're using and what the errors are?

After an apt update i get this:

Ign:5 https://mirror-apt.local/nvidia-container-toolkit-jammy  InRelease
Ign:6 https://mirror-apt.local/nvidia-cuda-jammy  InRelease
Err:7 https://mirror-apt.local/nvidia-container-toolkit-jammy  Release
  404  Not Found [IP: 10.10.10.10 443]
Hit:8 https://mirror-apt.local/nvidia-cuda-jammy  Release
Reading package lists... Done
E: The repository 'https://mirror-apt.local/nvidia-container-toolkit-jammy  Release' does not have a Release file.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.

InRelease and Release are both being tried. Meanwhile I found that neither of them exists in my local mirror, just as in your listing above.

Does:

sudo apt-get update --allow-insecure-repositories

work as expected?

Yes it does.

The problem seems to be caused by apt-mirror, according to apt-mirror/apt-mirror#156. It seems to miss this file on flat repositories. I will look for it or an alternative next week. Thanks for your commitment!


I think you can get past this by marking the local mirror as trusted or by ensuring that the public key for our repos is also downloaded. For example, as per our documentation https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-with-apt:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Note that the lines effectively look like:

deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /

in this case, and you would need to set up something similar for your mirrors.
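
Alternatively, if you don't want to distribute the key for the internal mirror, the mirror entry can be marked as trusted (a sketch only; this disables signature verification, and the URL is just the mirror path from the output above, with the suite/components depending on how your mirror's sources line is written):

# Illustrative only: explicitly trust the local flat mirror (skips signature checks).
deb [trusted=yes] https://mirror-apt.local/nvidia-container-toolkit-jammy /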