Issue pulling image with insufficient subuids/subgids
vsoch opened this issue · comments
I'm (fairly successfully) running different kinds of pods in usernetes, but I just hit this error:
Warning Failed 28s kubelet Failed to pull image "vanessa/pytorch-dist-test": failed to pull and unpack image "docker.io/vanessa/pytorch-dist-test:latest": failed to extract layer sha256:b51c2b01ae19fd8eccd86ab9a8667a71f5ae4f739790cd8859935405bcceca93: mount callback failed on /var/lib/containerd/tmpmounts/containerd-mount180165242: failed to Lchown "/var/lib/containerd/tmpmounts/containerd-mount180165242/opt/conda/pkgs/pytorch-1.0.0-py3.6_cuda10.0.130_cudnn7.4.1_1/bin" for UID 1185200044, GID 1185200044: lchown /var/lib/containerd/tmpmounts/containerd-mount180165242/opt/conda/pkgs/pytorch-1.0.0-py3.6_cuda10.0.130_cudnn7.4.1_1/bin: invalid argument (Hint: try increasing the number of subordinate IDs in /etc/subuid and /etc/subgid): unknown
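For context, the key detail in the error is the UID: in rootless mode, containerd must chown extracted layer files to host-side IDs drawn from the user's subordinate range, so a container UID is only mappable if it falls below the range's count. A minimal sketch of that check, using the UID from the error above and an illustrative range (the start/count values are assumptions, not read from this system):

```shell
# Illustrative subordinate range (start:count) as found in /etc/subuid;
# these values are a common default, not taken from the failing host.
range_start=100000
range_count=65536
container_uid=1185200044   # the UID containerd tried to lchown above

# Container UID u maps to host UID range_start + u, which is only valid
# for u in [0, range_count - 1]. A UID past the count cannot be mapped,
# so the lchown fails with "invalid argument".
if [ "$container_uid" -ge "$range_count" ]; then
  echo "unmappable: container UID $container_uid needs a count larger than it"
fi
```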
I'm not sure the error message reflects the actual problem; I was wondering if there are too many containers running? I figured out I could do make shell
to get into one of the nodes, and then I found a way to list the containerd images:
root@u7s-lima-flux-0:/usernetes# crictl images
IMAGE TAG IMAGE ID SIZE
docker.io/flannel/flannel-cni-plugin v1.2.0 a55d1bad692b7 3.88MB
docker.io/flannel/flannel v0.22.2 49937eb983daf 27MB
docker.io/kindest/kindnetd v20230511-dc714da8 b0b1fa0f58c6e 27.7MB
docker.io/kindest/local-path-helper v20230510-486859a6 be300acfc8622 3.05MB
docker.io/kindest/local-path-provisioner v20230511-dc714da8 ce18e076e9d4b 19.4MB
registry.k8s.io/coredns/coredns v1.10.1 ead0a4a53df89 16.2MB
registry.k8s.io/etcd 3.5.9-0 73deb9a3f7025 103MB
registry.k8s.io/kube-apiserver v1.28.0 a432ea809db3e 85.8MB
registry.k8s.io/kube-apiserver v1.28.4 7fe0e6f37db33 34.7MB
registry.k8s.io/kube-controller-manager v1.28.0 df537910e4a99 81.5MB
registry.k8s.io/kube-controller-manager v1.28.4 d058aa5ab969c 33.4MB
registry.k8s.io/kube-proxy v1.28.0 b16199d508b6d 74.7MB
registry.k8s.io/kube-proxy v1.28.4 83f6cc407eed8 24.6MB
registry.k8s.io/kube-scheduler v1.28.0 553617289d9f1 61.5MB
registry.k8s.io/kube-scheduler v1.28.4 e3db313c6dbc0 18.8MB
registry.k8s.io/pause 3.7 221177c6082a8 311kB
registry.k8s.io/pause 3.9 e6f1816883972 322kB
root@u7s-lima-flux-0:/usernetes# crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
7c707a589ca66 ead0a4a53df89 4 hours ago Running coredns 0 aed90e90c6242 coredns-5dd5756b68-9kl5h
9776db8e8b544 ead0a4a53df89 4 hours ago Running coredns 0 4a0eeeb48b889 coredns-5dd5756b68-qnbdl
d5fba5d8745e9 49937eb983daf 4 hours ago Running kube-flannel 0 2a4085ee6ee10 kube-flannel-ds-77kwr
2eb73b78f1728 83f6cc407eed8 4 hours ago Running kube-proxy 0 bfc90a6ccd0cc kube-proxy-czg44
f5f8eb5441fdd 73deb9a3f7025 4 hours ago Running etcd 0 4c14985778844 etcd-u7s-lima-flux-0
eeb8ff772a280 7fe0e6f37db33 4 hours ago Running kube-apiserver 0 3ec15fdc4a314 kube-apiserver-u7s-lima-flux-0
9521080e367b7 d058aa5ab969c 4 hours ago Running kube-controller-manager 0 c76cf2b2f0b3a kube-controller-manager-u7s-lima-flux-0
9515db7b8fb1c e3db313c6dbc0 4 hours ago Running kube-scheduler 0 54bedc845ed7a kube-scheduler-u7s-lima-flux-0
These just look like images for the kubelet or control plane (not any applications), and interestingly, there aren't any subuids in the files here:
# cat /etc/subuid
# cat /etc/subgid
# both empty files
Is there a bug here / something we can do to get it to work?
Also, this seems to be an issue with creating (maybe?) more pods than my resources (cpu, etc) can support. I deployed a smaller pytorch workflow ref and it worked! @AkihiroSuda this is SO cool it's rocking my socks!! 🧦 This is what we wanted to get working many months ago and I'm over the moon it's starting to! 🌔
These just look like images for the kubelet or control plane (not any applications), and interestingly, there aren't any subuids in the files here:
Please check the files on the host.
You probably have 65536 ids there.
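To illustrate what that host entry means, here is a sketch of reading a stock /etc/subuid line ("user:start:count"); the username and values are assumed defaults, not copied from this host:

```shell
# A typical default /etc/subuid entry; "vsoch" is an assumption here.
entry="vsoch:100000:65536"

# Split the user:start:count fields with parameter expansion.
user=${entry%%:*}
rest=${entry#*:}
start=${rest%%:*}
count=${rest#*:}

echo "$user can map $count container IDs (host IDs $start..$((start + count - 1)))"
```

With the default count of 65536, the largest mappable container UID is 65535, far below the 1185200044 the image needs.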
Yes, the uid/gid files for the host virtual machine (not inside of docker compose) look OK.
Please try increasing 65536 there to a larger number
Please try increasing 65536 there to a larger number
Sure! I've never done that on my host. How large should it be?
Depends, but at least 1185200044 for your image:
for UID 1185200044, GID 1185200044: lchown /var/lib/containerd/tmpmounts/containerd-mount180165242/opt/conda/pkgs/pytorch-1.0.0-py3.6_cuda10.0.130_cudnn7.4.1_1/bin: invalid argument (Hint: try increasing the number of subordinate IDs in /etc/subuid and /etc/subgid)
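Putting the numbers together, a sketch of composing an /etc/subuid-style line large enough to cover the failing UID. This writes to a demo file rather than /etc/subuid, and the username "vsoch" and start of 100000 are assumptions for illustration:

```shell
# Compose a "user:start:count" line covering the UID from the error.
# Written to a demo file, NOT to /etc/subuid; adjust the user to match
# your host account before applying for real.
user=vsoch
start=100000
needed_uid=1185200044
count=$((needed_uid + 1))   # container UIDs 0..count-1 become mappable

printf '%s:%d:%d\n' "$user" "$start" "$count" > demo_subuid
cat demo_subuid
```

The same line would need to go into both /etc/subuid and /etc/subgid on the host (the hint in the error mentions both files), followed by restarting Usernetes so the enlarged range is picked up.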