Issue pulling image with insufficient subuids/subgids
vsoch opened this issue · comments
I'm (fairly successfully) running different kinds of pods in usernetes, but I just hit this error:
Warning Failed 28s kubelet Failed to pull image "vanessa/pytorch-dist-test": failed to pull and unpack image "docker.io/vanessa/pytorch-dist-test:latest": failed to extract layer sha256:b51c2b01ae19fd8eccd86ab9a8667a71f5ae4f739790cd8859935405bcceca93: mount callback failed on /var/lib/containerd/tmpmounts/containerd-mount180165242: failed to Lchown "/var/lib/containerd/tmpmounts/containerd-mount180165242/opt/conda/pkgs/pytorch-1.0.0-py3.6_cuda10.0.130_cudnn7.4.1_1/bin" for UID 1185200044, GID 1185200044: lchown /var/lib/containerd/tmpmounts/containerd-mount180165242/opt/conda/pkgs/pytorch-1.0.0-py3.6_cuda10.0.130_cudnn7.4.1_1/bin: invalid argument (Hint: try increasing the number of subordinate IDs in /etc/subuid and /etc/subgid): unknown
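For context, the key detail in the error is the UID: in rootless mode, containerd must chown extracted layer files to host-side IDs drawn from the user's subordinate range, so a container UID is only mappable if it falls below the range's count. A minimal sketch of that check, using the UID from the error above and an illustrative range (the start/count values are assumptions, not read from this system):

```shell
# Illustrative subordinate range (start:count) as found in /etc/subuid;
# these values are a common default, not taken from the failing host.
range_start=100000
range_count=65536
container_uid=1185200044   # the UID containerd tried to lchown above

# Container UID u maps to host UID range_start + u, which is only valid
# for u in [0, range_count - 1]. A UID past the count cannot be mapped,
# so the lchown fails with "invalid argument".
if [ "$container_uid" -ge "$range_count" ]; then
  echo "unmappable: container UID $container_uid needs a count larger than it"
fi
```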
I'm not sure the error message reflects the actual problem; I was wondering if there are too many containers running? I figured out I could do make shell
to get into one of the nodes, and then I found a way to list the containerd images:
root@u7s-lima-flux-0:/usernetes# crictl images
IMAGE TAG IMAGE ID SIZE
docker.io/flannel/flannel-cni-plugin v1.2.0 a55d1bad692b7 3.88MB
docker.io/flannel/flannel v0.22.2 49937eb983daf 27MB
docker.io/kindest/kindnetd v20230511-dc714da8 b0b1fa0f58c6e 27.7MB
docker.io/kindest/local-path-helper v20230510-486859a6 be300acfc8622 3.05MB
docker.io/kindest/local-path-provisioner v20230511-dc714da8 ce18e076e9d4b 19.4MB
registry.k8s.io/coredns/coredns v1.10.1 ead0a4a53df89 16.2MB
registry.k8s.io/etcd 3.5.9-0 73deb9a3f7025 103MB
registry.k8s.io/kube-apiserver v1.28.0 a432ea809db3e 85.8MB
registry.k8s.io/kube-apiserver v1.28.4 7fe0e6f37db33 34.7MB
registry.k8s.io/kube-controller-manager v1.28.0 df537910e4a99 81.5MB
registry.k8s.io/kube-controller-manager v1.28.4 d058aa5ab969c 33.4MB
registry.k8s.io/kube-proxy v1.28.0 b16199d508b6d 74.7MB
registry.k8s.io/kube-proxy v1.28.4 83f6cc407eed8 24.6MB
registry.k8s.io/kube-scheduler v1.28.0 553617289d9f1 61.5MB
registry.k8s.io/kube-scheduler v1.28.4 e3db313c6dbc0 18.8MB
registry.k8s.io/pause 3.7 221177c6082a8 311kB
registry.k8s.io/pause 3.9 e6f1816883972 322kB
root@u7s-lima-flux-0:/usernetes# crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
7c707a589ca66 ead0a4a53df89 4 hours ago Running coredns 0 aed90e90c6242 coredns-5dd5756b68-9kl5h
9776db8e8b544 ead0a4a53df89 4 hours ago Running coredns 0 4a0eeeb48b889 coredns-5dd5756b68-qnbdl
d5fba5d8745e9 49937eb983daf 4 hours ago Running kube-flannel 0 2a4085ee6ee10 kube-flannel-ds-77kwr
2eb73b78f1728 83f6cc407eed8 4 hours ago Running kube-proxy 0 bfc90a6ccd0cc kube-proxy-czg44
f5f8eb5441fdd 73deb9a3f7025 4 hours ago Running etcd 0 4c14985778844 etcd-u7s-lima-flux-0
eeb8ff772a280 7fe0e6f37db33 4 hours ago Running kube-apiserver 0 3ec15fdc4a314 kube-apiserver-u7s-lima-flux-0
9521080e367b7 d058aa5ab969c 4 hours ago Running kube-controller-manager 0 c76cf2b2f0b3a kube-controller-manager-u7s-lima-flux-0
9515db7b8fb1c e3db313c6dbc0 4 hours ago Running kube-scheduler 0 54bedc845ed7a kube-scheduler-u7s-lima-flux-0
These just look like images for the kubelet or control plane (not any applications), and interestingly, there aren't any subuids in the files here:
# cat /etc/subuid
# cat /etc/subgid
# both empty files
Is there a bug here / something we can do to get it to work?
Also, this seems to be an issue with creating (maybe?) more pods than my resources (cpu, etc) can support. I deployed a smaller pytorch workflow ref and it worked! @AkihiroSuda this is SO cool it's rocking my socks!! 🧦 This is what we wanted to get working many months ago and I'm over the moon it's starting to! 🌔
These just look like images for the kubelet or control plane (not any applications), and interestingly, there aren't any subuids in the files here:
Please check the files on the host.
You probably have 65536 ids there.
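To illustrate what that host entry means, here is a sketch of reading a stock /etc/subuid line ("user:start:count"); the username and values are assumed defaults, not copied from this host:

```shell
# A typical default /etc/subuid entry; "vsoch" is an assumption here.
entry="vsoch:100000:65536"

# Split the user:start:count fields with parameter expansion.
user=${entry%%:*}
rest=${entry#*:}
start=${rest%%:*}
count=${rest#*:}

echo "$user can map $count container IDs (host IDs $start..$((start + count - 1)))"
```

With the default count of 65536, the largest mappable container UID is 65535, far below the 1185200044 the image needs.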
Yes, the uid/gid files for the host virtual machine (not inside of docker compose) look OK.
Please try increasing 65536 there to a larger number
Please try increasing 65536 there to a larger number
Sure! I've never done that on my host. How large should it be?
Depends, but at least 1185200044 for your image:
for UID 1185200044, GID 1185200044: lchown /var/lib/containerd/tmpmounts/containerd-mount180165242/opt/conda/pkgs/pytorch-1.0.0-py3.6_cuda10.0.130_cudnn7.4.1_1/bin: invalid argument (Hint: try increasing the number of subordinate IDs in /etc/subuid and /etc/subgid)
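Putting the numbers together, a sketch of composing an /etc/subuid-style line large enough to cover the failing UID. This writes to a demo file rather than /etc/subuid, and the username "vsoch" and start of 100000 are assumptions for illustration:

```shell
# Compose a "user:start:count" line covering the UID from the error.
# Written to a demo file, NOT to /etc/subuid; adjust the user to match
# your host account before applying for real.
user=vsoch
start=100000
needed_uid=1185200044
count=$((needed_uid + 1))   # container UIDs 0..count-1 become mappable

printf '%s:%d:%d\n' "$user" "$start" "$count" > demo_subuid
cat demo_subuid
```

The same line would need to go into both /etc/subuid and /etc/subgid on the host (the hint in the error mentions both files), followed by restarting Usernetes so the enlarged range is picked up.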