srl-labs / clabernetes

containerlab, but in kubernetes!

Failure to launch pods in EKS (containerd)

flaminidavid opened this issue

Looks like the launcher fails to run Docker due to missing privileges. Oddly enough the capabilities as reported by containerd look fine. $UID says 0.

I originally tried a few clabernetes versions (as I suspected some capabilities issue) and legacy iptables, with no luck. I'm attaching a debug session showing where it fails and the capabilities applied to the containers. Any pointers appreciated.

docker.txt
node.txt

👋

hey there! can you confirm that the deployment is privileged? if it is (which should be the default if you are on latest/greatest... never mind, I can read. also I don't remember why 0.0.30 is a pre-release, but I guess that shouldn't matter!) then I suspect I have nothing useful to add!

I don't have any EKS access so I'm not sure what else would be funky there, but assuming you have a privileged pod you "should" be able to do whatever, of course.

generally I'd try to kill the controller (so it lets you mess about without re-reconciling; there is also a label you can add to the topology to tell the controller not to reconcile: clabernetes/ignoreReconcile), then change the entrypoint of the launcher to sleep infinity, exec in, and play around to see if anything else interesting pops up.
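roughly, that flow looks like the sketch below (the manager/launcher deployment names, namespaces, and the label value are placeholders I'm guessing at, adjust to whatever your cluster actually has):

# stop the manager from re-reconciling while you poke around (deployment/namespace names are guesses)
kubectl -n c9s-system scale deployment clabernetes-manager --replicas=0

# or: tell the controller to skip just this topology (label value assumed)
kubectl -n c9s-blah label topology my-topo clabernetes/ignoreReconcile=true

# swap the launcher entrypoint for sleep so the pod stays up for poking around
kubectl -n c9s-blah patch deployment my-topo-node1 --type=json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/command","value":["sleep","infinity"]}]'

# exec in and play around
kubectl -n c9s-blah exec -it deploy/my-topo-node1 -- bash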

Hey, thanks! Got the privilege issue sorted (basically --profile=netadmin), but it still fails when exporting the image. I tried changing the discard_unpacked_layers setting on the node, but nothing changed. It looks like it finds the image after the puller downloads it but cannot move it over. Is there any way to mount the pull secret on the launcher? (It works just fine if I do it manually.)

INFO |               clabernetes | image "IMAGE" is present, begin copy to docker daemon...
INFO |               clabernetes | time="2024-05-01T14:32:12Z" level=fatal msg="failed to resolve reference \"IMAGE\": unexpected status from HEAD request to https://REGISTRY: 401 Unauthorized"

WARN |               clabernetes | image re-pull failed, this can happen when containerd sets `discard_unpacked_layers` to true and we don't have appropriate pull secrets for pulling the image. will continue attempting image pull through but this may fail, error: exit status 1

Oh this is what gives us pain...

But before we get to the image pull, can you elaborate, for my understanding, on what the issue was with privileged mode and the profiles?
I would like to document this.

As for the image move, we have been trying to chat with various people from cloud providers about an option to allow copying an image that was already pulled to a worker, but it seems that unless a cloud provider lets you configure the containerd runtime, there is not much we can do...

At the same time, the image "move", albeit super annoying, shouldn't stop the image from being pulled into a launcher; the launcher will just have to pull it again instead of copying it from the worker.

--profile=netadmin

as in a kubectl flag, or something else? the deployment the manager spawns should be privileged by default, so there "shouldn't" be anything else you need to do unless there is some AWS/EKS-specific thing you had to do (and if so, please elaborate for future me/folks :))

Is there any way to mount the pull secret on the launcher

not exactly, no. you can mount a docker config, which could include whatever auth you want. otherwise this setup is intentional so that we never have pull secrets on the launcher and so that we can use the host nodes to cache the images. this whole layer nonsense really blows the spot though. but... like roman said, this shouldn't stop things (assuming the image is publicly pullable); it will just stop us from using the host nodes as the cache/letting the k8s pull stuff work, and the launcher will then just fall back to pulling the image directly (from its docker daemon, I mean).

if you're looking for the docker daemon config: you set it via spec.imagePull.dockerDaemonConfig, it points to some secret in the namespace of your topology, and we just mount it for you.
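concretely, something like this (the secret name and the key inside it are placeholders; I don't remember off-hand exactly which key the launcher expects, so treat it as a sketch):

# daemon.json holding whatever daemon-level settings you need (insecure registries, proxies, etc.)
kubectl -n c9s-blah create secret generic my-daemon-config --from-file=daemon.json

# point the topology at the secret and we mount it into the launcher for you
kubectl -n c9s-blah patch topology my-topo --type=merge \
  -p '{"spec":{"imagePull":{"dockerDaemonConfig":"my-daemon-config"}}}'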

quickly replying here:

Privileged profile: I had been fiddling with pod security standards as per https://kubernetes.io/docs/concepts/security/pod-security-standards/. I ended up applying the privileged profile to the clab namespace. Also, when debugging with kubectl debug you need to pass the profile, e.g., kubectl debug -n c9s-blah --profile=[ sysadmin | netadmin ].
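In concrete terms, that boiled down to something like this (the namespace name is just my example; the PSS label is from the linked docs):

# allow privileged pods in the topology namespace (Pod Security Standards)
kubectl label namespace c9s-blah pod-security.kubernetes.io/enforce=privileged --overwrite

# and pass a profile to kubectl debug, otherwise the ephemeral container is too locked down
kubectl debug -n c9s-blah -it <pod> --image=busybox --profile=netadmin -- sh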

Pull secrets: The registry is private (it has to be in this case). The puller works fine with the pull secrets provided; it's moving the image over to Docker from the launcher that fails. I found spec.imagePull.dockerDaemonConfig by looking at the code; the issue with passing the Docker daemon config is that there's nowhere to provide auth creds in there (that I know of).

re priv -- ok, so you had to set PSS stuff, which I assume has some tighter defaults on EKS; but after this, the launcher/docker should run fine, right? (assuming so since the pod should be privileged)

yeah, the discard_unpacked_layers thing is the bane of our existence. you mentioned you tried to set it but it still fails? it "shouldn't" if that is set nicely. do you manage the nodes? did containerd get restarted and the image re-pulled? I imagine that even if it was all set nicely, an image that was already present would have to get re-pulled so it has all the layers.
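for reference, the knob lives in the CRI section of containerd's config on the worker node; a rough sketch of what I'd check (paths per containerd's v2 config layout, commands assume node access):

# /etc/containerd/config.toml on the worker:
#   [plugins."io.containerd.grpc.v1.cri".containerd]
#     discard_unpacked_layers = false
sudo systemctl restart containerd

# make sure the image gets pulled fresh so all its layers are actually kept on the node
sudo crictl rmi IMAGE
# ...then let the puller/kubelet pull it again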

regarding secrets/daemon config: blah, yeah, this is probably my bad since I was thinking you could stuff auth into the daemon config, but I probably never tested it since I just use the pull-through thing anyway. I can add spec.imagePull.dockerConfig and have it be basically the same setup as the daemon config -- then you can set auth that way. in the meantime, could you try to pop onto the launcher and do a docker login and pull your image, just to make sure that part works?
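something like this from inside the launcher (pod and registry names are placeholders):

kubectl -n c9s-blah exec -it <launcher-pod> -- bash
docker login REGISTRY            # use the same creds as your pull secret
docker pull REGISTRY/IMAGE:TAG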

thanks!

Privileged profile: yes, that's correct. All working fine now on that front.

discard_unpacked_layers: correct, I tried it but it still failed with the same message. Restarted containerd and all. Deleted the images so they would be pulled clean again. Maybe I messed something else up.

spec.imagePull.dockerConfig: this would be quite nice to have. I worked around it by spawning a local (insecure) registry inside the c9s-blah namespace and using my own puller.
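In case anyone wants to replicate that workaround, it was roughly the following (names and the complete lack of TLS/auth are deliberate shortcuts, lab use only; the launcher's Docker daemon also needs the registry listed as insecure, which is where the daemon config mount comes in):

kubectl -n c9s-blah apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: local-registry
spec:
  replicas: 1
  selector:
    matchLabels: {app: local-registry}
  template:
    metadata:
      labels: {app: local-registry}
    spec:
      containers:
      - name: registry
        image: registry:2
        ports:
        - containerPort: 5000
---
apiVersion: v1
kind: Service
metadata:
  name: local-registry
spec:
  selector: {app: local-registry}
  ports:
  - port: 5000
    targetPort: 5000
EOF

# retag/push images to local-registry.c9s-blah.svc.cluster.local:5000 and point the topology at them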

Now I'm facing a different problem when the launcher tries to bring up the VXLAN tunnels. I think what's happening is that if the payload container (i.e., the node that the launcher spins up) fails the first time, the launcher doesn't clean up after itself and vx-... interfaces are left behind. This causes it to fail the second time with "file exists" errors when it tries to create the same interface (or some dependency).

I deleted the leftover interface before the launcher got to that step (e.g., ip link del clab-ee8bc3a2, see below) and it sort of worked (the VXLAN setup didn't fail), but the node still exited after booting (there was connectivity between the nodes for a brief moment). There were some bridge interfaces left behind that I didn't try to delete. I ended up deleting the pod via kubectl so it would be recreated from scratch, and this worked. I'll keep testing; this was just a 2-node setup. I suppose we need a different issue for this new problem anyway.

10: clab-ee8bc3a2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether aa:c1:ab:e9:d7:a8 brd ff:ff:ff:ff:ff:ff
    alias vx-blah-eth1
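For completeness, the manual cleanup was just this (matching on the vx- alias is my own heuristic for spotting leftovers):

# spot leftover interfaces by their vx- alias
ip link show | grep -B2 'alias vx-'

# delete the stale one so the launcher can recreate it
ip link del clab-ee8bc3a2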

Interesting, thanks for the detailed report. I have never seen anything remotely similar to what you're describing, but maybe that's because I haven't had topologies that did not run to completion.

Typically, what I do when a topo is all messed up is delete the whole namespace where the topology was deployed. That is, I deploy my topo in its own namespace, which makes all c9s resources namespaced.
So when things go south, I remove the whole namespace and try again. Maybe useless advice, but it's something I'd try first =)

Privileged profile: yes, that's correct. All working fine now on that front.

nice, copy, thanks!

discard_unpacked_layers

continues to be the bane of our existence, and now yours too maybe 🤣

spec.imagePull.dockerConfig

cool, will try to work that up over the next week or so. not hard, just need some time to do it! if you are up for a PR that would be cool too -- basically it would just follow the same pattern as the daemon config, so it shouldn't be too terrible.

I'm not 100% sure I'm tracking on the remaining bit, but if/when the launcher fails, that container should restart, so I don't think there should be any cruft left. if you are talking about the services (vx-blah), those should be reconciled every time the topology is reconciled (and deleted via owner ref when the topo is deleted), so those shouldn't have cruft either.

I guess the general thing here is: the launcher should crash if it can't continue, and that crash should give us a new container with fresh stuff. if that's not the case we can fix it, but I think/hope that should work. by deleting the pod you are doing the same thing the launcher would do when it crashes, so I think/hope that will be ok. maybe the pod didn't restart because you changed the entrypoint or something while troubleshooting?

but yeah, if you can document it and get another issue rolling, that would be cool! then we can keep this one open to track the docker config bits! thanks @flaminidavid!

I'm running into some more EKS fun given the particularities of my setup. I'll report back when those are sorted (and open another issue for the interface cleanup on failure; it's still happening, although I believe the container failures are caused by something else). Thank you!

@flaminidavid fyi the docker config stuff is in v0.1.0 now -- so that should be one less problem for ya at least 🙃

Thanks a million. I got my setup to work consistently in EKS, still using a local registry for now. The last tranche of launcher failures was caused by a background process that identified some router container processes as a threat (can't talk specifics here). Just in case anyone else runs into this and rabbit-holes into debugging container boot, container processes, and VXLAN setup: it wasn't any of that. The cleanup probably works well in any other situation. I think we can close this one.

Good to know @flaminidavid
since you had a good run with EKS, do you think we can go as far as saying that c9s runs on EKS without manual tinkering, provided it is a clean EKS deployment and privileged pods can run there?

I'd say so, yes.

Aside from the discard_unpacked_layers issue, which we should be able to work around by passing a Docker config, everything else seems to work fine in general. "Just make sure nothing else is messing with the cluster" would be my advice 🙃.

sweet, ty.

I will have to try to onboard a simple Nokia SR Linux topology there to make sure I capture the docs steps, so others have a smoother, documented experience.
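for reference, I'm thinking of something minimal along these lines (a sketch from memory; double check the group/version and field names against the actual Topology CRD):

kubectl create namespace c9s-srl
kubectl -n c9s-srl apply -f - <<'EOF'
apiVersion: clabernetes.containerlab.dev/v1alpha1
kind: Topology
metadata:
  name: srl-2node
spec:
  definition:
    containerlab: |
      name: srl-2node
      topology:
        nodes:
          srl1:
            kind: nokia_srlinux
            image: ghcr.io/nokia/srlinux
          srl2:
            kind: nokia_srlinux
            image: ghcr.io/nokia/srlinux
        links:
          - endpoints: ["srl1:e1-1", "srl2:e1-1"]
EOF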

I will close this now.

Just confirming: overriding the Docker config via spec.imagePull.dockerConfig worked just fine (and I switched pullThroughOverride back from never to auto). This is working with AWS ECR, too.
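For anyone landing here later, the shape of that setup was roughly the following (the secret name is mine, and the key expected inside the secret is an assumption, so verify against the docs):

# stash a docker client config (with the registry auths, ECR included) in the topology namespace
kubectl -n c9s-blah create secret generic my-docker-config \
  --from-file=config.json=$HOME/.docker/config.json

# reference it from the topology and let pull-through behave normally again
kubectl -n c9s-blah patch topology my-topo --type=merge \
  -p '{"spec":{"imagePull":{"dockerConfig":"my-docker-config","pullThroughOverride":"auto"}}}'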