containerd / nerdctl

contaiNERD CTL - Docker-compatible CLI for containerd, with support for Compose, Rootless, eStargz, OCIcrypt, IPFS, ...

`-test.kill-daemon` does not work on Ubuntu 24.04 (`Failed to kill unit containerd.service: Failed to send signal SIGKILL to auxiliary processes: Invalid argument\n`)

AkihiroSuda opened this issue

func GetDaemonIsKillable() bool {
	if flagTestKillDaemon && strings.HasPrefix(infoutil.DistroName(), "Ubuntu 24.04") { // FIXME: check systemd version, not distro
		log.L.Warn("FIXME: Ignoring -test.kill-daemon: the flag does not seem to work on Ubuntu 24.04")
		// > Failed to kill unit containerd.service: Failed to send signal SIGKILL to auxiliary processes: Invalid argument\n
		// https://github.com/containerd/nerdctl/pull/3129#issuecomment-2185780506
		return false
	}
	return flagTestKillDaemon
}
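
For what it's worth, the FIXME above (check the systemd version rather than the distro name) could be sketched roughly as follows. This is only an illustration: parseSystemdVersion, leadingDigits, daemonIsKillable and firstBrokenVersion are hypothetical names, not nerdctl code, and the 255 threshold is an assumption based solely on the two versions seen so far (249 working, 255 failing).

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// firstBrokenVersion is an ASSUMPTION: only 249 (working) and 255 (failing)
// have been observed, so the real cut-off may lie anywhere in between.
const firstBrokenVersion = 255

// leadingDigits returns the longest prefix of s made of ASCII digits.
func leadingDigits(s string) string {
	for i, r := range s {
		if r < '0' || r > '9' {
			return s[:i]
		}
	}
	return s
}

// parseSystemdVersion extracts the major version from the first line of
// `systemctl --version` output, which starts with "systemd <version>".
func parseSystemdVersion(out string) (int, error) {
	firstLine, _, _ := strings.Cut(out, "\n")
	fields := strings.Fields(firstLine)
	if len(fields) < 2 || fields[0] != "systemd" {
		return 0, fmt.Errorf("unexpected systemctl --version output: %q", firstLine)
	}
	return strconv.Atoi(leadingDigits(fields[1]))
}

// daemonIsKillable mirrors GetDaemonIsKillable, keyed on the systemd
// version instead of the distro name.
func daemonIsKillable(flagTestKillDaemon bool, systemdVersion int) bool {
	return flagTestKillDaemon && systemdVersion < firstBrokenVersion
}

func main() {
	out, err := exec.Command("systemctl", "--version").Output()
	if err != nil {
		fmt.Println("systemctl not available:", err)
		return
	}
	version, err := parseSystemdVersion(string(out))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("systemd %d, -test.kill-daemon usable: %v\n", version, daemonIsKillable(true, version))
}
```

Keeping the parsing separate from the exec call makes it testable without a running systemd.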

This is intermittent. systemctl kill -s KILL containerd will work most of the time, and sometimes fail - pretty much whether or not there are other processes in the process group.

FWIW: https://github.com/systemd/systemd/blob/main/src/core/unit.c#L4083

We do get EINVAL, probably from cg_kill_recursive, whose implementation seems to have changed quite a bit between 22.04 (systemd 249) and 24.04 (systemd 255).

That being said, here is the log from 249 (working):

Jul 04 17:41:20 lima-dock systemd[1]: containerd.service: Sent signal SIGKILL to main process 1106603 (containerd) on client request.
Jul 04 17:41:20 lima-dock systemd[1]: containerd.service: Sending signal SIGKILL to process 1106604 (containerd) on client request.
Jul 04 17:41:20 lima-dock systemd[1]: containerd.service: Sending signal SIGKILL to process 1106605 (containerd) on client request.
Jul 04 17:41:20 lima-dock systemd[1]: containerd.service: Sending signal SIGKILL to process 1106607 (containerd) on client request.
Jul 04 17:41:20 lima-dock systemd[1]: containerd.service: Sending signal SIGKILL to process 1106608 (n/a) on client request.
Jul 04 17:41:20 lima-dock systemd[1]: containerd.service: Sending signal SIGKILL to process 1106609 (n/a) on client request.
Jul 04 17:41:20 lima-dock systemd[1]: containerd.service: Sending signal SIGKILL to process 1106610 (containerd) on client request.
Jul 04 17:41:20 lima-dock systemd[1]: containerd.service: Sending signal SIGKILL to process 1106612 (containerd) on client request.
Jul 04 17:41:20 lima-dock systemd[1]: containerd.service: Sending signal SIGKILL to process 1106613 (n/a) on client request.
Jul 04 17:41:20 lima-dock systemd[1]: containerd.service: Sending signal SIGKILL to process 1106614 (containerd) on client request.
Jul 04 17:41:21 lima-dock systemd[1]: containerd.service: Child 1106603 belongs to containerd.service.
Jul 04 17:41:21 lima-dock systemd[1]: containerd.service: Main process exited, code=killed, status=9/KILL
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit containerd.service has exited.
░░
░░ The process' exit code is 'killed' and its exit status is 9.
Jul 04 17:41:21 lima-dock systemd[1]: containerd.service: Failed with result 'signal'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit containerd.service has entered the 'failed' state with result 'signal'.
Jul 04 17:41:21 lima-dock systemd[1]: containerd.service: Service will restart (restart setting)
Jul 04 17:41:21 lima-dock systemd[1]: containerd.service: Changed running -> failed
Jul 04 17:41:21 lima-dock systemd[1]: containerd.service: Unit entered failed state.
Jul 04 17:41:21 lima-dock systemd[1]: containerd.service: Consumed 101ms CPU time.

Here is the log for 255 (failing):

Failed to kill unit containerd.service: Failed to send signal SIGKILL to auxiliary processes: Invalid argument

Jul 04 17:41:25 lima-24 systemd[1]: containerd.service: Sent signal SIGKILL to main process 51625 (containerd) on client request.
Jul 04 17:41:25 lima-24 systemd[1]: containerd.service: Failed to send signal SIGKILL to auxiliary processes on client request: Invalid argument
Jul 04 17:41:25 lima-24 systemd[1]: containerd.service: Child 51625 belongs to containerd.service.
Jul 04 17:41:25 lima-24 systemd[1]: containerd.service: Main process exited, code=killed, status=9/KILL
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit containerd.service has exited.
░░
░░ The process' exit code is 'killed' and its exit status is 9.
Jul 04 17:41:25 lima-24 systemd[1]: containerd.service: Failed with result 'signal'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit containerd.service has entered the 'failed' state with result 'signal'.
Jul 04 17:41:25 lima-24 systemd[1]: containerd.service: Service will restart (restart setting)
Jul 04 17:41:25 lima-24 systemd[1]: containerd.service: Changed running -> failed-before-auto-restart
Jul 04 17:41:25 lima-24 systemd[1]: containerd.service: Unit entered failed state.
Jul 04 17:41:25 lima-24 systemd[1]: containerd.service: Consumed 150ms CPU time.

In both cases, the main process DOES get killed.

It looks to me like, in both cases, systemctl ignores KillMode=process and tries to kill the children in the group (which may fail in a racy way, because containerd has already taken care of them?), perhaps because it does not recognize -s KILL as the configured KillSignal (SIGTERM by default). This was evidently not a hard error with 249, but it is with 255.

Anyhow, here are my 2 cents:

  • I think the message is a red herring: even if the kill command exits with an error, it actually worked as expected for us (it did kill containerd), so it seems to me we are safe just ignoring the error.
  • Instead of calling kill -s KILL, we should be able to just call kill, which will work properly, without error, as systemd will then honor KillMode=process instead of face-planting trying to kill children that are no longer there. Note that a plain kill will first send SIGTERM, then SIGKILL after that, so it is not exactly the same behavior.
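
To make the first bullet (ignoring the error) concrete, here is a rough sketch. It treats only the specific "auxiliary processes" failure as ignorable and still surfaces anything else. isAuxiliaryKillError and killUnit are hypothetical helper names, and matching on the stderr text is an assumption about how the error could be recognized; it is only safe if the main process really did die, as in the logs above.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// auxKillFailure is the fragment of the systemctl error reported in this issue.
const auxKillFailure = "Failed to send signal SIGKILL to auxiliary processes"

// isAuxiliaryKillError reports whether stderr carries the failure seen on
// systemd 255, where the main process was killed anyway.
func isAuxiliaryKillError(stderr string) bool {
	return strings.Contains(stderr, auxKillFailure)
}

// killUnit runs `systemctl kill -s KILL <unit>` and swallows only the
// auxiliary-processes failure; every other error is returned.
func killUnit(unit string) error {
	var stderr bytes.Buffer
	cmd := exec.Command("systemctl", "kill", "-s", "KILL", unit)
	cmd.Stderr = &stderr
	err := cmd.Run()
	if err == nil || isAuxiliaryKillError(stderr.String()) {
		return nil
	}
	return fmt.Errorf("systemctl kill %s: %w: %s", unit, err, strings.TrimSpace(stderr.String()))
}

func main() {
	// Only the classifier is exercised here, so running this sketch does not
	// actually kill anything; call killUnit("containerd") to do that.
	msg := "Failed to kill unit containerd.service: " + auxKillFailure + ": Invalid argument"
	fmt.Println("ignorable:", isAuxiliaryKillError(msg))
}
```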

What do you think?

Am I missing something here?

Sending a first tentative PR removing the explicit signal. Let's see if things are happy.

If not, we can move to plan b: ignore the error.

And finally there is always plan c: get containerd pid, and kill manually, bypassing systemctl entirely.
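
A minimal sketch of what plan c might look like, assuming the main PID is read via `systemctl show --property=MainPID --value` and the signal is then sent directly. parseMainPID and killMainProcess are hypothetical names for illustration, not existing nerdctl helpers, and the sketch is Linux-only since it relies on syscall.Kill.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
	"syscall"
)

// parseMainPID parses the output of
// `systemctl show --property=MainPID --value <unit>`.
// systemd reports 0 when the unit has no main process.
func parseMainPID(out string) (int, error) {
	pid, err := strconv.Atoi(strings.TrimSpace(out))
	if err != nil {
		return 0, fmt.Errorf("unexpected MainPID output %q: %w", out, err)
	}
	if pid <= 0 {
		return 0, fmt.Errorf("unit has no main process (MainPID=%d)", pid)
	}
	return pid, nil
}

// killMainProcess looks up the unit's main PID and sends SIGKILL to it
// directly, so `systemctl kill` is never involved.
func killMainProcess(unit string) error {
	out, err := exec.Command("systemctl", "show", "--property=MainPID", "--value", unit).Output()
	if err != nil {
		return fmt.Errorf("systemctl show %s: %w", unit, err)
	}
	pid, err := parseMainPID(string(out))
	if err != nil {
		return err
	}
	return syscall.Kill(pid, syscall.SIGKILL)
}

func main() {
	// Only the parsing step is exercised here, so running this sketch does
	// not kill anything; call killMainProcess("containerd") to do that.
	pid, err := parseMainPID("1106603\n")
	fmt.Println(pid, err)
}
```

Since only the main process is signaled, this matches what the logs show actually happening in both the 249 and 255 cases.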