runsc doesn't work with rootless podman
sdeoras opened this issue · comments
I am trying to evaluate gVisor via [podman](https://github.com/containers/libpod), which allows container creation in rootless mode. gVisor works fine via sudo but panics in rootless mode. Below are the stack trace and other relevant info.
system info:
uname -a
Linux 4.18.0-21-generic #22~18.04.1-Ubuntu SMP Thu May 16 15:07:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
runsc --version
runsc version 90a116890fce
spec: 1.0.1-dev
permissions on runsc
ls -la `which runsc`
-rwxr-xr-x 1 root root 20123510 Jun 4 01:20 /usr/bin/runsc
podman --version
podman version 1.3.2-dev
works fine with sudo:
sudo podman --runtime=runsc run --rm -it docker.io/library/ubuntu:latest bash
runc works fine in both root and rootless modes
sudo podman --runtime=runc run --rm -it docker.io/library/ubuntu:latest bash
podman --runtime=runc run --rm -it docker.io/library/ubuntu:latest bash
panics when running in podman/rootless mode
stack trace:
podman --runtime=runsc run --rm -it docker.io/library/ubuntu:latest bash
I0604 20:47:56.621539 21217 x:0] ***************************
I0604 20:47:56.621611 21217 x:0] Args: [/usr/bin/runsc start fb6738612f208a2786470ab33803763b976290531168c3716ea72b30ae74f310]
I0604 20:47:56.621673 21217 x:0] Version 90a116890fce
I0604 20:47:56.621686 21217 x:0] PID: 21217
I0604 20:47:56.621698 21217 x:0] UID: 0, GID: 0
I0604 20:47:56.621706 21217 x:0] Configuration:
I0604 20:47:56.621712 21217 x:0] RootDir: /run/user/1000/runsc
I0604 20:47:56.621720 21217 x:0] Platform: ptrace
I0604 20:47:56.621734 21217 x:0] FileAccess: exclusive, overlay: false
I0604 20:47:56.621744 21217 x:0] Network: sandbox, logging: false
I0604 20:47:56.621755 21217 x:0] Strace: false, max size: 1024, syscalls: []
I0604 20:47:56.621762 21217 x:0] ***************************
I0604 20:47:56.625479 21217 x:0] Setting up network
I0604 20:47:56.625961 21217 x:0] Applying namespace network at path "/proc/21187/ns/net"
I0604 20:47:56.626170 21217 x:0] Skipping down interface: {Index:1 MTU:65536 Name:lo HardwareAddr: Flags:loopback}
W0604 20:47:56.626272 21217 x:0] IPv6 is not supported, skipping: fe80::40b5:4cff:fe3c:9d9/64
W0604 20:47:56.649254 21217 x:0] IPv6 is not supported, skipping route: {Ifindex: 2 Dst: fe80::/64 Src: <nil> Gw: <nil> Flags: [] Table: 254}
I0604 20:47:56.649925 21217 x:0] Restoring namespace network
panic: error restoring namespace: of type network: operation not permitted
goroutine 1 [running, locked to thread]:
gvisor.googlesource.com/gvisor/runsc/specutils.ApplyNS.func1()
runsc/specutils/namespace.go:146 +0x29d
gvisor.googlesource.com/gvisor/runsc/sandbox.joinNetNS.func1()
runsc/sandbox/network.go:119 +0x24
gvisor.googlesource.com/gvisor/runsc/sandbox.createInterfacesAndRoutesFromNS(0xc00019eb60, 0xc0001d6160, 0x12, 0xc0001d6101, 0xe27480, 0xc00019ec40)
runsc/sandbox/network.go:274 +0x10d0
gvisor.googlesource.com/gvisor/runsc/sandbox.setupNetwork(0xc00019eb60, 0x52c3, 0xc000097420, 0xc000178000, 0x2, 0xc000068080)
runsc/sandbox/network.go:71 +0x380
gvisor.googlesource.com/gvisor/runsc/sandbox.(*Sandbox).StartRoot(0xc0001558c0, 0xc000097420, 0xc000178000, 0x0, 0x0)
runsc/sandbox/sandbox.go:139 +0x192
gvisor.googlesource.com/gvisor/runsc/container.(*Container).Start(0xc0000d23c0, 0xc000178000, 0x0, 0x0)
runsc/container/container.go:397 +0x288
gvisor.googlesource.com/gvisor/runsc/cmd.(*Start).Execute(0x14e48c0, 0xe38480, 0xc000044008, 0xc0001684e0, 0xc000136780, 0x2, 0x2, 0x7fcc5f2b4008)
runsc/cmd/start.go:61 +0x139
github.com/google/subcommands.(*Commander).Execute(0xc000096000, 0xe38480, 0xc000044008, 0xc000136780, 0x2, 0x2, 0x13)
external/com_github_google_subcommands/subcommands.go:141 +0x2fb
github.com/google/subcommands.Execute(...)
external/com_github_google_subcommands/subcommands.go:371
main.main()
runsc/main.go:245 +0x1452
Killed
I don't have experience with podman, so I don't really know what rootless podman is doing. The code above joined the network namespace to configure the network and is trying to restore back to the original namespace, which should be allowed. Not sure why it's failing.
Having said that, runsc currently requires the caller to be root. It would be nice to make runsc work rootless under a flag, especially for `runsc do`, but we don't have immediate plans to do that. If you are interested in working on it, I can help you get started...
@avagin has poked around this recently.
I don't think we require the caller to be root, we just create new namespaces by default.
I think there's an explicit test for this behavior with `runsc do`:
https://github.com/google/gvisor/blob/master/tools/run_tests.sh#L212
Maybe there would be a way to detect that we already have sufficient namespaces and skip creating them, versus having to pass `--netns=none`.
@fvoznika, thanks for taking a look at the bug. If there is nothing fundamentally blocking runsc from running in rootless mode, I would be interested in helping resolve this bug. Let me spend some time with the code so I have relevant questions to ask you. Thanks again!
Ideally we would be able to run `runsc --rootless <cmd>` without the need to be root or to unshare. However, this is difficult because many of the defense-in-depth steps that runsc takes require CAP_SYS_ADMIN.
In summary, runsc enters/creates namespaces, maps users/groups in the new namespace, calls pivot_root and chroot, and mounts /proc inside the new root. Many of these operations require CAP_SYS_ADMIN, which `unshare -Ur` solves. However, mounting /proc requires being the real root, not only root inside a user namespace. This is why the test @amscanne pointed to uses `--TESTONLY-unsafe-nonroot`. This flag makes the sandbox run as the same user that called `runsc create` (in this case, root inside the user namespace `unshare` created), and it will not chroot the sandbox process to an empty directory.
I think we can remove the /proc usage so that `--TESTONLY-unsafe-nonroot` is no longer needed, only CAP_SYS_ADMIN, CAP_SYS_CHROOT, etc., which are acquired via unshare. I'm not sure what kind of configuration controls you have with podman. Is there a way to configure it (or add a plugin) so that it creates a new user namespace and executes runsc as root inside that namespace? If not, maybe create a wrapper script that intercepts calls to `runsc create ...` and calls `unshare -Ur runsc create ...`?
@prattmic `runsc create` requires setting uid and gid mappings, which has to be done via newuidmap.
@giuseppe PTAL
Podman already does the user namespace setup and configuration.
AFAICS, the issue seems to be in this function: https://github.com/google/gvisor/blob/master/runsc/specutils/namespace.go#L143-L149
`oldNS` points to a namespace not owned by the rootless user (in this case the network namespace on the host), so gVisor fails to re-join it when running in a user namespace.
A possible solution is to run the code in a goroutine and, on error, keep the OS thread locked, so that the Go runtime will destroy the underlying thread when the goroutine ends. From https://golang.org/pkg/runtime/#LockOSThread:
> LockOSThread wires the calling goroutine to its current operating system thread. The calling goroutine will always execute in that thread, and no other goroutine will execute in it, until the calling goroutine has made as many calls to UnlockOSThread as to LockOSThread. If the calling goroutine exits without unlocking the thread, the thread will be terminated.
So there is no risk that another goroutine will run in the wrong namespace.
> Podman already does the user namespace setup and configuration.
Yeah, "rootless" here is not the same rootless that we usually think about. podman creates a user namespace, sets user and group mappings, and executes gVisor there as the root user with all capabilities.
The idea with LockOSThread is good, but we fork gofer and sandbox processes with pdeathsig, which means they die when their parent thread exits. We can block the current system thread if one of the namespaces can't be restored.
With the following changes, I was able to start a podman rootless container: avagin@db868af
I used this wrapper for runsc to set custom options:

```shell
$ cat /usr/local/bin/runsc-podman
#!/bin/bash
/usr/local/bin/runsc --network host --ignore-cgroups --debug --debug-log '/tmp/runsc/runsc.log.%TEST%.%TIMESTAMP%.%COMMAND%' "$@"
```
And now we are ready to run a container:

```shell
$ podman --runtime /usr/local/bin/runsc-podman run --security-opt=label=disable docker.io/library/busybox echo Hello, World
Hello, World
```
> The idea with LockOSThread is good, but we fork gofer and sandbox processes with pdeathsig, which means they die when their parent thread exits. We can block the current system thread if one of the namespaces can't be restored.
Should the thread be locked in any case? I had to troubleshoot a similar error in the past: containers/storage#530
It turned out that the Go runtime can terminate threads at will, without any way to control it from the application (at least I didn't find one).
> It turned out that the Go runtime can terminate threads at will, without any way to control it from the application (at least I didn't find one).
I have never seen the Go runtime destroy system threads, except for the case where a goroutine locked to a system thread exits.
The wrapper provided in #311 (comment) worked for me to use runsc in rootless podman, but it broke again recently (in 20230320.0 and also the version before it; it worked two versions before that one). I'm getting this from runsc's debug log:
$ cat /tmp/runsc/runsc.log..20230323-101913.399926.create
I0323 10:19:13.400219 108938 main.go:222] ***************************
I0323 10:19:13.400376 108938 main.go:223] Args: [/usr/bin/runsc --network host --ignore-cgroups --debug-log /tmp/runsc/runsc.log.%TEST%.%TIMESTAMP%.%COMMAND% --systemd-cgroup create --bundle /home/fishy/.local/share/containers/storage/overlay-containers/71b85f92c1756e2f6e10da0ef005dbfb8584164a52e2c694ae1c051f678547f7/userdata --pid-file /run/user/1000/containers/overlay-containers/71b85f92c1756e2f6e10da0ef005dbfb8584164a52e2c694ae1c051f678547f7/userdata/pidfile 71b85f92c1756e2f6e10da0ef005dbfb8584164a52e2c694ae1c051f678547f7]
I0323 10:19:13.400483 108938 main.go:224] Version release-20230320.0
I0323 10:19:13.400544 108938 main.go:225] GOOS: linux
I0323 10:19:13.400603 108938 main.go:226] GOARCH: amd64
I0323 10:19:13.400664 108938 main.go:227] PID: 108938
I0323 10:19:13.400728 108938 main.go:228] UID: 0, GID: 0
I0323 10:19:13.400789 108938 main.go:229] Configuration:
I0323 10:19:13.400848 108938 main.go:230] RootDir: /run/user/1000/runsc
I0323 10:19:13.400908 108938 main.go:231] Platform: ptrace
I0323 10:19:13.400967 108938 main.go:232] FileAccess: exclusive
I0323 10:19:13.401031 108938 main.go:233] Directfs: false
I0323 10:19:13.401091 108938 main.go:235] Overlay: Root=true, SubMounts=false, Medium="self"
I0323 10:19:13.401153 108938 main.go:236] Network: host, logging: false
I0323 10:19:13.401217 108938 main.go:237] Strace: false, max size: 1024, syscalls:
I0323 10:19:13.401277 108938 main.go:238] IOURING: false
I0323 10:19:13.401337 108938 main.go:239] Debug: false
I0323 10:19:13.401397 108938 main.go:240] Systemd: true
I0323 10:19:13.401456 108938 main.go:241] ***************************
W0323 10:19:13.404457 108938 specutils.go:123] noNewPrivileges ignored. PR_SET_NO_NEW_PRIVS is assumed to always be set.
I0323 10:19:13.406269 108938 namespace.go:217] Mapping host uid 1 to container uid 0 (size=1000)
I0323 10:19:13.406314 108938 namespace.go:217] Mapping host uid 0 to container uid 1000 (size=1)
I0323 10:19:13.406337 108938 namespace.go:217] Mapping host uid 1001 to container uid 1001 (size=64536)
I0323 10:19:13.406356 108938 namespace.go:225] Mapping host gid 1 to container gid 0 (size=1000)
I0323 10:19:13.406375 108938 namespace.go:225] Mapping host gid 0 to container gid 1000 (size=1)
I0323 10:19:13.406394 108938 namespace.go:225] Mapping host gid 1001 to container gid 1001 (size=64536)
I0323 10:19:13.410801 108938 container.go:1241] Gofer started, PID: 108945
I0323 10:19:13.411928 108938 sandbox.go:684] Control socket: ""
I0323 10:19:13.412063 108938 sandbox.go:720] Sandbox will be started in new mount, IPC and UTS namespaces
I0323 10:19:13.412105 108938 sandbox.go:730] Sandbox will be started in the current PID namespace
I0323 10:19:13.412139 108938 sandbox.go:741] Sandbox will be started in the container's network namespace: {Type:network Path:}
I0323 10:19:13.412281 108938 sandbox.go:761] Sandbox will be started in container's user namespace: {Type:user Path:}
I0323 10:19:13.412373 108938 namespace.go:217] Mapping host uid 1 to container uid 0 (size=1000)
I0323 10:19:13.412396 108938 namespace.go:217] Mapping host uid 0 to container uid 1000 (size=1)
I0323 10:19:13.412415 108938 namespace.go:217] Mapping host uid 1001 to container uid 1001 (size=64536)
I0323 10:19:13.412434 108938 namespace.go:225] Mapping host gid 1 to container gid 0 (size=1000)
I0323 10:19:13.412453 108938 namespace.go:225] Mapping host gid 0 to container gid 1000 (size=1)
I0323 10:19:13.412472 108938 namespace.go:225] Mapping host gid 1001 to container gid 1001 (size=64536)
I0323 10:19:13.412704 108938 sandbox.go:779] Sandbox will be started in minimal chroot
W0323 10:19:13.412813 108938 sandbox.go:1360] can't change an owner of /dev/stdin: chown /dev/stdin: operation not permitted
I0323 10:19:13.417543 108938 sandbox.go:978] Sandbox started, PID: 108950
W0323 10:19:13.538708 108938 util.go:64] FATAL ERROR: creating container: cannot create sandbox: cannot read client sync file: waiting for sandbox to start: EOF
W0323 10:19:13.539099 108938 main.go:267] Failure to execute command, err: 1
So I think there's a regression in a recent change?