google / gvisor

Application Kernel for Containers

Home Page: https://gvisor.dev

runsc doesn't work with rootless podman

sdeoras opened this issue · comments

I am trying to evaluate the use of gVisor via [podman](https://github.com/containers/libpod), which allows container creation in rootless mode. gVisor works fine via sudo but panics in rootless mode. Below are the stack trace and other relevant info.

system info:

uname -a
Linux 4.18.0-21-generic #22~18.04.1-Ubuntu SMP Thu May 16 15:07:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
runsc --version
runsc version 90a116890fce
spec: 1.0.1-dev

permissions on runsc

ls -la `which runsc`
-rwxr-xr-x 1 root root 20123510 Jun  4 01:20 /usr/bin/runsc
podman --version
podman version 1.3.2-dev

works fine when sudo
sudo podman --runtime=runsc run --rm -it docker.io/library/ubuntu:latest bash

runc works fine in both root and rootless modes

sudo podman --runtime=runc run --rm -it docker.io/library/ubuntu:latest bash
podman --runtime=runc run --rm -it docker.io/library/ubuntu:latest bash

panics when running in podman/rootless mode
stack trace:

podman --runtime=runsc run --rm -it docker.io/library/ubuntu:latest bash
I0604 20:47:56.621539   21217 x:0] ***************************
I0604 20:47:56.621611   21217 x:0] Args: [/usr/bin/runsc start fb6738612f208a2786470ab33803763b976290531168c3716ea72b30ae74f310]
I0604 20:47:56.621673   21217 x:0] Version 90a116890fce
I0604 20:47:56.621686   21217 x:0] PID: 21217
I0604 20:47:56.621698   21217 x:0] UID: 0, GID: 0
I0604 20:47:56.621706   21217 x:0] Configuration:
I0604 20:47:56.621712   21217 x:0] 		RootDir: /run/user/1000/runsc
I0604 20:47:56.621720   21217 x:0] 		Platform: ptrace
I0604 20:47:56.621734   21217 x:0] 		FileAccess: exclusive, overlay: false
I0604 20:47:56.621744   21217 x:0] 		Network: sandbox, logging: false
I0604 20:47:56.621755   21217 x:0] 		Strace: false, max size: 1024, syscalls: []
I0604 20:47:56.621762   21217 x:0] ***************************
I0604 20:47:56.625479   21217 x:0] Setting up network
I0604 20:47:56.625961   21217 x:0] Applying namespace network at path "/proc/21187/ns/net"
I0604 20:47:56.626170   21217 x:0] Skipping down interface: {Index:1 MTU:65536 Name:lo HardwareAddr: Flags:loopback}
W0604 20:47:56.626272   21217 x:0] IPv6 is not supported, skipping: fe80::40b5:4cff:fe3c:9d9/64
W0604 20:47:56.649254   21217 x:0] IPv6 is not supported, skipping route: {Ifindex: 2 Dst: fe80::/64 Src: <nil> Gw: <nil> Flags: [] Table: 254}
I0604 20:47:56.649925   21217 x:0] Restoring namespace network
panic: error restoring namespace: of type network: operation not permitted

goroutine 1 [running, locked to thread]:
gvisor.googlesource.com/gvisor/runsc/specutils.ApplyNS.func1()
runsc/specutils/namespace.go:146 +0x29d
gvisor.googlesource.com/gvisor/runsc/sandbox.joinNetNS.func1()
runsc/sandbox/network.go:119 +0x24
gvisor.googlesource.com/gvisor/runsc/sandbox.createInterfacesAndRoutesFromNS(0xc00019eb60, 0xc0001d6160, 0x12, 0xc0001d6101, 0xe27480, 0xc00019ec40)
runsc/sandbox/network.go:274 +0x10d0
gvisor.googlesource.com/gvisor/runsc/sandbox.setupNetwork(0xc00019eb60, 0x52c3, 0xc000097420, 0xc000178000, 0x2, 0xc000068080)
runsc/sandbox/network.go:71 +0x380
gvisor.googlesource.com/gvisor/runsc/sandbox.(*Sandbox).StartRoot(0xc0001558c0, 0xc000097420, 0xc000178000, 0x0, 0x0)
runsc/sandbox/sandbox.go:139 +0x192
gvisor.googlesource.com/gvisor/runsc/container.(*Container).Start(0xc0000d23c0, 0xc000178000, 0x0, 0x0)
runsc/container/container.go:397 +0x288
gvisor.googlesource.com/gvisor/runsc/cmd.(*Start).Execute(0x14e48c0, 0xe38480, 0xc000044008, 0xc0001684e0, 0xc000136780, 0x2, 0x2, 0x7fcc5f2b4008)
runsc/cmd/start.go:61 +0x139
github.com/google/subcommands.(*Commander).Execute(0xc000096000, 0xe38480, 0xc000044008, 0xc000136780, 0x2, 0x2, 0x13)
external/com_github_google_subcommands/subcommands.go:141 +0x2fb
github.com/google/subcommands.Execute(...)
external/com_github_google_subcommands/subcommands.go:371
main.main()
runsc/main.go:245 +0x1452
Killed


I don't have experience with podman, so I don't really know what podman rootless is doing. The code above joined the network namespace to configure the network and is now trying to restore the original namespace, which should be allowed. I'm not sure why it's failing.
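For context, here is a rough sketch of the join-then-restore pattern described above. This is illustrative only, not gVisor's actual code (the real logic lives in runsc/specutils/namespace.go and panics, rather than returning an error, when the restore fails); it assumes golang.org/x/sys/unix.

```go
package sketch

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

// applyNetNS joins the network namespace at path, runs fn, then switches back
// to the original namespace. Under rootless podman the final Setns fails with
// EPERM because the original (host) network namespace is not owned by the
// rootless user's user namespace.
func applyNetNS(path string, fn func() error) error {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	oldNS, err := unix.Open("/proc/self/ns/net", unix.O_RDONLY|unix.O_CLOEXEC, 0)
	if err != nil {
		return fmt.Errorf("opening current netns: %w", err)
	}
	defer unix.Close(oldNS)

	newNS, err := unix.Open(path, unix.O_RDONLY|unix.O_CLOEXEC, 0)
	if err != nil {
		return fmt.Errorf("opening %q: %w", path, err)
	}
	defer unix.Close(newNS)

	if err := unix.Setns(newNS, unix.CLONE_NEWNET); err != nil {
		return fmt.Errorf("joining netns: %w", err)
	}
	fnErr := fn()

	// This is the step that fails with "operation not permitted" in rootless mode.
	if err := unix.Setns(oldNS, unix.CLONE_NEWNET); err != nil {
		return fmt.Errorf("restoring netns: %w", err)
	}
	return fnErr
}
```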

Having said that, runsc requires the caller to be root right now. It would be nice to make runsc work rootless under a flag, especially for runsc do, but we don't have immediate plans to do that. If you are interested in working on it, I can help you get started...

@avagin has poked around this recently.

I don't think we require the caller to be root, we just create new namespaces by default.

I think there's an explicit test for this behavior with runsc do:
https://github.com/google/gvisor/blob/master/tools/run_tests.sh#L212

Maybe there would be a way to detect that we already have sufficient namespaces and skip it, versus having to pass --netns=none?

@fvoznika, thanks for taking a look at the bug. If there is nothing fundamentally blocking runsc from running in rootless mode, I would be interested in helping to resolve this bug. Let me spend some time with the code so I have relevant questions to ask you. Thanks again!

Ideally we would be able to run runsc --rootless <cmd> without needing to be root or to use unshare. However, this is difficult because many of the defense-in-depth steps that runsc takes require CAP_SYS_ADMIN.

In summary, runsc enters/creates namespaces, maps users/groups in the new namespace, calls pivot_root and chroot, and mounts /proc inside the new root. Many of these operations require CAP_SYS_ADMIN, which unshare -Ur solves. However, mounting /proc requires being the real root, not just root inside a user namespace. This is why the test @amscanne pointed to uses --TESTONLY-unsafe-nonroot. This flag makes the sandbox run as the same user that called runsc create (in this case, root inside the user namespace that unshare created), and it does not chroot the sandbox process to an empty directory.
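To illustrate why unshare -Ur helps, here is a hedged sketch of its rough equivalent in Go's os/exec: start a child in a new user namespace with the invoking user mapped to root, so the child holds CAP_SYS_ADMIN (and friends) inside that namespace even though it is unprivileged on the host. The helper name and usage are illustrative, not part of runsc.

```go
package sketch

import (
	"os"
	"os/exec"
	"syscall"
)

// runAsNamespaceRoot starts a command in a new user namespace with the
// invoking user/group mapped to root inside it, similar to `unshare -Ur`.
func runAsNamespaceRoot(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUSER,
		// Equivalent of unshare's -r: map the current uid/gid to root.
		UidMappings: []syscall.SysProcIDMap{
			{ContainerID: 0, HostID: os.Getuid(), Size: 1},
		},
		GidMappings: []syscall.SysProcIDMap{
			{ContainerID: 0, HostID: os.Getgid(), Size: 1},
		},
	}
	return cmd.Run()
}
```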

I think we can remove the /proc usage so that --TESTONLY-unsafe-nonroot is not needed anymore, only CAP_SYS_ADMIN, CAP_SYS_CHROOT, etc., which are acquired via unshare. I'm not sure what kind of configuration controls you have with podman. Is there a way to configure it (or add a plugin) so that it creates a new user namespace and executes runsc as root inside this namespace? If not, maybe create a wrapper script that intercepts calls to runsc create ... and calls unshare -Ur runsc create ... instead?

@fvoznika 356d1be added support for --rootless in 'runsc do'. What's missing to make it work for 'runsc create'?

@prattmic runsc create requires uid and gid mappings, which have to be set via newuidmap.

@giuseppe PTAL
Podman already does the user namespace setup and configuration.

AFAICS, the issue seems to be in the function https://github.com/google/gvisor/blob/master/runsc/specutils/namespace.go#L143-L149

oldNS points to a namespace not owned by the rootless user (in this case the network namespace on the host), so gVisor fails to re-join it when running in a user namespace.

A possible solution is to run the code in a goroutine and, on error, keep the OS thread locked, so that the Go runtime will destroy the underlying thread when the goroutine ends. From https://golang.org/pkg/runtime/#LockOSThread:

LockOSThread wires the calling goroutine to its current operating system thread. The calling goroutine will always execute in that thread, and no other goroutine will execute in it, until the calling goroutine has made as many calls to UnlockOSThread as to LockOSThread. If the calling goroutine exits without unlocking the thread, the thread will be terminated.

so there is no risk that another goroutine will run in the wrong namespace.
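A minimal sketch of the proposed approach, with illustrative names (doInNS, join, restore are not gVisor's API): run the namespace work on a locked OS thread, and only hand the thread back to the runtime if the original namespace was restored successfully.

```go
package sketch

import "runtime"

func doInNS(join, restore func() error, fn func() error) error {
	errCh := make(chan error, 1)
	go func() {
		runtime.LockOSThread()
		if err := join(); err != nil {
			runtime.UnlockOSThread()
			errCh <- err
			return
		}
		fnErr := fn()
		if err := restore(); err == nil {
			// Back in the original namespace: safe to reuse this thread.
			runtime.UnlockOSThread()
		}
		// If restore failed, the goroutine exits while still locked, so the
		// Go runtime terminates the underlying OS thread instead of
		// scheduling other goroutines onto it in the wrong namespace.
		errCh <- fnErr
	}()
	return <-errCh
}
```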

This looks very promising, and my feeling is that this is only a few steps away. @fvoznika, would you have a chance to look at what @giuseppe recommended?

Podman already does the user namespace setup and configuration.

Yeah, rootless here is not the same rootless that we usually think about: podman creates a user namespace, sets user and group mappings, and executes gVisor there as the root user with all capabilities.

The idea with LockOSThread is good, but we fork the gofer and sandbox processes with pdeathsig, which means they die when their parent thread exits. We can instead block the current system thread if one of the namespaces can't be restored.
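For reference, a small sketch of forking a child with a parent-death signal, as described above for the gofer and sandbox (the command path is illustrative, not runsc's). The parent-death signal is tied to the thread that created the child, which is why letting the runtime destroy that thread would also kill the children.

```go
package sketch

import (
	"os/exec"
	"syscall"
)

func startChild() (*exec.Cmd, error) {
	cmd := exec.Command("/usr/local/bin/some-child") // illustrative path
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// Deliver SIGKILL to the child when the thread that created it exits.
		Pdeathsig: syscall.SIGKILL,
	}
	return cmd, cmd.Start()
}
```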

With the following changes, I was able to start a podman rootless container: avagin@db868af

I used this wrapper for runsc to set custom options:

$ cat /usr/local/bin/runsc-podman
#!/bin/bash

/usr/local/bin/runsc --network host --ignore-cgroups --debug --debug-log '/tmp/runsc/runsc.log.%TEST%.%TIMESTAMP%.%COMMAND%' "$@"

And now, we are ready to run a container:

$ podman --runtime /usr/local/bin/runsc-podman  run  --security-opt=label=disable  docker.io/library/busybox echo Hello, World
Hello, World

The idea with LockOSThread is good, but we fork the gofer and sandbox processes with pdeathsig, which means they die when their parent thread exits. We can instead block the current system thread if one of the namespaces can't be restored.

Should the thread be locked in any case? I had to troubleshoot a similar error in the past: containers/storage#530

It turned out that the Go runtime can terminate threads at will, without any way to control it from the application (at least I didn't find one).

It turned out that the Go runtime can terminate threads at will, without any way to control it from the application (at least I didn't find one).

I have never seen the Go runtime destroy system threads, except in the case when a goroutine that is locked to a system thread exits.

The wrapper provided in #311 (comment) worked for me for running runsc in rootless podman, but it broke again recently (in 20230320.0 and also in the version before it; it worked two versions before that one). I'm getting this from runsc's debug log:

$ cat /tmp/runsc/runsc.log..20230323-101913.399926.create
I0323 10:19:13.400219  108938 main.go:222] ***************************
I0323 10:19:13.400376  108938 main.go:223] Args: [/usr/bin/runsc --network host --ignore-cgroups --debug-log /tmp/runsc/runsc.log.%TEST%.%TIMESTAMP%.%COMMAND% --systemd-cgroup create --bundle /home/fishy/.local/share/containers/storage/overlay-containers/71b85f92c1756e2f6e10da0ef005dbfb8584164a52e2c694ae1c051f678547f7/userdata --pid-file /run/user/1000/containers/overlay-containers/71b85f92c1756e2f6e10da0ef005dbfb8584164a52e2c694ae1c051f678547f7/userdata/pidfile 71b85f92c1756e2f6e10da0ef005dbfb8584164a52e2c694ae1c051f678547f7]
I0323 10:19:13.400483  108938 main.go:224] Version release-20230320.0
I0323 10:19:13.400544  108938 main.go:225] GOOS: linux
I0323 10:19:13.400603  108938 main.go:226] GOARCH: amd64
I0323 10:19:13.400664  108938 main.go:227] PID: 108938
I0323 10:19:13.400728  108938 main.go:228] UID: 0, GID: 0
I0323 10:19:13.400789  108938 main.go:229] Configuration:
I0323 10:19:13.400848  108938 main.go:230]              RootDir: /run/user/1000/runsc
I0323 10:19:13.400908  108938 main.go:231]              Platform: ptrace
I0323 10:19:13.400967  108938 main.go:232]              FileAccess: exclusive
I0323 10:19:13.401031  108938 main.go:233]              Directfs: false
I0323 10:19:13.401091  108938 main.go:235]              Overlay: Root=true, SubMounts=false, Medium="self"
I0323 10:19:13.401153  108938 main.go:236]              Network: host, logging: false
I0323 10:19:13.401217  108938 main.go:237]              Strace: false, max size: 1024, syscalls: 
I0323 10:19:13.401277  108938 main.go:238]              IOURING: false
I0323 10:19:13.401337  108938 main.go:239]              Debug: false
I0323 10:19:13.401397  108938 main.go:240]              Systemd: true
I0323 10:19:13.401456  108938 main.go:241] ***************************
W0323 10:19:13.404457  108938 specutils.go:123] noNewPrivileges ignored. PR_SET_NO_NEW_PRIVS is assumed to always be set.
I0323 10:19:13.406269  108938 namespace.go:217] Mapping host uid 1 to container uid 0 (size=1000)
I0323 10:19:13.406314  108938 namespace.go:217] Mapping host uid 0 to container uid 1000 (size=1)
I0323 10:19:13.406337  108938 namespace.go:217] Mapping host uid 1001 to container uid 1001 (size=64536)
I0323 10:19:13.406356  108938 namespace.go:225] Mapping host gid 1 to container gid 0 (size=1000)
I0323 10:19:13.406375  108938 namespace.go:225] Mapping host gid 0 to container gid 1000 (size=1)
I0323 10:19:13.406394  108938 namespace.go:225] Mapping host gid 1001 to container gid 1001 (size=64536)
I0323 10:19:13.410801  108938 container.go:1241] Gofer started, PID: 108945
I0323 10:19:13.411928  108938 sandbox.go:684] Control socket: ""
I0323 10:19:13.412063  108938 sandbox.go:720] Sandbox will be started in new mount, IPC and UTS namespaces
I0323 10:19:13.412105  108938 sandbox.go:730] Sandbox will be started in the current PID namespace
I0323 10:19:13.412139  108938 sandbox.go:741] Sandbox will be started in the container's network namespace: {Type:network Path:}
I0323 10:19:13.412281  108938 sandbox.go:761] Sandbox will be started in container's user namespace: {Type:user Path:}
I0323 10:19:13.412373  108938 namespace.go:217] Mapping host uid 1 to container uid 0 (size=1000)
I0323 10:19:13.412396  108938 namespace.go:217] Mapping host uid 0 to container uid 1000 (size=1)
I0323 10:19:13.412415  108938 namespace.go:217] Mapping host uid 1001 to container uid 1001 (size=64536)
I0323 10:19:13.412434  108938 namespace.go:225] Mapping host gid 1 to container gid 0 (size=1000)
I0323 10:19:13.412453  108938 namespace.go:225] Mapping host gid 0 to container gid 1000 (size=1)
I0323 10:19:13.412472  108938 namespace.go:225] Mapping host gid 1001 to container gid 1001 (size=64536)
I0323 10:19:13.412704  108938 sandbox.go:779] Sandbox will be started in minimal chroot
W0323 10:19:13.412813  108938 sandbox.go:1360] can't change an owner of /dev/stdin: chown /dev/stdin: operation not permitted
I0323 10:19:13.417543  108938 sandbox.go:978] Sandbox started, PID: 108950
W0323 10:19:13.538708  108938 util.go:64] FATAL ERROR: creating container: cannot create sandbox: cannot read client sync file: waiting for sandbox to start: EOF
W0323 10:19:13.539099  108938 main.go:267] Failure to execute command, err: 1

so I think there's a regression in a recent change?