istio/envoy container fails starting w/ version 20211026
emaildanwilson opened this issue
Description
Starting w/ version 20211026, the istio-proxy (envoy) container fails when attempting to access shared memory. Version 20211019 works without issue. If vfs2 is disabled, then newer versions work as well.
Environment Details
k8s version: v1.20.10-gke.1600
Istio enabled
x64
Envoy error:
[2021-12-08 22:45:06.230][14][critical][assert] [external/envoy/source/server/hot_restart_impl.cc:44] panic: cannot open shared memory region /envoy_shared_memory_0 check user permissions. Error: Permission denied
[2021-12-08 22:45:06.231][14][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:104] Caught Aborted, suspect faulting address 0x5390000000e
[2021-12-08 22:45:06.231][14][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2021-12-08 22:45:06.231][14][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:92] Envoy version: 172db4cfdf037bc8cf1613969f94c25b6198bc4f/1.17.1/Clean/RELEASE/BoringSSL
Based on the error details it seems to be failing on this syscall: https://github.com/envoyproxy/envoy/blob/main/source/server/hot_restart_impl.cc#L41
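For reference, here is a minimal standalone C sketch of the failing call (my own reconstruction for illustration, not envoy's actual code; the region name matches the default base id seen in the log above):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void) {
  /* envoy's hot restart creates a named shared memory region at startup;
   * the name below matches the one in the log above. */
  int fd = shm_open("/envoy_shared_memory_0", O_CREAT | O_RDWR,
                    S_IRUSR | S_IWUSR);
  if (fd < 0) {
    /* Under gVisor >= 20211026 with VFS2, running as a non-root user,
     * this is where EACCES (Permission denied) shows up. */
    fprintf(stderr, "shm_open: %s\n", strerror(errno));
    return 1;
  }
  puts("shm_open succeeded");
  shm_unlink("/envoy_shared_memory_0");
  return 0;
}
```

(Build with `cc -o shmtest shmtest.c -lrt`; running it as a non-root user inside an affected sandbox should hit the same Permission denied.)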
Steps to reproduce
Run envoy w/ hot restart enabled on gvisor >= 20211026 and with vfs2 enabled (default).
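For a repro outside of k8s, something along these lines should work (a hedged sketch: the bind mount imitates what k8s does to /dev/shm, the UID imitates istio's non-root istio-proxy user, and the image tag matches the Envoy version from the log):

```sh
# Serve /dev/shm through the gofer (p9) instead of tmpfs, as k8s does.
mkdir -p /tmp/fake-pod-shm
docker run --rm --runtime=runsc --user 1337 \
  -v /tmp/fake-pod-shm:/dev/shm \
  envoyproxy/envoy:v1.17.1
```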
runsc version
runsc version release-20211026.0
spec: 1.0.2
docker version (if using docker)
containerd --version
containerd github.com/containerd/containerd 1.4.3-0ubuntu0~20.04.1
uname
Linux hostname 5.4.0-1051-gke #54-Ubuntu SMP Thu Aug 5 18:52:13 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
kubectl (if using Kubernetes)
v1.20.10-gke.1600
repo state (if built from source)
No response
runsc debug logs (if available)
No response
VFS1 mounts /dev as overlayfs, which causes shm_open to avoid /dev/shm and create files under /tmp. VFS2, on the other hand, implements /dev more accurately and reports it as tmpfs (same as Linux). shm_open then attempts to create files under /dev/shm.
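To see this difference from inside a container, a small statfs(2) probe like the following (standalone, added here for illustration) prints the filesystem magic that libc keys off of:

```c
#include <stdio.h>
#include <sys/statfs.h>
#include <linux/magic.h> /* TMPFS_MAGIC, OVERLAYFS_SUPER_MAGIC, V9FS_MAGIC */

static void report(const char *path) {
  struct statfs sf;
  if (statfs(path, &sf) != 0) {
    perror(path);
    return;
  }
  printf("%-9s f_type=0x%08lx tmpfs=%s\n", path, (unsigned long)sf.f_type,
         sf.f_type == TMPFS_MAGIC ? "yes" : "no");
}

int main(void) {
  report("/dev");     /* overlayfs on VFS1; tmpfs on VFS2 */
  report("/dev/shm"); /* 9p when bind-mounted from the host under gVisor */
  report("/tmp");
  return 0;
}
```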
However, I'm not sure why envoy is getting access denied (I can run it locally). Would you be able to re-run it with debug logging enabled and share the logs?
Here are instructions to enable debug flags in containerd. Please enable the following flags: --debug --strace --debug-log=/tmp/sandbox-%ID%/. If you are using GKE Sandbox, you can do that by adding these lines to the end of /run/containerd/runsc/config.toml:
debug = "true"
debug-log = "/tmp/sandbox-%ID%/"
strace = "true"
Never mind, I repro'd locally and can take it from here...
This is what is happening... k8s adds /dev/shm as a bind mount inside each container, pointing to a shared location so that all containers inside a pod can see the same shared memory files. When libc tries to find what directory to use, it checks whether /dev/shm is using tmpfs or shm. Because runc lets the host filesystem type permeate inside the container, the check passes (it sees tmpfs). gVisor, on the other hand, mounts /dev/shm using p9, causing libc to skip it. libc ends up using /dev for the files. However, /dev permissions are set to 0755 and envoy runs as a non-root user, so it fails to create the shared file.
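Roughly, the selection logic in (older) glibc looks like the sketch below; this is my own simplified rendition, the real code lives in glibc's shm_open support and caches its result:

```c
#include <mntent.h>
#include <stdio.h>
#include <string.h>
#include <sys/statfs.h>
#include <linux/magic.h>

/* Simplified rendition of older glibc's shm directory lookup: prefer
 * /dev/shm when it is tmpfs, otherwise fall back to scanning
 * /proc/mounts for the first tmpfs/shm mount. Under gVisor VFS2 the
 * bind-mounted /dev/shm reports 9p, so the scan wins and can return
 * /dev (tmpfs, but mode 0755), which non-root envoy cannot write to. */
static const char *shm_dir(void) {
  struct statfs sf;
  if (statfs("/dev/shm", &sf) == 0 && sf.f_type == TMPFS_MAGIC)
    return "/dev/shm";

  FILE *mounts = setmntent("/proc/mounts", "r");
  if (mounts != NULL) {
    struct mntent *m;
    while ((m = getmntent(mounts)) != NULL) {
      if (strcmp(m->mnt_type, "tmpfs") == 0 ||
          strcmp(m->mnt_type, "shm") == 0) {
        const char *dir = strdup(m->mnt_dir); /* e.g. "/dev" in the sandbox */
        endmntent(mounts);
        return dir;
      }
    }
    endmntent(mounts);
  }
  return NULL;
}

int main(void) {
  const char *dir = shm_dir();
  printf("shm files would go under: %s\n", dir ? dir : "(none found)");
  return 0;
}
```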
VFS1 works because it ignores mount instructions for known mounts inside /dev, such as /dev/shm. So it always mounts /dev/shm inside the container as tmpfs. They are not shared among containers running inside the same pod, though.
One solution is to detect this case in the shim and set up shared volume annotations to tell gVisor to mount /dev/shm as a tmpfs volume that is shared between all containers (see the sketch below). There are a few things that we need to take into account, like someone actually trying to mount something under /dev/shm, but I have a few ideas to try...
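For context, gVisor already supports shared-volume annotations of the form dev.gvisor.spec.mount.<name>.{source,type,share}, so the shim-generated pod spec could carry something like the following (the volume name and truncated source path here are purely illustrative):

```json
"annotations": {
  "dev.gvisor.spec.mount.dev-shm.source": "/run/containerd/.../shm",
  "dev.gvisor.spec.mount.dev-shm.type": "tmpfs",
  "dev.gvisor.spec.mount.dev-shm.share": "pod"
}
```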
To make matters slightly more confusing, both CRI and containerd (I believe) add conflicting volume configurations for /dev/shm, one using tmpfs and another one using bind for the pause container. Luckily, the pause container doesn't do anything, so it doesn't really matter which /dev/shm gets mounted, but it does make the detection more complicated. Here is an example of the pause container's mount spec, for the record:
{
"destination": "/dev/shm",
"type": "tmpfs",
"source": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/66468fcd22ec18856379a1729c5e30db4b5470ac588bac5ba17c025e166907bd/shm",
"options": [
"nosuid",
"noexec",
"nodev",
"mode=1777",
"size=65536k"
]
},
{
"destination": "/dev/shm",
"type": "bind",
"source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/66468fcd22ec18856379a1729c5e30db4b5470ac588bac5ba17c025e166907bd/shm",
"options": [
"rbind",
"ro"
]
},