google / gvisor

Application Kernel for Containers

Home Page: https://gvisor.dev

istio/envoy container fails starting w/ version 20211026

emaildanwilson opened this issue

Description

Starting with version 20211026, the istio-proxy (envoy) container fails when it attempts to access shared memory. Version 20211019 works without issue. If vfs2 is disabled, newer versions work as well.

Environment Details
k8s version: v1.20.10-gke.1600
Istio enabled
x64

Envoy error:
[2021-12-08 22:45:06.230][14][critical][assert] [external/envoy/source/server/hot_restart_impl.cc:44] panic: cannot open shared memory region /envoy_shared_memory_0 check user permissions. Error: Permission denied
[2021-12-08 22:45:06.231][14][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:104] Caught Aborted, suspect faulting address 0x5390000000e
[2021-12-08 22:45:06.231][14][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2021-12-08 22:45:06.231][14][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:92] Envoy version: 172db4cfdf037bc8cf1613969f94c25b6198bc4f/1.17.1/Clean/RELEASE/BoringSSL

Based on the error details it seems to be failing on this syscall: https://github.com/envoyproxy/envoy/blob/main/source/server/hot_restart_impl.cc#L41
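
For reference, the failing path boils down to a plain shm_open(). Here is a minimal standalone sketch (not Envoy's exact code, just the same call with the region name taken from the panic message) that should hit the same error when run as a non-root user inside the sandbox:

    /* shm_repro.c -- minimal sketch of the failing call; build with `cc shm_repro.c`
     * (older glibc may also need -lrt). The region name comes from the Envoy panic
     * message above. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    int main(void) {
        int fd = shm_open("/envoy_shared_memory_0", O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
        if (fd == -1) {
            /* On the affected gVisor versions this reports "Permission denied". */
            fprintf(stderr, "shm_open failed: %s\n", strerror(errno));
            return 1;
        }
        shm_unlink("/envoy_shared_memory_0");
        return 0;
    }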

Steps to reproduce

Run envoy w/ hot restart enabled on gvisor >= 20211026 and with vfs2 enabled (default).

runsc version

runsc version release-20211026.0
spec: 1.0.2

docker version (if using docker)

containerd --version
containerd github.com/containerd/containerd 1.4.3-0ubuntu0~20.04.1

uname

Linux hostname 5.4.0-1051-gke #54-Ubuntu SMP Thu Aug 5 18:52:13 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

kubectl (if using Kubernetes)

v1.20.10-gke.1600

repo state (if built from source)

No response

runsc debug logs (if available)

No response

VFS1 mounts /dev as overlayfs, which causes shm_open to avoid /dev/shm and create files under /tmp. VFS2, on the other hand, implements /dev more accurately and reports it as tmpfs (same as Linux), so shm_open attempts to create files under /dev/shm. However, I'm not sure why envoy is getting access denied (I can run it locally). Would you be able to re-run it with debug logging enabled and share the logs?

Here are instructions to enable debug flags in containerd. Please enable the following flags: --debug, --strace, and --debug-log=/tmp/sandbox-%ID%/. If you are using GKE Sandbox, you can do that by adding these lines to the end of /run/containerd/runsc/config.toml:

  debug = "true"
  debug-log = "/tmp/sandbox-%ID%/"
  strace = "true"

Never mind, I repro'd locally and can take it from here...

This is what is happening... k8s adds /dev/shm as a bind mount inside each container pointing to a shared location, so all containers inside a pod can see the same shared memory files. When libc tries to find what directory to use, it checks whether /dev/shm is using tmpfs or shm. Because runc lets the host filesystem type permeate inside the container, the check passes (it sees tmpfs). gVisor, on the other hand, mounts /dev/shm using p9, causing libc to skip it, and libc ends up using /dev for the files. However, /dev permissions are set to 0755 and envoy runs as a non-root user, so it fails to create the shared file.
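
To illustrate the check described above (this is only a rough sketch of the idea; the real logic lives inside libc's shm_open() path), the decision hinges on the filesystem type reported for /dev/shm:

    /* shm_dir_check.c -- rough illustration (not libc's actual code) of the
     * "is /dev/shm really tmpfs/shm?" decision described above. */
    #include <linux/magic.h>   /* TMPFS_MAGIC */
    #include <stdio.h>
    #include <sys/vfs.h>       /* statfs */

    int main(void) {
        struct statfs fs;
        if (statfs("/dev/shm", &fs) == 0 && fs.f_type == TMPFS_MAGIC) {
            /* runc case: the host tmpfs type shows through, so shm_open()
             * places its files under /dev/shm. */
            puts("/dev/shm looks like tmpfs");
        } else {
            /* gVisor/VFS2 case: the bind-mounted /dev/shm is reported as p9,
             * so libc skips it and falls back to /dev, which is mode 0755. */
            puts("/dev/shm not recognized as tmpfs");
        }
        return 0;
    }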

VFS1 works because it ignores mount instructions for known mounts inside /dev, such as /dev/shm, so it always mounts /dev/shm inside the container as tmpfs. These mounts are not shared among containers running inside the same pod, though.

One solution is to detect this case in the shim and set up shared volume annotations to tell gVisor to mount /dev/shm as a tmpfs volume that is shared between all containers. There are a few things that we need to take into account, like someone actually trying to mount something under /dev/shm, but I have a few ideas to try...
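
For concreteness, the shim already passes volume information to runsc through dev.gvisor.spec.mount.* annotations on the sandbox spec; one possible shape for /dev/shm could look roughly like the following (the mount name, source path, and values here are purely illustrative, not a final design):

      "annotations": {
        "dev.gvisor.spec.mount.shm.share": "pod",
        "dev.gvisor.spec.mount.shm.type": "tmpfs",
        "dev.gvisor.spec.mount.shm.source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/<sandbox-id>/shm"
      }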

To make matters slightly more confusing, both cri and containerd (I believe) add conflicting volume configuration for /dev/shm, one using tmpfs and another one using bind for the pause container. Luckily, the pause container doesn't do anything, so it doesn't really matter which /dev/shm gets mounted, but it does make the detection more complicated. Here is an example of the pause container's mount spec, for the record:

    {
      "destination": "/dev/shm",
      "type": "tmpfs",
      "source": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/66468fcd22ec18856379a1729c5e30db4b5470ac588bac5ba17c025e166907bd/shm",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "mode=1777",
        "size=65536k"
      ]
    },
    {
      "destination": "/dev/shm",
      "type": "bind",
      "source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/66468fcd22ec18856379a1729c5e30db4b5470ac588bac5ba17c025e166907bd/shm",
      "options": [
        "rbind",
        "ro"
      ]
    },