google / gvisor

Application Kernel for Containers

Home Page: https://gvisor.dev

Unable to connect to sandbox-created Unix domain socket when waiting for connection using epoll_ctl

gmwiz opened this issue

Description

I'm trying to run MongoDB under gVisor so that MongoDB listens on a Unix domain socket and a connection to it is made from outside the sandbox. This simple setup does not work with any version of MongoDB I have tried so far, but works with pretty much every other application I've tried.
This reproduces easily on the latest gVisor version (20231218.0), both when invoked under podman and when run directly with runsc using the default OCI spec emitted by runsc spec, with the following block added under mounts:

        {
            "destination": "/shared",
            "type": "bind",
            "source": "shared",
            "options": [
                "rbind"
            ]
        }

One thing I've noticed that is specific to MongoDB is that the UDS is marked non-blocking and that the socket is wrapped by the ASIO library. I was able to reproduce the failure with a C program that mimics the behavior MongoDB exhibits under strace (see the following comment). Another thing to note is that connecting to the shared socket from within the sandbox works as expected.
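For completeness, the outside-the-sandbox side of the test can be a minimal C client like the sketch below (hypothetical and not part of the original setup; the mongo shell in the steps below plays the same role). With the bug present, connect succeeds but the subsequent read blocks forever, because the server never wakes from epoll_wait and never accepts.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(int argc, char *argv[]) {
    struct sockaddr_un addr = {0};
    char buf[256] = {0};
    int fd;

    if (argc < 2) {
        printf("<socket_path>\n");
        return 1;
    }

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, argv[1], sizeof(addr.sun_path) - 1);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    /* With the bug, the server never sees its listening socket become
     * readable, so it never accepts and this read blocks indefinitely. */
    if (read(fd, buf, sizeof(buf) - 1) > 0)
        printf("received: %s\n", buf);

    close(fd);
    return 0;
}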
Any help or guidance will be highly appreciated.

Steps to reproduce

sudo runsc install --runtime runsc-unix-debug -- \
  --host-uds=all \
  --debug \
  --debug-log=/tmp/runsc-debug/ \
  --strace \
  --log-packets
docker run --runtime=runsc-unix-debug -v $(pwd)/shared:/shared --rm -it --entrypoint '' mongo:4.4.26 mongod --bind_ip=/shared/mongo.sock
sudo mongo "mongodb://shared%2fmongo.sock" --shell --verbose

Which will hang indefinitely:

connecting to: mongodb://shared%2Fmongo.sock/?compressors=disabled&gssapiServiceName=mongodb
D1 NETWORK  [js] creating new connection to:shared/mongo.sock
D1 NETWORK  [js] connected to server shared/mongo.sock

runsc version

runsc version release-20231218.0
spec: 1.1.0-rc.1

docker version (if using docker)

Client: Docker Engine - Community
 Version:           24.0.7
 API version:       1.43
 Go version:        go1.20.10
 Git commit:        afdd53b
 Built:             Thu Oct 26 09:08:02 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.7
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.10
  Git commit:       311b9ff
  Built:            Thu Oct 26 09:08:02 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.26
  GitCommit:        3dd1e886e55dd695541fdcd67420c2888645a495
 runc:
  Version:          1.1.10
  GitCommit:        v1.1.10-0-g18a0cb0
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

uname

6.6.8-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.6.8-1 (2023-12-22) x86_64 GNU/Linux

runsc debug logs (if available)

runsc-debug.log

I was able to narrow this down further and create a simple C reproducer that demonstrates the issue. MongoDB marks its Unix domain sockets non-blocking and then uses epoll_wait to wait for a new client before calling accept. Crucially, it registers the socket with epoll_ctl immediately after creating it, before calling bind on it. When the calls are reordered so that epoll_ctl comes after bind, everything works perfectly. I'm not entirely certain which component in the gVisor code base is responsible for this, but I'll try and take a look.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/epoll.h>


int main(int argc, char *argv[]) {
    int sock_fd = -1;
    int ret = -1;
    int client_fd = -1;
    int epoll_fd = -1;
    struct epoll_event epoll_read = {
        .events = EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP|EPOLLET,
    };
    struct sockaddr_un addr = {0};
    char *sock_path = NULL;
    char *msg = NULL;

    if (argc < 3) {
        printf("<socket_path> <msg>\n");
        return 1;
    }

    sock_path = argv[1];
    msg = argv[2];

    epoll_fd = epoll_create1(EPOLL_CLOEXEC);
    if (epoll_fd < 0) {
        perror("epoll_create1");
        return 1;
    }

    printf("socket\n");
    sock_fd = socket(AF_UNIX, SOCK_STREAM|SOCK_NONBLOCK, 0);
    if (sock_fd < 0) {
        perror("socket");
        return 1;
    }

    /* Key detail: the socket is registered with epoll *before* bind(),
     * mirroring what MongoDB does. This is what triggers the hang. */
    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, sock_fd, &epoll_read) < 0) {
        perror("epoll_ctl");
        return 1;
    }

    printf("bind\n");
    unlink(sock_path);
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, sock_path);
    if (bind(sock_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    printf("listen\n");
    if (listen(sock_fd, 128) < 0) {
        perror("listen");
        return 1;
    }

    /* Wait for a client to connect; under gVisor this never wakes up. */
    do {
        struct epoll_event ev[128] = {0};
        ret = epoll_wait(epoll_fd, ev, 128, -1);
    } while (ret < 0 && errno == EINTR);

    if (ret <= 0) {
        perror("epoll_wait");
        return 1;
    }

    printf("accept\n");
    client_fd = accept(sock_fd, NULL, NULL);
    if (client_fd < 0) {
        perror("accept");
        return 1;
    }

    printf("send: %s\n", msg);
    send(client_fd, msg, strlen(msg), 0);

    printf("done!\n");
    return 0;
}
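For comparison, here is a sketch of the reordering that works. It reuses the declarations (sock_fd, epoll_fd, epoll_read, addr, sock_path) from the reproducer above, with error handling elided; the only change is where epoll_ctl is called:

/* Same reproducer, but with epoll_ctl moved after bind()/listen().
 * With this ordering, epoll_wait wakes up when a client connects. */
sock_fd = socket(AF_UNIX, SOCK_STREAM|SOCK_NONBLOCK, 0);

unlink(sock_path);
addr.sun_family = AF_UNIX;
strcpy(addr.sun_path, sock_path);
bind(sock_fd, (struct sockaddr *)&addr, sizeof(addr));
listen(sock_fd, 128);

/* Registering here, after bind(), avoids the hang. */
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, sock_fd, &epoll_read);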

Thanks for the bug report, and for narrowing it down with the reproducer. That was really helpful.

I think I have fixed the issue in #9849. Let's see if the e2e tests concur.

Hi @ayushr2
Thank you so much for the super fast fix!!! That's amazing!
Do you happen to have an estimate of when this patch will be included in the next official release?
TIA

Hoping to have a release on Jan 3rd. cc @manninglucas

Ah, we released the Monday candidate (which doesn't contain the fix). You can expect a release from the 01/08 candidate next Wednesday.