moby / moby

The Moby Project - a collaborative project for the container ecosystem to assemble container-based systems

Home Page: https://mobyproject.org/

Evented Process Monitor

crosbymichael opened this issue · comments

TLDR:

Pros:

  • Evented lock free code
  • 1 goroutine for all containers
  • Generic zombie fighting abilities for free

Cons:

  • none

Currently Docker uses exec.Cmd and cmd.Wait() inside a goroutine to block on a container's process until it finishes. After a container dies, two things can happen. One, the container's process is restarted, depending on whether the user requested that the container always restart. Two, we tear down the container, update its status with the exit code, and release any resources it required.

Writing a process monitor like this is not very efficient and docker is unable to handle SIGCHLD events to reap zombies that were not direct children of the daemon's process.

It also means that with one goroutine per container, 1000 containers means 1000 blocking goroutines in the daemon. Booooo.

We can do better. The proper way is to move to an evented system for monitoring when child processes change state. This can be handled via SIGCHLD. A process can set up a signal handler for SIGCHLD events, and when the status of a child process changes this signal is sent to the handler. We can use this to extract the pid and exit status and make decisions about how to handle the event.

Using an evented system like this, we can reduce the number of goroutines to 1 for N containers and also reduce the number of locks required to handle the previous level of concurrency. Running one container requires 1 goroutine; running 10k containers requires 1 goroutine. Win. This model also allows us to reap zombies, because zombies are bad m'kay, in the daemon process that are not direct children, i.e. processes that are not a container's PID 1.

Sample code of what the process monitor would look like follows:

package main

import (
    "os"
    "os/exec"
    "os/signal"
    "sync"
    "syscall"
    "time"

    "github.com/Sirupsen/logrus"
)

var pidsICareAbout map[int]string

func handleEvents(signals chan os.Signal, group *sync.WaitGroup) {
    defer group.Done()
    for sig := range signals {
        switch sig {
        case syscall.SIGTERM, syscall.SIGINT:
            // return cuz user said so
            return
        case syscall.SIGCHLD:
            var (
                status syscall.WaitStatus
                usage  syscall.Rusage
            )
            pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, &usage)
            if err != nil {
                logrus.Error(err)
            }
            logrus.WithField("pid", pid).Info("process status changed")
            if _, ok := pidsICareAbout[pid]; ok {
                logrus.Infof("i care about %d", pid)
                // we can modify the map without a lock because we have a single
                // goroutine and handler for signals.
                delete(pidsICareAbout, pid)
                if len(pidsICareAbout) == 0 {
                    // return after everything is dead
                    return
                }
            } else {
                logrus.Infof("---> i don't care about %d", pid)
            }
        }
    }
}

func runSomething() error {
    // just add some delay for demo
    time.Sleep(1 * time.Second)
    cmd := exec.Command("sh", "-c", "sleep 5")
    // cmd.Start() is non blocking
    if err := cmd.Start(); err != nil {
        return err
    }
    // get the pid because I care about this one.
    logrus.WithField("pid", cmd.Process.Pid).Info("spawned new process")
    pidsICareAbout[cmd.Process.Pid] = "sleep 5"
    return nil
}

func randomFork() error {
    syscall.ForkLock.Lock()
    pid, _, err := syscall.RawSyscall(syscall.SYS_FORK, 0, 0, 0)
    syscall.ForkLock.Unlock()
    if err != 0 {
        return err
    }
    if pid == 0 {
        logrus.Info("i'm on a boat")
        os.Exit(0)
    } else {
        logrus.Infof("forked off %d", pid)
    }
    return nil
}

func main() {
    signals := make(chan os.Signal, 1024)
    signal.Notify(signals, syscall.SIGCHLD, syscall.SIGTERM, syscall.SIGINT)
    pidsICareAbout = make(map[int]string)
    group := &sync.WaitGroup{}
    group.Add(1)
    go handleEvents(signals, group)
    for i := 0; i < 5; i++ {
        if err := runSomething(); err != nil {
            logrus.Fatal(err)
        }
    }
    // fork off a random process that we don't care about but make sure the
    // signal handler reaps it when it dies.
    if err := randomFork(); err != nil {
        logrus.Error(err)
    }
    logrus.Info("waiting on processes to finish")
    group.Wait()
    logrus.Info("all processes are done, exiting...")
}

We should build this in a generic way because we want this monitor with restart capabilities to be available to any type of process the daemon can spawn.

Multiple child processes dying quickly could generate a single SIGCHLD, so you would have to call syscall.Wait4 in a loop until you get -1, unless golang is changing signal semantics somehow.

@mrunalp no, you are correct.

I do declare, that's awesome.

+1

@crosbymichael I was thinking of a design where this would be a drop-in replacement for exec.Command. Since this impacts all code that forks processes, quite a bit will have to change. Additionally, no new code should ever use exec.Command in Docker. By having a compatible API we could easily switch, then essentially blacklist exec.Command and scan the code in CI for references to it.

Obviously this doesn't buy any benefits in terms of reducing goroutines. The new Command.Wait() will just communicate over a channel with the central signal-handling loop. To address the goroutines I'm thinking we add Command.OnExit(func(int)) and Command.StartFor(handler). OnExit() will allow one to register a callback for that specific process. StartFor(handler) would start a process and then, on exit, notify a common handler that could be registered through some API like ProcessManager.AddHandler(name string, handler ProcessHandler). I imagine libcontainer would have one generic handler for all containers.
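
For illustration, a rough sketch of what that API shape could look like; the package name procmon and everything in it are hypothetical, nothing here exists in Docker today:

package procmon

// ProcessHandler is the common handler that StartFor routes exits to,
// e.g. one generic handler that libcontainer registers for all containers.
type ProcessHandler interface {
    Exited(pid, status int)
}

// Command mirrors the parts of exec.Cmd that Docker uses so it could act as
// a drop-in replacement; Wait blocks on a channel fed by the central
// SIGCHLD loop instead of calling wait4 itself.
type Command struct {
    exitc chan int
}

// Wait keeps the idiomatic blocking style for callers that want it.
func (c *Command) Wait() int { return <-c.exitc }

// OnExit registers a callback invoked from the signal-handling goroutine
// when this specific process exits.
func (c *Command) OnExit(fn func(status int)) { /* register fn with the monitor */ }

// StartFor starts the process and, on exit, notifies the shared handler.
func (c *Command) StartFor(h ProcessHandler) error { /* start + register h */ return nil }

// ProcessManager owns the single SIGCHLD loop and the handler registry.
type ProcessManager struct{}

// AddHandler registers a named handler for a class of processes.
func (m *ProcessManager) AddHandler(name string, handler ProcessHandler) {}

The point is just the shape: Wait() stays for compatibility, while OnExit() and StartFor() avoid the per-process goroutine.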

I'd just like to raise the canary of caution 🐦 with respect to introducing a non-idiomatic approach to such a common task. Callbacks seem quite a strange suggestion in go. But then, in a way, avoiding a few thousand goroutines also seems a bit strange. It's usually just said, "start a goroutine, they're cheap!". Have large numbers of goroutines proven themselves to be a problem?

The approach I described still allows for a Go-idiomatic Wait(), but it will have to live in a custom package rather than exec. The number of goroutines isn't a huge issue; the root issue has to do with the calling style. With the current approach it's really hard to cleanly manage zombies. It's a really common pattern in supervisor-style code to have an event-driven model for handling processes. Because of some things in the guts of golang, we can't combine an event-driven model with other goroutines calling Command.Wait().

Generic zombie fighting abilities for free

This is a big deal. It's one of the things that are hurting people in production. I don't think it should be a requirement for a large-scale Docker deployment to have to care about init.

A lot of people currently don't write custom init processes for their containers because they don't realize what responsibilities an init actually has (or that they even need one). It's fine if you're always running a single process inside your container, but most applications fork+exec at some point (e.g. to shell out) and don't handle the case where the parent of such a process dies before the child does. In the past our init didn't reap children, which means you can't terminate the container completely since the zombie sticks around in the kernel's data structures. This can lead to incredibly obscure bugs:

The kernel maintains a minimal set of information about the zombie process (PID, termination status, resource usage information) in order to allow the parent to later perform a wait to obtain information about the child. As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes.

Meaning: if someone is running a Docker deployment with high container churn (which is common), e.g. an application with a master-worker model where the workers get killed if they take too long, and those workers are wait()ing on e.g. a shelled-out command when they are killed, that'll leak. Eventually the process table will fill up; I'm sure this is already happening in production deployments and people have no clue what's going on.
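
As a minimal, hypothetical illustration of how that leak happens (not Docker code): a process that shells out and never gets to wait() leaves a zombie behind until something up the tree reaps it.

package main

import (
    "os/exec"
    "time"
)

func main() {
    // Shell out, as many applications do...
    cmd := exec.Command("true")
    if err := cmd.Start(); err != nil {
        panic(err)
    }
    // ...but never call cmd.Wait(), e.g. because this process is killed
    // before it gets there. Once "true" exits it sits in the kernel process
    // table as a zombie ("defunct" in ps output) until this process waits on
    // it, or until an ancestor that actually reaps (a real init or a
    // subreaper) collects it.
    time.Sleep(time.Minute) // run `ps -ef | grep defunct` in another shell
}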

I don't think it should be required for Docker users, even in very large deployments, to have this much of an understanding of the PID namespace (it's documented on the man page) and their container layouts. Docker should definitely be reaping children to avoid this problem.

Concerns

What happens when you have multiple processes trying to wait4(2) on the same process? The custom init people have inside their containers may be updating state (internally or externally) when a process quits. AFAIK both processes receive a SIGCHLD and get the process returned by wait(2), but it's something to write a test for.

Another question I'd have is whether the kernel assumes anything else about init. As far as I know it doesn't, but kernel developers develop with the reaper living inside the PID namespace in mind, as it's not common for processes to wait(2) across immediate parent/child boundaries.

Observations from inits in the wild

I can't think of any other responsibilities an init should have that can't be covered by the planned plugin/extension API. This means people do not have to care about init processes anymore. Most custom inits I've seen in the wild do things such as:

  • Secret decryption. This should eventually be solved by Docker, possibly with a certificate-based model or through a local daemon where containers request secrets.
  • Refresh TTL for service discovery. This can be solved with the API and a side-kick container.
  • Ensure the ENV is right. Many deployments rely on some environment variables being set for containers in their environment to work (and have wrapper scripts around docker run to make it convenient). This could be solved in Docker, however, I think in most cases this is rather hacky and not something that should be encouraged.
  • Executing from a Procfile. I've seen people in the wild, including us, who have a key-value file inside their application code (in JSON, YAML, or whatever) and the init translates the key (e.g. web) to the command. However, I don't think it's a big deal to get rid of it.
  • Proxying signals. The init needs to proxy signals down to the child. This is no longer necessary if the init does not need to be custom, as PID 1 will be the main process of the container. (A rough sketch of this boilerplate follows this list.)
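
For illustration only, this is roughly the signal-proxying boilerplate a custom init ends up carrying (a hypothetical sketch, not taken from any real init):

package main

import (
    "os"
    "os/exec"
    "os/signal"
    "syscall"
)

func main() {
    // The real workload is argv[1:]; this init only sits in front of it
    // and forwards signals.
    cmd := exec.Command(os.Args[1], os.Args[2:]...)
    cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
    if err := cmd.Start(); err != nil {
        panic(err)
    }

    sigc := make(chan os.Signal, 128)
    signal.Notify(sigc) // relay every signal we are allowed to catch
    go func() {
        for sig := range sigc {
            // SIGCHLD is about the child, not for the child.
            if sig == syscall.SIGCHLD {
                continue
            }
            cmd.Process.Signal(sig)
        }
    }()

    cmd.Wait()
}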

Mostly I think all this functionality is dangerous to encourage because if any of it blocks (particularly easy in service discovery code) the container will not shut down. Similarly if it crashes, the entire container crashes. Docker should not encourage people to write custom inits.

The reason this is currently happening is that larger deployments need an init anyway with a little bit of custom functionality, and it becomes a slippery slope to just shove more in there. Most of this could be implemented using the Docker API, which IMO should be the encouraged way since it's much easier. It's currently somewhat painful, because it does require running a sidekick process on every host to listen to the API and report on it (error reporting, service discovery, ..).

To sum up: Let's encourage people to not write custom inits. It's dangerous. This is a major step in that direction.

I can open a separate issue on this matter since it's somewhat orthogonal to this discussion.

The missing piece from my description is setting the subreaper option on the docker daemon to support this. The full docs can be found in the prctl(2) man page, but here is a quick description.

PR_SET_CHILD_SUBREAPER (since Linux 3.4)
              If arg2 is nonzero, set the "child subreaper" attribute of the
              calling process; if arg2 is zero, unset the attribute.  When a
              process is marked as a child subreaper, all of the children
              that it creates, and their descendants, will be marked as
              having a subreaper.  In effect, a subreaper fulfills the role
              of init(1) for its descendant processes.  Upon termination of
              a process that is orphaned (i.e., its immediate parent has
              already terminated) and marked as having a subreaper, the
              nearest still living ancestor subreaper will receive a SIGCHLD
              signal and be able to wait(2) on the process to discover its
              termination status.
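
A minimal sketch of what setting that flag could look like from the daemon's side, assuming Linux >= 3.4; the PR_SET_CHILD_SUBREAPER value is defined locally so the sketch does not depend on the syscall package exporting it:

package main

import (
    "syscall"

    "github.com/Sirupsen/logrus"
)

// Value of PR_SET_CHILD_SUBREAPER from <linux/prctl.h> (Linux >= 3.4),
// defined here so we don't rely on the syscall package exporting it.
const prSetChildSubreaper = 36

func setSubreaper() error {
    // prctl(PR_SET_CHILD_SUBREAPER, 1): orphaned descendants get reparented
    // to us instead of to pid 1, so our SIGCHLD loop can reap them.
    if _, _, errno := syscall.RawSyscall(syscall.SYS_PRCTL, prSetChildSubreaper, 1, 0); errno != 0 {
        return errno
    }
    return nil
}

func main() {
    if err := setSubreaper(); err != nil {
        logrus.Fatal(err)
    }
    // ... start the evented monitor from the sample above ...
}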

Had a good discussion with @crosbymichael on this on IRC; agreed that setting PR_SET_CHILD_SUBREAPER does exactly what I want. This needs to come with a docs change clarifying Docker and init processes, as well as recommended patterns.

Yeah, won't be much of a reaper without it unless running as pid 1.

@ibuildthecloud I haven't thought about it much, but maybe we could model this after http.Handler, with a router/muxer in the middle for handling exits.

Don't stress over the first implementation; we can get something working and then iterate on the API.
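
To make the http.Handler analogy a bit more concrete, here is a hypothetical sketch of an exit mux the single SIGCHLD loop could dispatch into (all names made up):

package procmux

import "sync"

// ExitHandler is the process-exit analogue of http.Handler.
type ExitHandler interface {
    HandleExit(pid, status int)
}

// ExitHandlerFunc lets plain functions be used as handlers, like http.HandlerFunc.
type ExitHandlerFunc func(pid, status int)

func (f ExitHandlerFunc) HandleExit(pid, status int) { f(pid, status) }

// ExitMux routes an exit event to the handler registered for that pid,
// falling back to a generic handler (e.g. one handler for all containers).
type ExitMux struct {
    mu       sync.Mutex // registration can happen outside the signal goroutine
    handlers map[int]ExitHandler
    fallback ExitHandler
}

func NewExitMux(fallback ExitHandler) *ExitMux {
    return &ExitMux{handlers: make(map[int]ExitHandler), fallback: fallback}
}

// Handle registers a handler for a specific pid.
func (m *ExitMux) Handle(pid int, h ExitHandler) {
    m.mu.Lock()
    m.handlers[pid] = h
    m.mu.Unlock()
}

// HandleExit is called by the single SIGCHLD goroutine after wait4 returns.
func (m *ExitMux) HandleExit(pid, status int) {
    m.mu.Lock()
    h, ok := m.handlers[pid]
    delete(m.handlers, pid)
    m.mu.Unlock()
    if !ok {
        h = m.fallback
    }
    if h != nil {
        h.HandleExit(pid, status)
    }
}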

From what @mrunalp said about the loop for reaping fast-exiting children, I experimented and you have to handle ECHILD like below:

func reapChildren() (exits []exit, err error) {
        for {
                var (
                        ws  syscall.WaitStatus
                        rus syscall.Rusage
                )
                pid, err := syscall.Wait4(-1, &ws, syscall.WNOHANG, &rus)
                if err != nil {
                        if err == syscall.ECHILD {
                                return exits, nil
                        }
                        return nil, err
                }
                if pid <= 0 {
                        // WNOHANG returns 0 when children exist but none have
                        // exited yet; stop instead of spinning.
                        return exits, nil
                }
                exits = append(exits, exit{
                        pid:    pid,
                        status: ws.ExitStatus(),
                })
        }
}

@crosbymichael Are you working on an implementation of this? If you're not I can start on something.

No i'm not working on it.

WANT!!!!

Spammy +1 :P

No fair, why does @crosbymichael get all the fun PRs :) j/k looking forward to the implementation.

@ibuildthecloud I think he meant he's not working on it :)

@sirupsen lol, man I suck at reading. I'll starting working on this.

Added to 1.9.0 coz #yolo

EDIT: we're still looking for volunteers to help fix this :) /cc @sirupsen @ibuildthecloud

@tiborvass should this be added to ROADMAP.md?

@thaJeztah you're late to the game :)

@tiborvass @crosbymichael I was under the impression that runc was going to address this as it already has the reaping code in it. Is that not the case? Otherwise I'll try to take this on for 1.9

@ibuildthecloud some of the reaping issues will be fixed, but using an evented process monitor is a much more lightweight and robust way to monitor processes.

Now that it exists, would it be possible to just use tini, or are there still some issues that need to be resolved, and, if so, what are they?

@tiborvass still looking for volunteers?

@mynameismevin yes please :)

I'll work on coming up with something based off the discussion that's been had thus far.

Has progress stalled on this issue? I noticed that @mynameismevin's PR was not merged.

Just to be clear, this would solve the current problem of processes having PID 1 in a container and having to reap grand-children, right?

@edevil possibly this will now be implemented through containerd; https://github.com/docker/containerd

@crosbymichael am I right?

@edevil my PR was really just a summation of this conversation, and I did nothing outside of just providing some code in a package for others to use in solving this problem. I didn't fix the issue, just assisted with moving it closer towards resolution.

Also, I didn't test it when it came to reaping grandchildren, however I did test abandoned child processes, so theoretically it could clean up abandoned grand-children so long as the child cleaned up the grand-child. No promises on that actually working as it wasn't tested.

This is directly linked to #20361, and is on our 1.11 roadmap.

AFAIK containerd doesn't solve the zombie problem. This issue was opened for evented process monitoring, which I assume has been fixed with containerd, but the sub-discussions about zombie killing have not been addressed. @sirupsen I would probably open a new issue. Can somebody correct me if I'm wrong, because I'm not seeing zombie killing working in 1.11.0-rc3. It's my understanding that process subreapers don't work across a PID namespace boundary. So if your child creates a new PID namespace and you are set as the subreaper of that child, processes will still be reparented to PID 1 of the child's PID namespace and not to you (the subreaper).

@ibuildthecloud You are correct, sadly :(

So using an init system (like tini) is still the way to go to solve the zombie processes, right?