moby / moby

The Moby Project - a collaborative project for the container ecosystem to assemble container-based systems

Home Page: https://mobyproject.org/


Propose better way to run docker from a unit file

ibuildthecloud opened this issue · comments

Systemd does a lot of stuff. Docker does a lot of stuff. That stuff may or may not overlap. I don't really care; I just need to solve one very specific problem: a sane way to launch Docker containers in a systemd environment as a system service. As it stands today, the only way I know of is to do docker start -a or docker run ... without -d. Then dockerd launches the container in the background and systemd essentially monitors the docker client. There are two problems with this. First, whether or not the docker client is running says very little about whether the actual container is running. Second, I'm left with a rather large docker run process in memory that's not providing much value except to stream stdout/stderr to journald.

So I hacked up the script below to make things better, or really just to see whether it was possible to make things better, since the script is just a dirty hack. You don't really need to read the script; just skip down and I'll explain what it does.

#!/bin/bash
set -e

# Start the container (arguments are passed straight through to docker) and
# look up the PID of its init process.
ID=$(/usr/bin/docker "$@")
PID=$(docker inspect -f '{{.State.Pid}}' "$ID")

declare -A SRC DEST

# Record the container's current cgroup path for each subsystem.
for line in $(grep slice /proc/$PID/cgroup); do
        IFS=: read _ NAME LOC <<< "$line"
        SRC[${NAME##name=}]=$LOC
done

# Record this script's (i.e. the systemd unit's) cgroup path for each subsystem.
for line in $(grep slice /proc/$$/cgroup); do
        IFS=: read _ NAME LOC <<< "$line"
        DEST[${NAME##name=}]=$LOC
done

# For each subsystem, create a child cgroup under the unit's cgroup and move
# the container's processes into it.
for type in ${!SRC[@]}; do
        from=/sys/fs/cgroup/${type}${SRC[$type]}
        to=/sys/fs/cgroup/$type/"${DEST[$type]}"/$(basename "${SRC[$type]}")

        echo "$from" "=>" "$to"
        mkdir -p "$to"
        for p in $(<"$from"/cgroup.procs); do
                echo "$p" > "$to"/cgroup.procs
        done
done

# Hand the container's PID to systemd via the unit's PIDFile.
echo "$PID" > /var/run/test.pid

Then I wrote the following unit file:

[Unit]
Description=My Service
After=docker.service
Requires=docker.service

[Service]
ExecStart=/opt/bin/docker-wrapper.sh run -d busybox /bin/sh -c "while true; do echo Hello World; sleep 1; done"
Type=forking
PIDFile=/var/run/test.pid

[Install]
WantedBy=multi-user.target

So what this does (and I know it's a hack, but I wanted to see if my proposal has any chance of working) is: after the container is launched, I look up the PID of the container and all of its cgroups. I then create child cgroups of the systemd cgroups and move the PIDs from the original cgroups to the systemd child cgroups. After that is done, I write the PID of the container to a file. I end up with the systemd cgroup as the parent and a child cgroup under it, looking something like this:

  ├─test.service
  │ └─docker-8a0ff7503e0fca4f44d48f76a24cbcae82079818e3ad4d0d707ccf5765698184.scope
  │   ├─19103 /bin/sh -c while true; do echo Hello World; sleep 1; done
  │   └─19169 sleep 1

Also, since I told systemd to use a PIDFile, systemd is monitoring PID 1 of the container because I wrote it to a file. So now if I do either docker stop or systemctl stop, things just work (at least they seem to), and I don't have a useless docker client hanging around in memory. Now, if you look at the script, you'll notice I'm just moving the PIDs, not the settings. So yes, it's a total hack that defeats the purpose of the original cgroups, but that's not the point right now.

Here's what I propose to make systemd and docker integration a tad bit better. When you want to run docker in a systemd unit, you run docker run/start --yo-dawg-use-my-cgroups-as-your-parent ..., which reads the current /proc/$$/cgroup of the client and passes it to dockerd. Dockerd then just creates its cgroups as children of the cgroups passed in, if the subsystem exists. I think this means we could remove the systemd cgroup code and just use the cgroupfs-based code (though docker will still have to write to the name=systemd hierarchy). Systemd can then set up the parent cgroups however it wishes, and Docker can set up the child cgroups however it wishes.
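A minimal sketch of what the client side of that (hypothetical) flag would do; this just shows where the parent cgroup paths would come from:

#!/bin/bash
# Sketch: collect this process's (i.e. the unit's) cgroup path for each
# subsystem from /proc/$$/cgroup, so they could be handed to dockerd as the
# parents for the container's cgroups.
while IFS=: read -r _ subsys path; do
        echo "parent for ${subsys:-unified}: $path"
done < /proc/$$/cgroup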

Is this the best solution? Probably not. But it seems a lot better than what we have today and it solves a current pain point.

Is this just plain stupid, or has it already been thought of and shot down?

From the [significant] discussion around systemd unit files in yesterday's contributor meeting (https://botbot.me/freenode/docker-dev/msg/17771621/): the example unit file is @crosbymichael's https://github.com/crosbymichael/.dotfiles/blob/master/systemd/redis.service

FYI, for anybody who stumbles upon this issue. I created https://github.com/ibuildthecloud/systemd-docker as an attempt to address the issues between docker and systemd.

Any new thoughts/movement on this?

I have been using @ibuildthecloud's systemd-docker and the combo is killer. It would be better if the issues it addresses were dealt with by docker itself.

This issue is hardly specific to systemd. It affects any environment in which someone wants to reliably start and monitor a container, which would include just about any non-SysV init system (systemd, upstart, runit, daemontools, launchd).

A simpler solution than using @ibuildthecloud's systemd-docker is to start the docker container in the background in ExecStartPre via docker run -d container or docker start container, and then use ExecStart=/usr/bin/docker logs -f container. This way systemd, before starting any dependent units, waits until docker run -d or docker start returns, which happens only once the container has started. The logs command then sends the initial startup logs to the journal and continues to do so as new logs arrive, until the container stops.

With this approach one also needs to put -/usr/bin/docker stop container into both ExecStop and ExecStopPost. The latter ensures that if /usr/bin/docker logs dies before the container terminates, systemd still stops the container. Note that using only ExecStopPost without ExecStop means you will not get the termination logs into the journal, since systemctl stop will kill the logs command before ExecStopPost stops the container.
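Put together, a unit following this pattern might look like the following sketch (container1 and some-image are placeholders):

[Unit]
Description=My Container
After=docker.service
Requires=docker.service

[Service]
ExecStartPre=/usr/bin/docker run -d --name container1 some-image
ExecStart=/usr/bin/docker logs -f container1
# Stop in both places: ExecStop captures the termination logs, ExecStopPost
# still stops the container if the logs client died first.
ExecStop=-/usr/bin/docker stop container1
ExecStopPost=-/usr/bin/docker stop container1

[Install]
WantedBy=multi-user.target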

Not that it's going to be the init system of the future, but Upstart worked quite well for controlling docker containers, for the most part, with a simple config file per service that did everything you'd want.

It's still an issue if I understand correctly.

Now that dockerd can be restarted with the --live-restore option, if you have started containers with systemd, the docker client stops because the daemon is no longer available while dockerd restarts.

Even in @ibukanov's example above, if the docker daemon restarts, the docker client will fail to connect to the daemon to get the logs and will cause the systemd unit to fail. Sure, it might restart, but my goal is to have the container continue running while being managed by systemd. Yes, the unit should require the docker daemon for startup, but once it's running, I want systemd to track the pid of the process launched by the container.

If I have Restart=no set, the container will keep running, but logging of the docker client to journalctl will stop and the systemd unit will be left in a failed state. If the unit is set to Restart=on-failure, then the unit will restart and either fail to start because the container is already running, or you force stop/rm old containers to prevent start-up problems using ExecStartPre=-/usr/bin/docker rm -f container1, as sketched below.
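In unit-file terms, that workaround looks something like this (a sketch; container1 and some-image are placeholders):

[Service]
Restart=on-failure
# The leading "-" tells systemd to ignore failure here, e.g. when there is
# no old container to remove.
ExecStartPre=-/usr/bin/docker rm -f container1
ExecStart=/usr/bin/docker run --name container1 some-image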

This problem with systemd effectively stops you from making any decent use of the --live-restore option when managing containers with unit files. I've tried looking at --cgroup-parent and using the systemd cgroup driver, but I have yet to see how this solves my problem. Sure, systemd is aware of the cgroup, but it's tracking not the pid of the container but the pid of the docker client that was used to launch it.

I am unsure of my understanding in general around this behaviour, and there may be some way of structuring ExecStartPre, ExecStart, ExecStop and ExecStopPost to get the desired result. I'm going to read through @ibuildthecloud's solution and see if I can come up with something less convoluted, but as far as I can see the issue still stands.

@berglh At this point I long ago gave up trying to integrate docker with systemd. It just does not work, due to very different approaches. So with docker I stick with its native commands and use no unit files. In practice, any dependency problem between containers can be solved with a shell script running in the container that just waits until the condition is met before starting the main application. Surprisingly, this makes the whole setup much more robust, and I have no problems with docker daemon restarts, as it nicely restarts all my containers.

If systemd integration and unit files are a must, consider using runc, not docker itself, to run your containers.

@ibukanov I'll check out runc for sure, but I'm currently using fleet and etcd on Oracle Enterprise Linux. Considering fleet is no longer going to be officially supported by CoreOS, maybe I'm better off moving to Kubernetes or OpenShift. The thing is, fleet is such a simple and straightforward concept for scheduling unit files that it's been attractive for the particular cluster I'm managing. Regardless, I'm probably going to have to move off fleet in the long run.

@ibukanov rkt can run docker images as-is, and works well with systemd.

Note that rkt currently requires that images be pushed to a registry, so running local images isn't going to work out of the box. See rkt/rkt#2392.

It's April 2018. Is there any best practice to start containerized services with systemd?

If not, what again are the benefits of starting a docker container as:

ExecStartPre=/usr/bin/docker run -d --name container1 some-image
ExecStart=/usr/bin/docker logs -f container1

instead of

ExecStart=/usr/bin/docker run --name container1 some-image

?

@aholbreich The best way is to use rkt; sorry, but docker does not play well with systemd. Unfortunately, rkt is not popular.

@aholbreich The former works well with systemd; the latter does not. In order to use a docker run command as your ExecStart=, you have to use a wrapper like ibuildthecloud/systemd-docker. Either way, if you use systemd, you can't use --live-restore, as @berglh documented above.
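For reference, a unit using that wrapper looks roughly like the following (my recollection of the systemd-docker README; check the README for the canonical example). Type=notify plus NotifyAccess=all is what lets the wrapper hand the real container pid over to systemd:

[Service]
ExecStart=/opt/bin/systemd-docker run --rm --name %n nginx
Restart=always
RestartSec=10s
Type=notify
NotifyAccess=all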

@mwpastore OK, I understand the wrapper and --live-restore.
As for "the former works well with systemd; the latter does not": can you elaborate on that?
If I see it correctly, this:

ExecStartPre=/usr/bin/docker run -d --name container1 some-image
ExecStart=/usr/bin/docker logs -f container1

is also not a real enabler of --live-restore, or is it? So is there any advantage in these lines?

@dashesy I will consider rkt some day, but it's out of scope for now (for many reasons).

@aholbreich This works with systemd, but does not enable --live-restore:

ExecStartPre=/usr/bin/docker run -d --name container1 some-image
ExecStart=/usr/bin/docker logs -f container1

This does not work with systemd; you need a wrapper, and even with a wrapper it does not enable --live-restore:

ExecStart=/usr/bin/docker run --name container1 some-image 

This does not work with systemd

Of course it works:
ExecStart=/usr/bin/docker run --name container1 some-image

@aholbreich

Of course it works:

Please re-read the details of this issue and ibuildthecloud/systemd-docker#readme, and you will clearly see that, while systemd does launch the process with that syntax, there's much more to it than that.

I did. The initial problem is that systemd monitors the docker client and not the container.
How is this better in this case? I don't see it; in every line the docker client is used.

ExecStartPre=/usr/bin/docker run -d --name container1 some-image
ExecStart=/usr/bin/docker logs -f container1

@aholbreich If the docker client dies with just ExecStart=/usr/bin/docker run, systemd considers the unit failed while the container in fact keeps running.

@ibukanov OK, I believe you and will try it, but it's strange that if the "docker client dies", this command should keep working:

ExecStart=/usr/bin/docker logs -f container1

It's still the docker client, is it not? And if it dies, it also doesn't kill the container, leaving the same wrong state. Why does it work in this case?

I used this to tell systemd when the client dies:

#!/bin/bash

# Kill the process group we started inside the container and remove the
# pidfile (IMAGE and PIDFILE are globals; the function's arguments are unused).
function docker_cleanup {
    docker exec $IMAGE bash -c "if [ -f $PIDFILE ]; then kill -TERM -\$(cat $PIDFILE); rm $PIDFILE; fi"
}

IMAGE=$1
PIDFILE=/tmp/docker-exec-$$
shift

# On TERM/INT, kill the local docker client and clean up inside the container.
trap 'kill $PID; docker_cleanup $IMAGE $PIDFILE' TERM INT

# Run the command inside the container, recording its PID in the pidfile so
# docker_cleanup can signal its process group later.
docker exec $IMAGE bash -c "echo \"\$\$\" > $PIDFILE; exec $*" &
PID=$!

# The first wait returns when the command exits or a trapped signal fires;
# the second wait reaps the client after the trap has run.
wait $PID
trap - TERM INT
wait $PID

One big problem with -d is that the logs will not go to journald.

@aholbreich See my comments above with ExecStop/ExecStopPost, which ensure that the container stops when the client dies.

But these days, if I ever need to start a docker container from a systemd unit, I create the container outside the systemd scripts in a provision script via docker create --restart=unless-stopped --log-driver=journald ... and use something like:

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=docker start mycontainer
ExecStop=docker stop mycontainer

This avoids keeping a useless docker client running, delegates restarting a failed container to Docker, still logs to journald, and lets systemd start/stop the container to satisfy dependencies. The drawback is that stopping the container via a manual docker stop will not be reflected in systemd, but depending on the deployment that can even be useful for debugging and the like. The provision step is sketched below.
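For completeness, the provision step would be something along these lines (run once, outside systemd; the name and image are placeholders):

# Create, but do not start, the container; Docker handles restarts and
# sends the logs to journald.
docker create --name mycontainer \
    --restart=unless-stopped \
    --log-driver=journald \
    some-image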

OK, with all the drawbacks of the proposed workarounds, I'm going to continue using:

ExecStart=/usr/bin/docker run --name container1 some-image

I think docker or systemd (probably systemd) should improve this in the future. But for now, since this hasn't caused me any problems so far, I don't see any reason to overcomplicate things.

Sorry for maybe asking an unrelated question; if this is the case I'll happily delete my comment and ask the question somewhere more appropriate!

I assume this issue also applies to running containers via docker-compose? I sense docker-compose just amplifies everything by adding another layer between the container and systemd.

Yes, the problem is that docker run isn't much more than fancy communication over a socket with the docker.service process. The way systemd works, it assumes that the process under ExecStart is the service that is running. This isn't the way Docker works, and neither project is very likely to change anything (IMO there's nothing in systemd to "fix", and Docker doesn't want to have code which would make systemd understand what's going on). In the long run, using rkt (or at least a container runtime that behaves more…normally) is the better choice.

@ubergesundheit Yes, compose is already handling multiple "services", so running compose as a systemd service adds this conceptual mismatch on top of the other problems mentioned for running docker containers as systemd services.

FWIW, I think there are two main problems:

The first, lesser one is that both systemd and docker want to manage cgroups. I can't really fault systemd for managing cgroups, as it really needs to do so in order to provide the supervision capabilities I expect from a modern service manager. However, I also recognize that docker wants to do more with cgroups than systemd's API might allow.

The systemd cgroup driver is (was?) Docker's solution for people who are willing to give up a few cgroup-related features in exchange for better integration between docker and systemd (with docker "controlling" systemd in this scenario). But since docker favors the cgroupfs driver (and I'm not even sure whether the systemd driver is still available in current upstream docker), most systems will have docker and systemd managing cgroups in parallel. This currently kind of works, but I believe it won't anymore with the unified cgroup hierarchy. A proper solution might be to Delegate= a cgroup subtree to docker, as sketched below, but that probably requires a few changes to docker (changes of the sort that docker devs might be opposed to).
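As a sketch, such delegation might look like a drop-in on docker.service (hypothetical; current docker would not necessarily make use of it):

# /etc/systemd/system/docker.service.d/delegate.conf
[Service]
# Hand the cgroup subtree below docker.service over to the daemon.
Delegate=yes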

But the other, currently much bigger, problem is that docker is designed to be a service manager similar to systemd, yet does not provide a superset of systemd's features (and is only capable of managing containers). Docker combines (at least) a service manager, a container runtime, and a package manager (these components may have been split on a technical level, but from an operational perspective everything is still controlled by the one docker daemon). Since the container runtime and the service manager are inextricably linked, it is by design pretty much impossible to (cleanly and elegantly) run docker "below" another service manager.

That in itself is actually not a problem: as long as the cgroup problem above is solved, you can just run docker on your systemd-based system and use docker commands instead of systemctl. I think that all the people happily using docker are doing just that. The "problem" here really is just that systemd is, at least in some aspects, a better service manager than docker. For one thing, systemd can manage regular processes as well as ones started through a pure container runtime (rkt, podman, plain runc), so one can express dependencies between containers and regular processes; that is not possible with docker. And even on "pure container systems", where there are no dependencies between system services and containers, some of us still prefer systemd's dependency management (for example, I much prefer a service declaring readiness itself to pulling in an external "check script"). Also, I really like socket activation, and I think containers would especially profit from it.

So, there are multiple possible solutions to this problem, but they depend on how you think a system should be managed (or whether there even is a problem at all):

One approach would be to remove the service manager part from docker. That is, I believe, what Red Hat is trying to do with their docker alternative Podman (or CoreOS with rkt), albeit more for the sake of integrating with Kubernetes than systemd. That is my favored approach, and I would use podman were it not for the shortcomings of CNI (but that is really off-topic).

Another "solution" would be to just drop the notion that docker can be used with non-container software, or outside of a "dedicated container-server" scenario. Docker as it currently is already works well within that context, when you use systemd to just get docker up and running and only use docker from that point on. Though I'd personally like more dependency management than what docker+compose currently offer.

Theoretically, one could also try to extend docker's service manager part so it can manage non-containers as well. However, docker would then need to completely control systemd, which is possible but would add way too much complexity and maintenance effort. And by abstracting systemd's interface one would probably lose some of its features (just as docker loses features by going through systemd's cgroup interface).

This got way longer than I intended, but that is my layman's assessment of why there won't be a proper way to run docker from a unit file unless there is some significant change to docker's design. There might also be something systemd can do (I'm thinking of some extended interface to be "aware" of container runtimes and get supervision data from them), but in any case not without changes to docker as well.

So, the original issue is that docker run (and similar client strategies) is tied to the lifecycle of another daemon, which makes it difficult to manage with systemd. This is just the design of Docker, and it is unlikely to change.

containerd is much better suited for this: the client is not tied to the lifecycle of another daemon.
So containerd's ctr utility should generally satisfy what's needed here, or a custom client can be made if it doesn't do exactly what you want.
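As a rough sketch of that direction (assuming containerd is running, the image has already been pulled, and ctr's interface hasn't changed; mycontainer and the image are placeholders):

[Service]
# ctr run stays in the foreground, so systemd supervises the actual task.
ExecStart=/usr/bin/ctr run --rm docker.io/library/nginx:latest mycontainer
ExecStop=/usr/bin/ctr task kill mycontainer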

Another possible approach:
The upcoming containerd 1.2 release also includes a new version of the containerd shim API (v2); a new shim could be created that just defers all management to systemd. Note that such a shim does not exist today, nor have I actually messed around with it, but it is certainly a possibility.

In any case, docker/moby is not the right place for this, and containerd is very well suited for exactly this case. As such I am going to, respectfully, close this issue, as it is no longer relevant unless Moby is massively redesigned (in which case it would be something new anyway; containerd does this today).

Thanks all for your interest, feel free to ping me on slack if you have any questions/concerns about this. 🙇 👼

I know the discussion is closed, but I encountered this issue and I want to share with posterity a snippet of the solution using runc, as suggested by @cpuguy83.

OS version

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.4 LTS
Release:        16.04
Codename:       xenial

Docker version


Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:17:20 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:15:30 2018
  OS/Arch:      linux/amd64
  Experimental: false

A solution using runC

No need to install runC separately: at least for the Docker version on my OS, installing Docker already comes with a docker-runc executable.

runc version 1.0.0-rc5
commit: 4fc53a81fb7c994640722ac585fa9ca548971871
spec: 1.0.0

For docker-runc to run, you have to provide it with two things: a folder named rootfs, which contains an export of the docker container you want to launch with runC, and a config.json file, which represents all the arguments you would give to the docker engine when using docker run, but in the OCI runtime spec format. A provisioning sketch follows.
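For example, one way to produce both (a sketch; the paths and image are placeholders, and runc's spec subcommand only generates a default config.json that you will still need to edit):

# Flatten a container's filesystem into rootfs/ and generate a default
# OCI config.json next to it.
mkdir -p /opt/myservice/rootfs
docker export "$(docker create some-image)" | tar -x -C /opt/myservice/rootfs
cd /opt/myservice
docker-runc spec   # writes config.json; adjust process args, env, mounts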

Here are some links that helped me do this:

After creating your rootfs directory and config.json, you can create your systemd configuration based on my template (it works like a charm for me):

[Unit]
Description=<name> Container
After=docker.service
Requires=docker.service

[Service]
Type=forking
Restart=always
RestartSec=5s
WorkingDirectory=<the directory where rootfs and config.json are>
ExecStart=/usr/bin/docker-runc run --detach <name>
ExecStop=/usr/bin/docker-runc delete --force <name>

[Install]
WantedBy=multi-user.target

Thanks!

a folder name rootfs which contains an export of the docker container you want to launch with runC

You could probably use docker export | tar -x to create that in an ExecStartPre.