moby / swarmkit

A toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, Raft-based consensus, task scheduling, and more.

Deployment not adhering to update_config parallelism, stopping containers prematurely

ironhalik opened this issue

I have a service with the following deployment config:

Service Mode:	Global
UpdateStatus:
 State:		updating
 Started:	About a minute ago
 Message:	update in progress
Placement:
UpdateConfig:
 Parallelism:	1
 On failure:	rollback
 Monitoring Period: 5s
 Max failure ratio: 0
 Update order:      stop-first
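
For reference, a sketch of how this deploy configuration could be expressed in the compose file (the service name is a placeholder; the image and the remaining values mirror the inspect output above):

    services:
      app:
        image: registry.project.co/project/some_app/app:3d042a72
        deploy:
          mode: global
          update_config:
            parallelism: 1
            order: stop-first
            failure_action: rollback
            monitor: 5s
            max_failure_ratio: 0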

The service runs in Global mode on three nodes and has a health check configured, which waits for an application server to respond. This usually takes about a minute.

    healthcheck:
      test: curl -sfA "healthcheck" 127.0.0.1/health
      start_period: 120s
      interval: 10s
      timeout: 3s
      retries: 3

Deployment is done using:

docker stack deploy --prune --with-registry-auth -c ops/docker-compose.base.yml -c ${STACK_COMPOSE_FILE} ${STACK_NAME}

The problem is that during deployment, all the containers on all three nodes get stopped one after another without waiting for other/previous containers to reach a healthy state. This, of course, results in downtime.

Here's the service state during deployment. I have three containers in a "Preparing" state, and the rest have been shut down.

# docker service ls
ccfp2dk17dao   some_app                       global       0/3        registry.project.co/project/some_app/app:3d042a72

# docker service ps some_app
ID             NAME                                           IMAGE                                                 NODE           DESIRED STATE   CURRENT STATE              ERROR                       PORTS
9sb43fxulrul   some_app.kywz5ibig0bpibq39ngmle4in       registry.project.co/project/some_app/app:3d042a72   backend01   Running         Preparing 25 seconds ago
plvgnderyrft    \_ some_app.kywz5ibig0bpibq39ngmle4in   registry.project.co/project/some_app/app:1d6fef56   backend01   Shutdown        Shutdown 11 seconds ago
4q0ozinrnygr    \_ some_app.kywz5ibig0bpibq39ngmle4in   registry.project.co/project/some_app/app:192856c4   backend01   Shutdown        Shutdown 9 minutes ago
clihr34x9dbc   some_app.tw3924jw42wlrpaqwtcl94z00       registry.project.co/project/some_app/app:3d042a72   backend02   Running         Preparing 28 seconds ago
yxpzyheq00ye    \_ some_app.tw3924jw42wlrpaqwtcl94z00   registry.project.co/project/some_app/app:1d6fef56   backend02   Shutdown        Shutdown 23 seconds ago
44ldh3vfe7m8    \_ some_app.tw3924jw42wlrpaqwtcl94z00   registry.project.co/project/some_app/app:192856c4   backend02   Shutdown        Shutdown 10 minutes ago
os1r1sd92vq1   some_app.yv3n4ex7514cxigx1oyj0hka3       registry.project.co/project/some_app/app:3d042a72   backend03   Running         Preparing 58 seconds ago
ywctjs3psb2s    \_ some_app.yv3n4ex7514cxigx1oyj0hka3   registry.project.co/project/some_app/app:1d6fef56   backend03   Shutdown        Shutdown 54 seconds ago
9zf0vkeuuqny    \_ some_app.yv3n4ex7514cxigx1oyj0hka3   registry.project.co/project/some_app/app:192856c4   backend03   Shutdown        Shutdown 11 minutes ago

If I inspect one of the prematurely stopped containers, I see:

        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2023-04-27T17:01:10.930540738Z",
            "FinishedAt": "2023-04-27T17:12:22.720695291Z",
            "Health": {
                "Status": "unhealthy",
                "FailingStreak": 0,
                "Log": [
                    {
                        "Start": "2023-04-27T19:11:39.812486279+02:00",
                        "End": "2023-04-27T19:11:39.961464877+02:00",
                        "ExitCode": 0,
                        "Output": "OK!"
                    },
                    [...]
                  ]

Interestingly enough, I have another service configured identically. Same health check, same deploy config, and on the same cluster. The only difference is that it's a different app and boots up a bit quicker. It deploys correctly.

> all the containers on all three nodes get stopped one after another without waiting for other/previous containers to reach a healthy state.

I would expect that to happen when you configure your update order to be "stop-first". (-:
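
If the intent is to keep the previous task serving traffic until its replacement passes the health check, the usual lever is switching the update order to start-first; a minimal sketch, assuming the same service block as above:

    services:
      app:
        deploy:
          update_config:
            order: start-first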