Deployment not adhering to update_config parallelism, stopping containers prematurely
ironhalik opened this issue
I have a service with the following deployment config:
Service Mode: Global
UpdateStatus:
  State: updating
  Started: About a minute ago
  Message: update in progress
Placement:
UpdateConfig:
  Parallelism: 1
  On failure: rollback
  Monitoring Period: 5s
  Max failure ratio: 0
  Update order: stop-first
The service runs in Global mode on three nodes and has a health check configured that waits for the application server to respond; the app usually takes about a minute to come up.
healthcheck:
  test: curl -sfA "healthcheck" 127.0.0.1/health
  start_period: 120s
  interval: 10s
  timeout: 3s
  retries: 3
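The deploy section of the compose file isn't included in the report, but judging from the UpdateConfig shown above it presumably looks roughly like this (a sketch reconstructed from the inspect output, not the actual file):

deploy:
  mode: global
  update_config:
    parallelism: 1
    order: stop-first
    failure_action: rollback
    monitor: 5s
    max_failure_ratio: 0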
Deployment is done with:
docker stack deploy --prune --with-registry-auth -c ops/docker-compose.base.yml -c ${STACK_COMPOSE_FILE} ${STACK_NAME}
The problem is that during deployment, all the containers on all three nodes get stopped one after another without waiting for other/previous containers to reach a healthy state. This, of course, results in downtime.
Here's the service state during deployment. Three containers are in a "Preparing" state and the rest have already been shut down.
# docker service ls
ccfp2dk17dao some_app global 0/3 registry.project.co/project/some_app/app:3d042a72
# docker service ps some_app
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
9sb43fxulrul some_app.kywz5ibig0bpibq39ngmle4in registry.project.co/project/some_app/app:3d042a72 backend01 Running Preparing 25 seconds ago
plvgnderyrft \_ some_app.kywz5ibig0bpibq39ngmle4in registry.project.co/project/some_app/app:1d6fef56 backend01 Shutdown Shutdown 11 seconds ago
4q0ozinrnygr \_ some_app.kywz5ibig0bpibq39ngmle4in registry.project.co/project/some_app/app:192856c4 backend01 Shutdown Shutdown 9 minutes ago
clihr34x9dbc some_app.tw3924jw42wlrpaqwtcl94z00 registry.project.co/project/some_app/app:3d042a72 backend02 Running Preparing 28 seconds ago
yxpzyheq00ye \_ some_app.tw3924jw42wlrpaqwtcl94z00 registry.project.co/project/some_app/app:1d6fef56 backend02 Shutdown Shutdown 23 seconds ago
44ldh3vfe7m8 \_ some_app.tw3924jw42wlrpaqwtcl94z00 registry.project.co/project/some_app/app:192856c4 backend02 Shutdown Shutdown 10 minutes ago
os1r1sd92vq1 some_app.yv3n4ex7514cxigx1oyj0hka3 registry.project.co/project/some_app/app:3d042a72 backend03 Running Preparing 58 seconds ago
ywctjs3psb2s \_ some_app.yv3n4ex7514cxigx1oyj0hka3 registry.project.co/project/some_app/app:1d6fef56 backend03 Shutdown Shutdown 54 seconds ago
9zf0vkeuuqny \_ some_app.yv3n4ex7514cxigx1oyj0hka3 registry.project.co/project/some_app/app:192856c4 backend03 Shutdown Shutdown 11 minutes ago
If I inspect one of the prematurely stopped containers, I see:
"State": {
"Status": "exited",
"Running": false,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 0,
"ExitCode": 0,
"Error": "",
"StartedAt": "2023-04-27T17:01:10.930540738Z",
"FinishedAt": "2023-04-27T17:12:22.720695291Z",
"Health": {
"Status": "unhealthy",
"FailingStreak": 0,
"Log": [
{
"Start": "2023-04-27T19:11:39.812486279+02:00",
"End": "2023-04-27T19:11:39.961464877+02:00",
"ExitCode": 0,
"Output": "OK!"
},
[...]
]
Interestingly enough, I have another service configured identically: same health check, same deploy config, on the same cluster. The only difference is that it's a different app that boots up a bit quicker. It deploys correctly.
all the containers on all three nodes get stopped one after another without waiting for other/previous containers to reach a healthy state.
I would expect that to happen when you configure your update order to be "stop-first". (-:
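For context (and not as a claim about whether this is the bug): with stop-first, Swarm stops the old task on a node before starting its replacement, so a per-node gap is expected; parallelism only limits how many tasks are updated at the same time. If the goal is to keep the previous container serving until the new one is healthy, the usual alternative is a start-first update order. A minimal sketch, reusing the values above:

deploy:
  mode: global
  update_config:
    parallelism: 1
    order: start-first   # start the replacement before stopping the old task
    failure_action: rollback
    monitor: 5s
    max_failure_ratio: 0

Note that with start-first the old and new task briefly run side by side on each node, so host-mode published ports or other exclusive resources need to tolerate that overlap.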