docker volume plugin fails on reboot, due to docker's watchdog timeout


diablodale opened this issue 5 years ago

Summary

Often after reboot of a docker host, Rexray managed docker volume plugins will fail to automatically enable. This results in disabled volume plugins, containers unable to access those volumes, and probably a cascade of catastrophic failures.

The cause I isolated was a default watchdog timer in Docker. The fix is to change the length of that timer. See below.

Bug Reports

Version

docker.io/rexray/ebs:0.11.4
docker.io/rexray/s3fs:0.11.4
(perhaps all of them)

Steps To Reproduce

Use AWS. However, other hosts, cloud services, and your own private servers may also reproduce this issue.

Create three S3 buckets
Create two EBS volumes
Create an IAM role that is granted permissions to access the S3 and EBS resources. Follow docs like https://rexray.readthedocs.io/en/stable/user-guide/storage-providers/aws/#troubleshooting
Create and start a t3.nano instance from the Amazon Linux 2 ECS-optimized AMI, with the IAM role you created attached. This should give you Docker 18.06
SSH to that instance

Install the Rexray EBS and S3FS managed volume plugins. Replace us-east-1 with your AWS region. Change the s3 url and endpoint options if your region is not us-east-1.

docker plugin install --grant-all-permissions rexray/ebs:0.11.4 \
 REXRAY_LOGLEVEL=debug \
 LIBSTORAGE_INTEGRATION_VOLUME_OPERATIONS_MOUNT_ROOTPATH=/ \
 LINUX_VOLUME_ROOTPATH=/ \
 LINUX_VOLUME_FILEMODE=0750 \
 EBS_REGION=us-east-1
docker plugin install --grant-all-permissions rexray/s3fs:0.11.4 \
 REXRAY_LOGLEVEL=debug \
 LIBSTORAGE_INTEGRATION_VOLUME_OPERATIONS_MOUNT_ROOTPATH=/ \
 LINUX_VOLUME_ROOTPATH=/ \
 LINUX_VOLUME_FILEMODE=0750 \
 S3FS_REGION=us-east-1 \
 S3FS_OPTIONS="allow_other,umask=0027,noexec,mp_umask=0027,uid=0,gid=0,noatime,use_path_request_style,iam_role=auto,url=https://s3.amazonaws.com,endpoint=us-east-1"

Verify both plugins report they installed. Fix any issues before continuing to the next step.
Verify you can see the names of all your S3 and EBS volumes using docker volume ls. Fix any issues before continuing to the next step.
Reboot your instance using something like sudo reboot now
ssh to your instance
View the status of volume plugins with docker plugin ls
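
The status check in the last step can be scripted so that repeated reboots are easier to compare. This is a minimal sketch, assuming your Docker version supports the --format option of docker plugin ls; the function name list_disabled_plugins is hypothetical and not part of Docker or Rexray.

```shell
# list_disabled_plugins: print the name of every installed plugin that is
# currently not enabled. Hypothetical helper; it relies on the Go-template
# output of `docker plugin ls --format`.
list_disabled_plugins() {
  docker plugin ls --format '{{.Name}} {{.Enabled}}' |
    awk '$2 == "false" { print $1 }'
}
```

Run list_disabled_plugins after each reboot. Empty output means every plugin came back enabled.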

Actual Behavior

After reboot, some or all of the rexray docker volume plugins are disabled. ☹ If they are all enabled, keep rebooting...some will eventually be disabled.

In /var/log/messages (or another log file if you didn't use the AMI described above) you will find entries like the following. The most important entry for this scenario is the "received signal; shutting down" entry.


Sep  8 13:43:06 ip-10-10-35-252 dockerd: time="2019-09-08T13:43:06Z" level=error msg="time=\"2019-09-08T13:43:06Z\" level=info msg=\"received signal; shutting down\" signal=terminated time=1567950186194 " plugin=edbc0aa2bad9f7c519d465cfea75eb06e037195e30b2f5af7b74dd5f247bfc61
Sep  8 13:43:06 ip-10-10-35-252 dockerd: time="2019-09-08T13:43:06Z" level=error msg="time=\"2019-09-08T13:43:06Z\" level=info msg=\"received exit signal\" signal=terminated time=1567950186195 " plugin=edbc0aa2bad9f7c519d465cfea75eb06e037195e30b2f5af7b74dd5f247bfc61

Sep  8 13:43:15 ip-10-10-35-252 dockerd: time="2019-09-08T13:43:15Z" level=error msg="error: service startup failed: agent: mod init failed: context canceled" plugin=7043bd78c2f1c8b2972782fdb4381e0d90c54720c3801bd584e8d8d81ccc233c
Sep  8 13:43:16 ip-10-10-35-252 dockerd: time="2019-09-08T13:43:16.202499786Z" level=error msg="failed to enable plugin" error="dial unix /run/docker/plugins/7043bd78c2f1c8b2972782fdb4381e0d90c54720c3801bd584e8d8d81ccc233c/rexray.sock: connect: no such file or directory" id=7043bd78c2f1c8b2972782fdb4381e0d90c54720c3801bd584e8d8d81ccc233c
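
To pull just these entries out of a busy log, something like the following may help. It is a sketch only: the function name find_plugin_shutdowns is hypothetical, the two search strings are taken from the log lines above, and reading /var/log/messages may require root on your system.

```shell
# find_plugin_shutdowns: show log lines indicating that a plugin received
# the termination signal or failed to enable.
# Usage: find_plugin_shutdowns [LOGFILE]   (defaults to /var/log/messages)
find_plugin_shutdowns() {
  logfile=${1:-/var/log/messages}
  grep -E 'received signal; shutting down|failed to enable plugin' "$logfile"
}
```

The plugin= and id= fields in the matching lines identify which plugin was affected.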

Expected Behavior

After reboot, all rexray docker volume plugins should be enabled.

Cause

After rebooting, the OS starts dockerd. dockerd has code to also start any plugins that were previously enabled. dockerd iterates through each of them and calls a function enable(p *v2.Plugin, c *controller, force bool).

This is the code in Docker 18.06 https://github.com/docker/docker-ce/blob/a464a87eb6d811607638ebcd9f186063b1b9b262/components/engine/plugin/manager_linux.go#L69-L82

The code then calls plugins.NewClientWithTimeout() with a default timeout of 30s. This timeout is an overarching watchdog timeout that will kill the plugin if that timeout is exceeded during the NewClient startup. This is mentioned in passing at https://docs.docker.com/engine/reference/commandline/plugin_enable/#options

The Rexray managed plugins like EBS and S3 do a lot of work during startup. This is a significant difference from some other plugins that do very little work during startup and instead do all their work on attach/mount.

During busy times on the instance (such as reboots), the time needed for the services within the managed volume plugins to start up, for the network to become fully functional, for the API calls to EBS/S3 to complete, and for the results to be compiled might exceed this default 30s watchdog timer. If that occurs, the watchdog calls shutdownPlugin(), which then uses Signal(pluginID, int(unix.SIGTERM)) to send SIGTERM to the volume plugin.

While isolating this, I saw log entries where the S3 HTTPS APIs returned results that looked OK. Yet I also saw the "received signal" entries, which caused the S3 managed plugin to shut down. This is because the watchdog timer runs in parallel with the plugin's startup. The S3 plugin was running correctly, but it wasn't getting its result back to the watchdog code within the 30 seconds.
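
The race can be illustrated outside Docker with a toy model. This is not Docker's code: here the timeout utility (assumed to be GNU coreutils) plays the watchdog, sleep plays a plugin whose startup takes a given number of seconds, and the function name slow_start is hypothetical.

```shell
# Toy model of the race, not Docker code.
# slow_start WATCHDOG_SECONDS STARTUP_SECONDS
# `timeout` sends SIGTERM to the "plugin" (sleep) if it has not finished
# within WATCHDOG_SECONDS, regardless of whether it was working correctly.
slow_start() {
  timeout -s TERM "$1" sleep "$2"
}
```

With GNU timeout, slow_start 1 3 exits with status 124 because the deadline fired first, while slow_start 3 1 exits with status 0. The "plugin" does exactly the same work in both cases; only the deadline differs.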

Fix

The easiest fix that worked for me is to make this watchdog timeout longer. Of course, the underlying problem is a race condition, and more robust solutions would be possible if the core Docker team were to make code changes.

This timeout can only be changed using the docker plugin enable command. With Docker 18.06, it cannot be set through the docker plugin install command.

To change this watchdog timer from its default 30s to a longer timeout, use the following. This example is for the s3fs plugin and changes the timeout to 120s. The same approach applies to other plugins, and a different timeout might better suit your needs. Notice the use of the --disable and --timeout options.

docker plugin install --disable rexray/s3fs:0.11.4 [options...]
docker plugin enable --timeout 120 rexray/s3fs:0.11.4
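
If several plugins need the same treatment, the enable step can be wrapped in a small loop. This is a sketch under the assumption that each plugin is currently disabled (for example, installed with --disable as above); the function name enable_with_timeout is hypothetical.

```shell
# enable_with_timeout: enable each named plugin with the given watchdog
# timeout in seconds. Stops at the first plugin that fails to enable.
# Usage: enable_with_timeout SECONDS PLUGIN...
enable_with_timeout() {
  timeout_s=$1
  shift
  for plugin in "$@"; do
    docker plugin enable --timeout "$timeout_s" "$plugin" || return 1
  done
}
```

For example, enable_with_timeout 120 rexray/ebs:0.11.4 rexray/s3fs:0.11.4 runs docker plugin enable --timeout 120 once per plugin.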

Related issues

#912
#1191
moby/moby#37426