moby / swarmkit

A toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more.

Large snapshot causes adding a new manager to fail

xinfengliu opened this issue

A large snapshot (e.g. a few hundred MB) causes adding a new manager to fail.

We used to have a fix, #2458, but it is not enough. There is also a SendTimeout, which appears to be hardcoded to 2 seconds in sendProcessMessage in manager/state/raft/transport/peer.go.

This issue can be easily reproduced. The steps are as follows:

  • Create many large objects in the swarm
for i in $(seq 1 500)
do
 dd if=/dev/urandom bs=900k count=1 2>/dev/null | docker config create foo${i} -
done
  • Trigger snapshotting
docker swarm update --snapshot-interval 1
docker network create -d overlay dummy
docker network rm dummy
docker swarm update --snapshot-interval 10000
  • Verify the snapshot is big enough
/var/lib/docker/swarm/raft/snap-v3-encrypted:
-rw-r--r--. 1 root root 461774425 Jan 31 11:54 000000000000000b-000000000000042e.snap
  • Add a new manager node.

You will see an endless retry loop in the docker logs:

On the leader node:

Jan 31 11:57:50 centos7 dockerd[4644]: time="2023-01-31T11:57:50.651215634+08:00" level=error msg="error streaming message to peer" error=EOF
Jan 31 11:57:52 centos7 dockerd[4644]: time="2023-01-31T11:57:52.655983276+08:00" level=error msg="error streaming message to peer" error=EOF
Jan 31 11:57:54 centos7 dockerd[4644]: time="2023-01-31T11:57:54.660918294+08:00" level=error msg="error streaming message to peer" error=EOF

On the newly added manager node:

Jan 31 11:57:51 centos7-1 dockerd[1326]: time="2023-01-31T11:57:51.009851258+08:00" level=error msg="error while reading from stream" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Jan 31 11:57:53 centos7-1 dockerd[1326]: time="2023-01-31T11:57:53.014080429+08:00" level=error msg="error while reading from stream" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Jan 31 11:57:55 centos7-1 dockerd[1326]: time="2023-01-31T11:57:55.019443613+08:00" level=error msg="error while reading from stream" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"

I changed SendTimeout to 10s, rebuilt the docker daemon, and tested again; this time, adding the manager node succeeded. (Given that the snapshot size is 460MB and the network bandwidth between nodes is 1 Gbps in my environment, 10s should be enough.)

if opts.SendTimeout == 0 {
    opts.SendTimeout = 2 * time.Second
}
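
For reference, the change I tested was just bumping that default, along the lines of the sketch below (not a proposed final value):

// Sketch of the local change used for the test above: raise the default so a
// ~460MB snapshot has time to finish streaming. Whether 10s (or any fixed
// default) is the right long-term answer is exactly the question below.
if opts.SendTimeout == 0 {
    opts.SendTimeout = 10 * time.Second
}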

Is changing the default SendTimeout a reasonable fix? Any concerns?

I guess it would be useful to make this configurable in the future? Maybe add support for specifying it in the join command in docker/swarmkit?
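
Purely as an illustration of what a configurable value could look like, here is a minimal sketch; the environment variable, helper, and package placement are hypothetical, and a real implementation would presumably be plumbed through the join/init flow instead:

package sketch // hypothetical placement, for illustration only

import (
    "os"
    "time"
)

// raftSendTimeout returns the send timeout for the raft transport, allowing
// an override via a made-up environment variable and otherwise falling back
// to today's hardcoded 2-second default.
func raftSendTimeout() time.Duration {
    if v := os.Getenv("SWARM_RAFT_SEND_TIMEOUT"); v != "" {
        if d, err := time.ParseDuration(v); err == nil {
            return d
        }
    }
    return 2 * time.Second
}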

We're discussing a couple of approaches on my team; once some more investigation is done, I'll have @dperny update this issue.

OK, so I've been digging into this. We have, supposedly, already fixed this twice: once as a kludge, by simply increasing the timeout as listed here, and a second time by removing that kludge and fixing it the correct way. Unfortunately, I believe the correct way has a bug, and so it never actually properly fixed the problem.

Kludge fix: #2391
Correct fix: #2458

Basically, what the correct fix does is this: when a snapshot is being sent, we use gRPC's streaming functionality, open a streaming RPC, break the snapshot into bite-sized chunks, and reassemble them on the other side. This circumvents both the gRPC message size limit and the timeout problem.
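
For anyone skimming, here is a rough sketch of that chunking idea. The stream interface and chunk size are made up for illustration; the real message types and sizes live in manager/state/raft/transport. The point is just that no single gRPC message ever carries the whole snapshot.

package sketch

// chunkStream is a made-up stand-in for the generated client of the
// streaming RPC.
type chunkStream interface {
    Send(chunk []byte) error // send one piece of the snapshot
    CloseAndRecv() error     // finish the stream and wait for the ack
}

// sendInChunks splits a large snapshot into fixed-size pieces and sends each
// piece as its own message; the receiver appends them back together before
// handing the reassembled snapshot to raft.
func sendInChunks(s chunkStream, snapshot []byte, chunkSize int) error {
    for off := 0; off < len(snapshot); off += chunkSize {
        end := off + chunkSize
        if end > len(snapshot) {
            end = len(snapshot)
        }
        if err := s.Send(snapshot[off:end]); err != nil {
            return err
        }
    }
    return s.CloseAndRecv()
}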

Unfortunately, these two lines are left in:

ctx, cancel := context.WithTimeout(ctx, p.tr.config.SendTimeout)
defer cancel()

You may notice that a timeout is set before we enter the loop where we're splitting up the snapshot. This means it applies to sending the entire snapshot, even if it is broken into chunks.
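
Here is a tiny standalone illustration of that pattern, unrelated to the swarmkit code itself: one WithTimeout set before a loop bounds the total time for all iterations, not each one.

package main

import (
    "context"
    "fmt"
    "time"
)

func main() {
    // One 2-second deadline set before the loop, mirroring the shape of the
    // code above: it covers all iterations combined, not each one.
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    for i := 0; i < 10; i++ {
        select {
        case <-time.After(700 * time.Millisecond): // stand-in for sending one chunk
            fmt.Println("sent chunk", i)
        case <-ctx.Done():
            // Fires after roughly 2 seconds, long before all ten chunks are
            // sent, with "context deadline exceeded".
            fmt.Println("gave up at chunk", i, "error:", ctx.Err())
            return
        }
    }
}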

Why do we get context deadline exceeded on the receiver, though? Because of gRPC magic that propagates that timeout to the context in the receiver's stream, causing it to time out here:

recvdMsg, err = stream.Recv()
if err == io.EOF {
    break
} else if err != nil {
    log.G(stream.Context()).WithError(err).Error("error while reading from stream")
    return err
}

The fix is simple: remove the timeout on the stream. I tested this by setting up a pair of VMs with a 100Mbps network link and following the repro steps in the original post. Without any changes, I can confirm that I observe exactly the stated behavior. After commenting out the two lines that set the context timeout and using that build of the binary on the two VMs, the second node successfully joins the cluster when following the same steps.
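
In peer.go terms, the build I tested amounts to something like the sketch below; it is not the final patch, and it keeps a plain cancel in place of the deadline so the stream can still be torn down when the send returns.

// Sketch of the tested change: drop the per-stream deadline so the chunked
// snapshot can take as long as it needs, while keeping a cancel so the
// stream is still cleaned up when this function returns.
ctx, cancel := context.WithCancel(ctx)
defer cancel()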

All that's left to do is verify that no overarching timeout on this send operation is necessary (and that removing the timeout is harmless) and figure out how the heck to write a test for this.