moby / swarmkit

A toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more.

Large snapshot causes adding a new manager to fail

xinfengliu opened this issue

A large snapshot (e.g. a few hundred MB) causes adding a new manager to fail.

We used to have a fix, #2458, but it is not enough. There is also a SendTimeout, which appears to be hardcoded to 2 seconds in sendProcessMessage in manager/state/raft/transport/peer.go.

This issue can be easily reproduced. The steps are as follows:

  • Create many large objects in the swarm
for i in $(seq 1 500)
do
 dd if=/dev/urandom bs=900k count=1 2>/dev/null | docker config create foo${i} -
done
  • Trigger snapshotting
docker swarm update --snapshot-interval 1
docker network create -d overlay dummy
docker network rm dummy
docker swarm update --snapshot-interval 10000
  • Verify the snapshot is big enough
/var/lib/docker/swarm/raft/snap-v3-encrypted:
-rw-r--r--. 1 root root 461774425 Jan 31 11:54 000000000000000b-000000000000042e.snap
  • Add a new manager node.

You will see an endless retry loop in the docker logs:

On the leader node:

Jan 31 11:57:50 centos7 dockerd[4644]: time="2023-01-31T11:57:50.651215634+08:00" level=error msg="error streaming message to peer" error=EOF
Jan 31 11:57:52 centos7 dockerd[4644]: time="2023-01-31T11:57:52.655983276+08:00" level=error msg="error streaming message to peer" error=EOF
Jan 31 11:57:54 centos7 dockerd[4644]: time="2023-01-31T11:57:54.660918294+08:00" level=error msg="error streaming message to peer" error=EOF

On the newly added manager node:

Jan 31 11:57:51 centos7-1 dockerd[1326]: time="2023-01-31T11:57:51.009851258+08:00" level=error msg="error while reading from stream" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Jan 31 11:57:53 centos7-1 dockerd[1326]: time="2023-01-31T11:57:53.014080429+08:00" level=error msg="error while reading from stream" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Jan 31 11:57:55 centos7-1 dockerd[1326]: time="2023-01-31T11:57:55.019443613+08:00" level=error msg="error while reading from stream" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"

I changed SendTimeout to 10s, rebuilt the docker daemon, and tested again; this time, adding the manager node succeeded. (Given that the snapshot size is 460MB and the network bandwidth between nodes is 1 Gbps in my environment, 10s should be enough.)

if opts.SendTimeout == 0 {
    opts.SendTimeout = 2 * time.Second
}
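
For reference, the change I tested was just bumping that default, along the lines of the sketch below (not a proposed final value):

// Sketch of the local change used for the test above: raise the default so a
// ~460MB snapshot has time to finish streaming. Whether 10s (or any fixed
// default) is the right long-term answer is exactly the question below.
if opts.SendTimeout == 0 {
    opts.SendTimeout = 10 * time.Second
}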

Is changing the default SendTimeout a reasonable fix? Any concerns?

I guess it would be useful to make this configurable in the future? Maybe add support for specifying it in the join command in docker/swarmkit?
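
Purely as an illustration of what a configurable value could look like, here is a minimal sketch; the environment variable, helper, and package placement are hypothetical, and a real implementation would presumably be plumbed through the join/init flow instead:

package sketch // hypothetical placement, for illustration only

import (
    "os"
    "time"
)

// raftSendTimeout returns the send timeout for the raft transport, allowing
// an override via a made-up environment variable and otherwise falling back
// to today's hardcoded 2-second default.
func raftSendTimeout() time.Duration {
    if v := os.Getenv("SWARM_RAFT_SEND_TIMEOUT"); v != "" {
        if d, err := time.ParseDuration(v); err == nil {
            return d
        }
    }
    return 2 * time.Second
}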

We're discussing a couple of approaches on my team; once some more investigation is done, I'll have @dperny update this issue.

OK, so I've been digging into this. We have, supposedly, already fixed this twice: once as a kludge, by simply increasing the timeout as listed here, and a second time by removing that kludge and fixing it the correct way. Unfortunately, I believe the correct way has a bug, and so it never actually properly fixed the problem.

Kludge fix: #2391
Correct fix: #2458

Basically, what the correct fix does is this: when a snapshot is being sent, we use gRPC's streaming functionality, open a streaming RPC, break the snapshot into bite-sized chunks, and reassemble them on the other side. This circumvents both the gRPC message size limit and the timeout problem.
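
For anyone skimming, here is a rough sketch of that chunking idea. The stream interface and chunk size are made up for illustration; the real message types and sizes live in manager/state/raft/transport. The point is just that no single gRPC message ever carries the whole snapshot.

package sketch

// chunkStream is a made-up stand-in for the generated client of the
// streaming RPC.
type chunkStream interface {
    Send(chunk []byte) error // send one piece of the snapshot
    CloseAndRecv() error     // finish the stream and wait for the ack
}

// sendInChunks splits a large snapshot into fixed-size pieces and sends each
// piece as its own message; the receiver appends them back together before
// handing the reassembled snapshot to raft.
func sendInChunks(s chunkStream, snapshot []byte, chunkSize int) error {
    for off := 0; off < len(snapshot); off += chunkSize {
        end := off + chunkSize
        if end > len(snapshot) {
            end = len(snapshot)
        }
        if err := s.Send(snapshot[off:end]); err != nil {
            return err
        }
    }
    return s.CloseAndRecv()
}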

Unfortunately, these two lines are left in:

ctx, cancel := context.WithTimeout(ctx, p.tr.config.SendTimeout)
defer cancel()

You may notice that a timeout is set before we enter the loop where we're splitting up the snapshot. This means it applies to sending the entire snapshot, even if it is broken into chunks.
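
Here is a tiny standalone illustration of that pattern, unrelated to the swarmkit code itself: one WithTimeout set before a loop bounds the total time for all iterations, not each one.

package main

import (
    "context"
    "fmt"
    "time"
)

func main() {
    // One 2-second deadline set before the loop, mirroring the shape of the
    // code above: it covers all iterations combined, not each one.
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    for i := 0; i < 10; i++ {
        select {
        case <-time.After(700 * time.Millisecond): // stand-in for sending one chunk
            fmt.Println("sent chunk", i)
        case <-ctx.Done():
            // Fires after roughly 2 seconds, long before all ten chunks are
            // sent, with "context deadline exceeded".
            fmt.Println("gave up at chunk", i, "error:", ctx.Err())
            return
        }
    }
}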

Why do we get context deadline exceeded on the receiver, though? Because of gRPC magic that propagates that timeout to the context in the receiver's stream, causing it to time out here:

recvdMsg, err = stream.Recv()
if err == io.EOF {
    break
} else if err != nil {
    log.G(stream.Context()).WithError(err).Error("error while reading from stream")
    return err
}

The fix is simple: remove the timeout on the stream. I tested this by setting up a pair of VMs with a 100Mbps network link and following the repro steps in the original post. Without any changes, I can confirm that I observe exactly the stated behavior. After commenting out the two lines that set the context timeout and using that build of the binary on the two VMs, the second node successfully joins the cluster when following the same steps.
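
In peer.go terms, the build I tested amounts to something like the sketch below; it is not the final patch, and it keeps a plain cancel in place of the deadline so the stream can still be torn down when the send returns.

// Sketch of the tested change: drop the per-stream deadline so the chunked
// snapshot can take as long as it needs, while keeping a cancel so the
// stream is still cleaned up when this function returns.
ctx, cancel := context.WithCancel(ctx)
defer cancel()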

All that's left to do is verify that no overarching timeout on this send operation is necessary (and that removing the timeout is harmless) and figure out how the heck to write a test for this.