capnproto / go-capnp

Cap'n Proto library and code generator for Go

Home Page:https://capnproto.org

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Flaky RPC tests

zenhack opened this issue · comments

There are a few tests in the RPC package that usually pass, but fail intermittently:

  • TestRecvCancel (hangs)
  • TestSendCancel (maybe; see #318 (comment))
  • TestHandleReturn_regression (Sometimes hangs, sometimes actually fails, see #318 (comment) )
  • TestBootstrapReceiverAnswerRpc: see #318 (comment)
  • TestRecvAbort:
level0_test.go:148: 
        	Error Trace:	level0_test.go:148
        	Error:      	Received unexpected error:
        	            	context deadline exceeded
        	Test:       	TestRecvAbort
        	Messages:   	should resolve bootstrap capability

There may be one or two more. These crop up running the tests locally often enough to be easily reproducible but not every time, so at some point they managed to slip past CI and get merged. We should fix these; I don't want to keep having to wonder when working on something else whether I've broken something or a test is just flaky; this slows down development across the board.

Observations:

  • TestRecvCancel only ever hangs when run in parallel with other tests. It passes consistently if you either do go test -run TestRecvCancel or delete the t.Parallel() at the top of the function. I don't yet understand why that would happen; it doesn't look like there are any shared resources with other tests that might interfere, and there's no timeouts or anything...
  • The timeout in TestRecvAbort trigger the above failure isn't theoretically impossible, though a 1 second timeout seems generous. I'm having trouble reproducing the issue right now, but if I was able to reproduce it semi-reliably, my first thought would be to try commenting out t.Parallel() to see if that fixes anything.

It seems likely that some of these are race conditions that are sufficiently timing dependent that the noise from other tests is needed to trigger them.


Also, I managed to trigger a failure other than the above:

--- FAIL: TestBootstrapReceiverAnswerRpc (0.00s)
panic: test.capnp:CapArgsTest.call: rpc: connection closed [recovered]
	panic: test.capnp:CapArgsTest.call: rpc: connection closed

goroutine 27 [running]:
testing.tRunner.func1.2({0x667820, 0xc000586060})
	/usr/lib/go/src/testing/testing.go:1396 +0x24e
testing.tRunner.func1()
	/usr/lib/go/src/testing/testing.go:1399 +0x39f
panic({0x667820, 0xc000586060})
	/usr/lib/go/src/runtime/panic.go:884 +0x212
capnproto.org/go/capnp/v3/rpc.chkfatal(...)
	/home/isd/src/go-wk/go-capnproto2/rpc/receiveranswer_test.go:51
capnproto.org/go/capnp/v3/rpc.TestBootstrapReceiverAnswerRpc(0xc0000eed00)
	/home/isd/src/go-wk/go-capnproto2/rpc/receiveranswer_test.go:87 +0x4f3
testing.tRunner(0xc0000eed00, 0x6ca630)
	/usr/lib/go/src/testing/testing.go:1446 +0x10b
created by testing.(*T).Run
	/usr/lib/go/src/testing/testing.go:1493 +0x35f
exit status 2
FAIL	capnproto.org/go/capnp/v3/rpc	0.008s

TestHandleReturn_regression sometimes also fails (without hanging), with:

--- FAIL: TestHandleReturn_regression (0.00s)
    --- FAIL: TestHandleReturn_regression/BootstrapWithExpiredContext (0.00s)
        level0_test.go:1820: 
            	Error Trace:	level0_test.go:1820
            	Error:      	Target error should be in err chain:
            	            	expected: "context canceled"
            	            	in chain: "test.capnp:PingPong.echoNum: rpc: send message: rpc: connection closed"
            	            		"test.capnp:PingPong.echoNum: rpc: send message: rpc: connection closed"
            	            		"rpc: send message: rpc: connection closed"
            	            		"send message: rpc: connection closed"
            	            		"rpc: connection closed"
            	            		"connection closed"
            	Test:       	TestHandleReturn_regression/BootstrapWithExpiredContext
FAIL
exit status 1
FAIL	capnproto.org/go/capnp/v3/rpc	0.107s

When it hangs instead, it's on line 1816, which is a call to .Struct() on a result.

@lthibault, #320 fixes both failure modes for TestHandleReturn_regression.

A trace for the TestBootstrapReceiverAnswerRpc is below. It looks like the client is sending a finish message for the bootstrap it's done with, and then getting confused when it finally gets the return message because it's forgotten about the question entirely. Probably there's somewhere where we're being too aggressive about removing stuff from the questions table and need to leave some record of this. I think we should only ever remove stuff from the table in two places:

  1. handleReturn
  2. when sending the initial question fails (and thus we won't receive a return)

I'll pick this back up tomorrow, and start by auditing places where we remove stuff from questions.

( "Send"
, Message
    { union' =
        Message'bootstrap
          Bootstrap { questionId = 0 , deprecatedObjectId = Nothing }
    }
)
( "Send"
, Message
    { union' =
        Message'call
          Call
            { questionId = 1
            , target =
                MessageTarget
                  { union' =
                      MessageTarget'promisedAnswer
                        PromisedAnswer { questionId = 0 , transform = [] }
                  }
            , interfaceId = 13288942172782596300
            , methodId = 0
            , params =
                Payload
                  { content = Nothing
                  , capTable =
                      [ CapDescriptor
                          { attachedFd = 255
                          , union' =
                              CapDescriptor'receiverAnswer
                                PromisedAnswer { questionId = 0 , transform = [] }
                          }
                      ]
                  }
            , sendResultsTo =
                Call'sendResultsTo' { union' = Call'sendResultsTo'caller }
            , allowThirdPartyTailCall = False
            }
    }
)
( "Send"
, Message
    { union' =
        Message'finish Finish { questionId = 0 , releaseResultCaps = True }
    }
)
( "Recv"
, Message
    { union' =
        Message'return
          Return
            { answerId = 0
            , releaseParamCaps = False
            , union' =
                Return'results
                  Payload
                    { content = Nothing
                    , capTable =
                        [ CapDescriptor
                            { attachedFd = 255 , union' = CapDescriptor'senderHosted 0 }
                        ]
                    }
            }
    }
)
( "Send"
, Message
    { union' =
        Message'abort
          Exception
            { reason = "rpc: incoming return: question 0 does not exist"
            , obsoleteIsCallersFault = False
            , obsoleteDurability = 0
            , type_ = Exception'Type'failed
            }
    }
)

TestSendCancel failed in CI: https://github.com/capnproto/go-capnproto2/actions/runs/3238937452/jobs/5307722709#step:5:1483

...I think I've seen that before, so we should investigate that here as well.

#322 fixes TestBootstrapReceiverAnswerRpc.

#324 fixes TestRecvCancel. I'll tackle the rest tomorrow.

#327 fixes TestRecvAbort.

Closing, since all the relevant prs have been merged.