fullstorydev / grpcurl

Like cURL, but for gRPC: Command-line tool for interacting with gRPC servers

grpcurl fails with "context deadline exceeded" after 10s if using plaintext when server expects TLS

ucarion opened this issue

Bottom line up front, here's how you reproduce this issue:

$ grpcurl -version
grpcurl 1.8.6

$ time grpcurl -plaintext grpcb.in:9001 list
Failed to dial target host "grpcb.in:9001": context deadline exceeded
grpcurl -plaintext grpcb.in:9001 list  0.02s user 0.03s system 0% cpu 10.082 total

For context, grpcb.in:9001 wants TLS; -plaintext is the problem. But the subject of this GitHub issue is that grpcurl hangs for 10 seconds and does not produce an informative error. I suspect the issue may be that the use of grpc.WithBlock() prevents an error from bubbling up, but I assume there's a good reason for using that dial option for some other purpose.

I suspect the issue may be use of grpc.WithBlock() prevents an error from bubbling up,

I doubt that. That's actually how you get any error to bubble up. Otherwise, you never get any feedback from Dial, because it does the actual TCP connection setup completely asynchronously and only returns an error if there is some other configuration problem with the options.

The issue here is where it fails. In grpcurl.BlockingDial, we try to control both dialing and a potential TLS handshake so that we can intercept any errors (which the underlying gRPC Go runtime library hides from the application), in order to give a decent error to the user.

The issue here is actually that the connections are set up just fine: all a plaintext connection cares about is getting the TCP connection. The other direction (using TLS in the client against a server that does not expect it) fails more cleanly: the TLS handshake fails, so the connection cannot be established and the error bubbles up from dialing.

So the actual error is happening inside the gRPC runtime when it tries to send the HTTP/2 preface to the server. In this case, the server is expecting a TLS handshake but doesn't receive one, so it immediately closes the connection. We're providing a grpc.FailOnNonTempDialError(true) dial option in the hope that something like this would bubble up from the dial call. But apparently the server suddenly closing the connection (without any known reason) is interpreted as a temporary error. So the runtime keeps retrying, creating a new connection over and over, never getting a healthy one that can be used for sending an RPC.

A fix is possible, but it isn't simple. The custom dialer in grpcurl.BlockingDial will need to wrap the returned net.Conn so that it has more visibility into connection closures. It could then (for example) fail fast if it sees repeated inexplicable hang-ups from the server, all before the grpc.Dial call completes (and it would have some sort of error to report, likely just "connection closed by peer").

The presence of a custom dialer does make this case unusual. In the past, I've just used the default dialer and matched against the returned error message, but I presume the custom dialer must remain as-is for other reasons.

I've just used the default dialer and matched against the returned error message

The custom dialer is actually only here to provide decent error messages. The "context deadline exceeded" error is what is coming from the grpc.Dial call, so "matched against the returned error message" wouldn't really help here. The custom dialers are only in place to intercept underlying network errors, so that we can use them to provide better error messages. The specific issue here is that the dialer is not instrumented to intercept all network errors -- we're missing out on whatever error is occurring after the connection is established, due to the server immediately closing the connection.

Yeah, sorry, I misspoke -- in the past I've matched against the RPC call error, rather than the dial error, for this situation. Whether an error is from dialing versus calling an RPC has always been confusing to me, and I suspect it's not even something stable across grpc-go versions.