grpcurl fails with "context deadline exceeded" after 10s if using plaintext when server expects TLS
ucarion opened this issue · comments
Bottom line up front, here's how you reproduce this issue:
$ grpcurl -version
grpcurl 1.8.6
$ time grpcurl -plaintext grpcb.in:9001 list
Failed to dial target host "grpcb.in:9001": context deadline exceeded
grpcurl -plaintext grpcb.in:9001 list 0.02s user 0.03s system 0% cpu 10.082 total
For context, grpcb.in:9001 wants TLS; -plaintext is the problem. But the fact that grpcurl hangs for 10 seconds and does not produce an informative error is the subject of this GitHub issue. I suspect that the use of grpc.WithBlock() prevents an error from bubbling up, but I assume there's a good reason for using that dial option for some other purpose.
I suspect the issue may be use of grpc.WithBlock() prevents an error from bubbling up,
I doubt that. That's actually how you get any error to bubble up. Otherwise, you never get any sort of feedback from Dial, as it does the actual TCP connection setup completely asynchronously and only returns an error if there is some other configuration problem with the options.
The issue here is where it fails. In grpcurl.BlockingDial, we try to control both dialing and a potential TLS handshake so that we can intercept any errors (which the underlying gRPC Go runtime library hides from the application), in order to give a decent error to the user.
The issue here is actually that the connections are set up just fine -- all a plaintext connection cares about is getting the TCP connection. The other direction (using TLS in the client against a server that does not expect it) fails more cleanly: the TLS handshake fails, so the connection cannot be established and the error bubbles up from dialing.
So the actual error is happening inside the gRPC runtime when it tries to send the HTTP/2 preface to the server. In this case, the server is expecting a TLS handshake, but doesn't receive one. So the server immediately closes the connection. We're providing a grpc.FailOnNonTempDialError(true) dial option, in the hopes that something like this would be bubbled up from the dial call. But apparently the server suddenly closing the connection (without any known reason) is interpreted as a temporary error. So the runtime keeps retrying, creating a new connection over and over, never getting a healthy one that can be used for sending an RPC.
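To illustrate why the dial itself reports nothing, here is a minimal sketch using only the Go standard library. It does not use grpcurl or grpc-go; the local listener that accepts and immediately hangs up is a stand-in for a server that rejects the connection after it is established, and the name probeHangup is made up for this example. The TCP dial succeeds, and the failure only becomes visible on the first read.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// probeHangup (hypothetical helper) starts a local listener that accepts one
// connection and closes it right away, then dials that listener and attempts
// a read. It returns the error from connecting and the error from reading
// separately. A failure to start the listener is reported as the first value.
func probeHangup() (error, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err, nil
	}
	defer ln.Close()

	go func() {
		conn, err := ln.Accept()
		if err == nil {
			conn.Close() // hang up without sending anything
		}
	}()

	// The TCP handshake completes, so this is expected to succeed.
	conn, err := net.DialTimeout("tcp", ln.Addr().String(), 2*time.Second)
	if err != nil {
		return err, nil
	}
	defer conn.Close()

	// Only the first read reveals that the peer has gone away.
	conn.SetReadDeadline(time.Now().Add(2 * time.Second))
	_, readErr := conn.Read(make([]byte, 1))
	return nil, readErr
}

func main() {
	dialErr, readErr := probeHangup()
	fmt.Println("dial error:", dialErr)
	fmt.Println("read error:", readErr)
}
```

A dialer that only looks at the result of the connect step has nothing to report here, which is consistent with the behavior described above.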
A fix is possible, but it isn't simple. The custom dialer in grpcurl.BlockingDial will need to wrap the returned net.Conn so it has more visibility into connection closures. It could then (for example) fail fast if it sees repeated inexplicable hang-ups from the server, all before the grpc.Dial call completes (and it would have some sort of error to report, likely just "connection closed by peer").
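A rough sketch of what such a wrapper might look like, using only the standard library. This is not grpcurl's actual code: errTrackingConn, FirstErr, and demoPeerClose are hypothetical names, and a real fix would also need the fail-fast policy (counting repeated hang-ups and aborting the dial), which is left out here. net.Pipe stands in for a real TCP connection.

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

// errTrackingConn is a hypothetical wrapper (not part of grpcurl) that
// remembers the first error returned by Read or Write on the wrapped conn.
type errTrackingConn struct {
	net.Conn
	mu  sync.Mutex
	err error
}

// record stores err if it is the first non-nil error seen.
func (c *errTrackingConn) record(err error) {
	if err == nil {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.err == nil {
		c.err = err
	}
}

func (c *errTrackingConn) Read(p []byte) (int, error) {
	n, err := c.Conn.Read(p)
	c.record(err)
	return n, err
}

func (c *errTrackingConn) Write(p []byte) (int, error) {
	n, err := c.Conn.Write(p)
	c.record(err)
	return n, err
}

// FirstErr returns the first I/O error observed, or nil if there was none.
func (c *errTrackingConn) FirstErr() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.err
}

// demoPeerClose simulates a peer that hangs up immediately and returns the
// error captured by the wrapper.
func demoPeerClose() error {
	client, server := net.Pipe()
	defer client.Close()
	wrapped := &errTrackingConn{Conn: client}
	server.Close() // the "server" closes the connection right away
	_, _ = wrapped.Read(make([]byte, 1))
	return wrapped.FirstErr()
}

func main() {
	fmt.Println("first connection error:", demoPeerClose())
}
```

A custom dialer could return a wrapper like this and consult FirstErr when the dial times out, so that the reported error mentions the hang-up rather than only the deadline.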
The presence of a custom dialer does make this case unusual. In the past, I've just used the default dialer and matched against the returned error message, but I presume the custom dialer must remain as-is for other reasons.
I've just used the default dialer and matched against the returned error message
The custom dialer is actually only here to provide decent error messages. The "context deadline exceeded" error is what is coming from the grpc.Dial call, so "matched against the returned error message" wouldn't really help here. The custom dialers are only in place to intercept underlying network errors, so that we can use them to provide better error messages. The specific issue here is that the dialer is not instrumented to intercept all network errors -- we're missing out on whatever error is occurring after the connection is established, due to the server immediately closing the connection.
Yeah, sorry, I misspoke -- in the past I've matched against the RPC call error, rather than the dial error, for this situation. Whether an error is from dialing versus calling an RPC has always been confusing to me, and I suspect it's not even something stable across grpc-go versions.
we often ran into this problem with grpc-go clients, and the newer WithReturnConnectionError dial option is a nice alternative to WithBlock and FailOnNonTempDialError, because it bubbles up the underlying connection error. combined with some other recent improvements to the grpc-go client (i believe in v1.54.x), TLS handshake errors also show up now.
When attempting to use grpcurl to access a service deployed on an EC2 instance through a load balancer and target, using the command grpcurl -plaintext test.dev.xyz:9090 list, I encounter an error. The error message states: "Failed to dial target host 'test.dev.xyz:9090': context deadline exceeded."
Can anyone help me resolve this?
we often ran into this problem with grpc-go clients, and the newer WithReturnConnectionError dial option is a nice alternative to WithBlock and FailOnNonTempDialError, because it bubbles up the underlying connection error. combined with some other recent improvements to the grpc-go client (i believe in v1.54.x), TLS handshake errors also show up now.
This answer saved my day, thank you.
In my case the error was caused by an expired certificate, and I couldn't even retrieve the underlying error properly; WithBlock simply stops in case of any error.