net/http: HTTP/2 response body Close method sometimes returns spurious context cancelation error (1.17.3 regression)
cespare opened this issue · comments
What version of Go are you using (go version
)?
Go 1.17.3
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env
)?
linux/amd64
What did you do?
- Use an
http.Client
with (any)Timeout
set - Make an HTTP/2 request
- Read the body
- Close the body
Demo code:
package main
import (
"io"
"log"
"net/http"
"time"
)
func main() {
client := &http.Client{Timeout: time.Minute}
for i := 0; i < 20; i++ {
resp, err := client.Get("https://google.com")
if err != nil {
log.Fatal(err)
}
if _, err := io.Copy(io.Discard, resp.Body); err != nil {
log.Fatal(err)
}
if err := resp.Body.Close(); err != nil {
log.Fatalf("[i=%d] body close error: %s", i, err)
}
}
}
(I'm pinging google.com as an easy way to get an HTTP/2 request going.)
What did you expect to see?
No error, at least not on the resp.Body.Close()
call.
What did you see instead?
For example:
2021/11/04 15:49:08 [i=0] body close error: context canceled
Sometimes it takes a few iterations to hit an error.
This is HTTP/2-specific: with GODEBUG=http2client=0
it does not reproduce.
This does not occur in Go 1.17.2 but it started happening with 1.17.3.
I realize it's weird to be checking the resp.Body.Close
error in the first place. This came up in the context of twirp; I'm going to send a PR to them to stop checking the error. Even so, I don't think we should get this error in the client. It's especially confusing because we get the error immediately (so it's not related to the 1 minute timeout specifically).
I bisected on master to 7109323, which is CL 353870. The original x/net/http2 CL is CL 353390 and it was backported for Go 1.17.3 as CL 357683.
resp.Body.Close
wants to wait until the request stream has been finalized. (This ensures that, among other things, if we immediately send another request, the previous request no longer counts against the connection's concurrency limit.)
Finalizing the stream (in cleanupWriteRequest
) can write to the network (sending a stream reset. Network writes can take an arbitrarily long amount of time, so resp.Body.Close
waits until the cleanup happens or until the request is canceled, whichever happens first.
A http.Client
with a non-zero timeout wraps the response body in a *http.cancelTimerBody
. The cancelTimerBody
cancels the request context after reading any error, including io.EOF
, from the response body.
So if you read the full response, the request context is always canceled before resp.Body.Close
is called.
I'm not immediately certain what the right fix is: resp.Body.Close
waits for the stream cleanup because we have tests which expect that all stream state is cleaned up after the response body is closed. This seems like a reasonable assumption, and a good property to preserve. But we don't want to block indefinitely, so there needs to be some bound on how long resp.Body.Close
will take. The context lifetime seems like the perfect bound, but we apparently can't use it.
Naive suggestion from someone without context:
If cancelTimerBody.Close
would return context.Canceled
and it had previously called stop
, then could it know that the latter caused the former and return nil
as a special case?
That fixes the problem where we return context.Canceled
instead of nil
, but it doesn't fix the problem where we want to wait for the request to be cleaned up before returning from Close
.
@neild I'm not sure I'm following you. Altering whether the function returns nil
or context.Canceled
doesn't change how long the function waits before returning.
Or are you saying that the problem I'm noticing is a symptom of a deeper problem, which is that resp.Body.Close
doesn't wait for stream cleanup as intended if there's a timeout set?
Yes, that's right: resp.Body.Close
is supposed to be waiting for stream cleanup, but it's returning immediately when the body has been completely read.
This just hit us too; sorry for the +1 post. We reverted to the previous Go version because it's not feasible to change every instance where we check such errors (and not clear that's the right thing to do anyway). This feels like a pretty serious regression.
Change https://golang.org/cl/362354 mentions this issue: http2: avoid spurious context cancelation error from Response.Body.Close
Change https://golang.org/cl/361919 mentions this issue: net/http: do not cancel request context on response body read
We also saw this error in our logs when we upgraded to 1.17.3. We downgraded the log to warn level after seeing this issue to remove the noise.
This issue is serious for me as I have a case when the entire body is lost before being processed.
This is happening when using httputil.DumpResponse to log the response body before returning it.
For example: https://github.com/nexmoinc/gosrvlib/blob/main/pkg/httpclient/client.go#L95 but with the body param set to true.
For now I solved by adding a Context Timeout.
@gopherbot Please open backport issues for 1.16 and 1.17
Backport issue(s) opened: #49558 (for 1.16), #49559 (for 1.17).
Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.
Change https://golang.org/cl/368084 mentions this issue: [release-branch.go1.16] net/http: do not cancel request context on response body read
Change https://golang.org/cl/368085 mentions this issue: [release-branch.go1.17] net/http: do not cancel request context on response body read