elastic / gmux

gmux supports serving net/http and grpc requests on a single port

handleGRPC can panic

simitt opened this issue

(*mux).handleGRPC(..) occasionally panics with the following error:

http: panic serving <ip:port>: send on closed channel
goroutine 3386993 [running]:
net/http.(*conn).serve.func1()
	net/http/server.go:1825 +0xbf
panic({0x557013b82be0, 0x557013f70d50})
	runtime/panic.go:844 +0x258
github.com/elastic/gmux.(*mux).handleGRPC(0x0?, 0x60?, {0x557013f99500?, 0xc0021f0cc0?}, 0xc003e135c0, {0x557013db38c0?, 0xc001fe1201?})
	github.com/elastic/gmux@v0.2.0/mux.go:300 +0x7d
github.com/elastic/gmux.(*mux).handleH2(0x20?, 0xc0000b8800?, {0x557013f98e20?, 0xc0007a0380?}, {0x557013f7a528, 0xc001fe1200})
	github.com/elastic/gmux@v0.2.0/mux.go:138 +0x519
github.com/elastic/gmux.ConfigureServer.func1(0xc002a3ab60, 0xc0007a0380, {0x557013f7a528?, 0xc001fe1200?})
	github.com/elastic/gmux@v0.2.0/mux.go:79 +0x51
net/http.(*conn).serve(0xc0027acb40, {0x557013f92370, 0xc004642180})
	net/http/server.go:1874 +0x1293
created by net/http.(*Server).Serve
	net/http/server.go:3071 +0x4db

I have not yet found how to reproduce it.

TLDR: In the rare case during shutdown where a new gRPC connection arrives after the grpcServer shutdown and before the httpServer shutdown, the user program panics.

Managed to reliably reproduce the issue.

2023/01/10 11:03:43 http: panic serving 127.0.0.1:35366: send on closed channel
goroutine 28987 [running]:
net/http.(*conn).serve.func1()
        /home/carson/sdk/go1.19.3/src/net/http/server.go:1850 +0xbf
panic({0x8e3d20, 0xa3beb0})
        /home/carson/sdk/go1.19.3/src/runtime/panic.go:890 +0x262
github.com/elastic/gmux.(*mux).handleGRPC(...)
        /home/carson/projects/gmux/mux.go:300
github.com/elastic/gmux.(*mux).withGRPCInsecure.func1({0xa409a0, 0xc00166a380}, 0xc001b28300)
        /home/carson/projects/gmux/mux.go:110 +0x568
net/http.HandlerFunc.ServeHTTP(0x0?, {0xa409a0?, 0xc00166a380?}, 0x46518e?)
        /home/carson/sdk/go1.19.3/src/net/http/server.go:2109 +0x2f
net/http.serverHandler.ServeHTTP({0xc00265a7b0?}, {0xa409a0, 0xc00166a380}, 0xc001b28300)
        /home/carson/sdk/go1.19.3/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc0015b01e0, {0xa410e0, 0xc0001184e0})
        /home/carson/sdk/go1.19.3/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
        /home/carson/sdk/go1.19.3/src/net/http/server.go:3102 +0x4db

This only happens in the rare case where the user program is shutting down and a new gRPC connection arrives between the grpcServer shutdown and the httpServer shutdown. It is caused by the order in which the code using gmux closes httpServer and grpcServer. There is no bug in gmux's mux.go implementation.

The program panics when the servers are closed in this order in user code:

grpcServer.GracefulStop()
httpServer.Shutdown(context.Background())

This is because grpcServer.GracefulStop() closes the grpc listener channel mux.grpcConns while httpServer is still happily accepting grpc connections. When a newly accepted connection is then sent to the closed channel mux.grpcConns, the program panics.

The solution is to reverse the closing order in user code, i.e.

httpServer.Shutdown(context.Background())
grpcServer.GracefulStop()

so that the gmux-configured httpServer stops accepting connections first.

@carsonip How is mux.grpcConns being closed? I don't see anything closing it in mux.go.

@axw It is not closed in mux.go. In the case of apm-server, it is closed in https://github.com/elastic/apm-server/blob/main/internal/beater/server.go#L222

@carsonip I don't mean closing the server, but the channel. You said:

This is because grpcServer.GracefulStop() closes the grpc listener channel mux.grpcConns

How does closing the server close the channel?

@axw Good question. It isn't immediately obvious.

In apm-server, on startup, we start the grpc server with

s.grpcServer.Serve(s.httpServer.grpcListener)

which calls into server.go in the grpc package:

func (s *Server) Serve(lis net.Listener) error {
	...
	ls := &listenSocket{Listener: lis}
	s.lis[ls] = true
	...
}

In apm-server, on shutdown,

s.grpcServer.GracefulStop()

which calls into server.go in the grpc package:

func (s *Server) GracefulStop() {
	...
	for lis := range s.lis {
		lis.Close()
	}
	...
}

which in turn calls chanListener.Close, because lis wraps a chanListener from gmux's conn.go, which implements net.Listener:

func (l *chanListener) Close() error {
	l.closeOnce.Do(func() {
		close(l.conns)
	})
	return nil
}

Thanks @carsonip! Makes sense now.

I think it would still be good to fix in gmux, as it feels like a bit of a footgun. Not urgent, but can we leave this open?

Yes, let's leave this open. Would like to know how we can fix this properly, otherwise this will be a point to document in readme.

People usually say "only writers should close the channel", but in this case gmux has no way of knowing whether the channel has been closed or should be closed, whether because of shutdown or for some other reason. We may have to add a way to signal shutdown to gmux so that it stops sending connections into the channel.

Could we just never close the chanListener.conns channel? Instead we could have a chanListener.closed channel which is closed by chanListener.Close, and is only ever received from. chanListener.Accept would select on both <-conns and <-closed. handleGRPC may need to do the same, not sure now.

I have a draft of a fix in gmux at #16 although it may be a little ugly. CI isn't happy and tests are timing out for the PR and even on main.