Graceful HTTP Server (in Golang)
If you are building an HTTP service, occasional binary upgrades and configuration changes are almost unavoidable. It can cause serious problems if you do not realize the server should shut down/restart gracefully until the alarms sound!
Simply put, a graceful server/service should be capable of:
- ensuring all in-progress requests are handled properly or timed out;
- restarting itself without closing the listening socket, optionally with an upgraded binary or changed config.
This idea first came to me when one of my colleagues was talking about Nginx hot-reload, and then I found this blog post that explains it quite well. But when I tried to implement a basic version of it (here's my effort, grace), I realized there was still a lot to fill in, including some updates from the coming Go 1.8, so here comes this post to share my experiences :).
TL;DR
If all you need is to close the server regardless of the open connections, you can just kill the process with a standard unix signal. But to handle all requests received before the process exits, the signal has to be caught to trigger certain shutdown logic specified by the server.
A Go HTTP server runs as a forever-blocking goroutine (usually the main one), which internally performs an infinite loop in `func (srv *Server) Serve(l net.Listener) error` until an error occurs. So the shutdown logic should keep track of the completion of all open connections while stopping the main goroutine from exiting, which usually introduces another blocking call.
If a restart is required, just fork a new process that inherits the listening socket (through its file descriptor) and starts accepting connections on it before the shutdown.
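As a rough sketch of that fork step (the helper name, port, and environment variable below are illustrative, not taken from any particular package): the parent duplicates the listener's file descriptor and hands it to the child via `ExtraFiles`, and the child rebuilds the listener from fd 3 instead of calling `net.Listen`.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"os"
	"os/exec"
)

// forkChild re-executes the current binary and passes it the listening
// socket; the duplicated descriptor becomes fd 3 in the child.
func forkChild(ln *net.TCPListener) (*exec.Cmd, error) {
	f, err := ln.File() // dup(2) the listener's file descriptor
	if err != nil {
		return nil, err
	}
	cmd := exec.Command(os.Args[0], os.Args[1:]...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.ExtraFiles = []*os.File{f}                      // fd 3 in the child
	cmd.Env = append(os.Environ(), "GRACEFUL_CHILD=1") // tell the child to inherit
	return cmd, cmd.Start()
}

func main() {
	var ln net.Listener
	var err error
	if os.Getenv("GRACEFUL_CHILD") == "1" {
		// Child: rebuild the listener from the inherited descriptor.
		ln, err = net.FileListener(os.NewFile(3, "inherited-listener"))
	} else {
		// Parent: open the socket the usual way.
		ln, err = net.Listen("tcp", ":8080")
	}
	if err != nil {
		panic(err)
	}
	fmt.Println("serving, pid:", os.Getpid())
	// On an upgrade signal the parent would call forkChild(ln.(*net.TCPListener))
	// and then shut down gracefully; that wiring is omitted here for brevity.
	http.Serve(ln, nil)
}
```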
Server, Listener & Conn
Before diving into the details of the shutdown logic, let's first figure out how the Go HTTP server works (HTTPS works similarly).
Whether you start your server with `http.ListenAndServe` or `srv.ListenAndServe`, it all comes down to `srv.Serve(l)`:
```go
func (srv *Server) Serve(l net.Listener) error {
	defer l.Close()
	...
	for {
		rw, e := l.Accept()
		if e != nil {
			...
			return e
		}
		...
		c := srv.newConn(rw)
		c.setState(c.rwc, StateNew) // before Serve can return
		go c.serve(ctx)
	}
}
```
- `l.Accept()` waits for and returns the next connection (a `net.Conn`) to the listener; an error from it is the only way to break out of the loop;
- `srv.newConn(rw)` converts the `net.Conn` to an internal `conn` which wraps the `*Server` and the `net.Conn`;
- after setting the connection state, `go c.serve(ctx)` dispatches a goroutine to handle the connection.
The listener provides a `Close()` method that breaks the loop:
```go
type Listener interface {
	...
	// Close closes the listener.
	// Any blocked Accept operations will be unblocked and return errors.
	Close() error
}
```
Without other blocking code, the main goroutine returns once `srv.Serve(l)` returns, terminating the process along with all other goroutines, including those still handling open connections. This is the underlying reason why just killing the server is not graceful.
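To see this concretely, here is a tiny sketch (port, path, and timings are made up): once the listener is closed, `Serve` returns, `main` exits, and any handler goroutine still serving a request dies with the process.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		panic(err)
	}

	http.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(5 * time.Second) // pretend this is a long in-progress request
		fmt.Fprintln(w, "done")
	})

	// Close the listener after 2 seconds to break the Accept loop.
	time.AfterFunc(2*time.Second, func() { ln.Close() })

	// Serve returns as soon as the listener is closed; main then exits and
	// any goroutine still inside the /slow handler is killed abruptly.
	err = http.Serve(ln, nil)
	fmt.Println("Serve returned:", err)
}
```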
Graceful Shutdown
So, the problem of graceful shutdown reduces to making the main goroutine wait/block until all connections are properly handled or timed out. To do so, the server needs a way to track all the in-progress connections.
Periodic Polling in Go 1.8
The coming Go 1.8 ships with a graceful shutdown implementation (see this commit), which I think is worth looking at in detail.
First, let's look at the fields added to `Server` and `conn`:
```go
type Server struct {
	...
	inShutdown int32 // accessed atomically (non-zero means we're in Shutdown)

	mu         sync.Mutex
	listeners  map[net.Listener]struct{}
	activeConn map[*conn]struct{}
	doneChan   chan struct{}
}

type conn struct {
	...
	curState atomic.Value // of ConnState
}
```
The `Server` uses two maps to hold the listeners and active connections, and each `conn` now holds its internal state (before 1.8, `Server` only provided a `func(net.Conn, ConnState)` hook, invoked by `func (c *conn) setState(nc net.Conn, state ConnState)`). Every time the `ConnState` changes, the `activeConn` map tracks it:
```go
func (s *Server) trackConn(c *conn, add bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.activeConn == nil {
		s.activeConn = make(map[*conn]struct{})
	}
	if add {
		s.activeConn[c] = struct{}{}
	} else {
		delete(s.activeConn, c)
	}
}

func (c *conn) setState(nc net.Conn, state ConnState) {
	srv := c.server
	switch state {
	case StateNew:
		srv.trackConn(c, true)
	case StateHijacked, StateClosed:
		srv.trackConn(c, false)
	}
	c.curState.Store(connStateInterface[state])
	if hook := srv.ConnState; hook != nil {
		hook(nc, state)
	}
}
```
The `srv.Serve(l)` method now tracks the listeners (similar to `trackConn`, using the `listeners` map) and tries to identify the new `ErrServerClosed`:
```go
func (srv *Server) Serve(l net.Listener) error {
	...
	srv.trackListener(l, true)
	defer srv.trackListener(l, false)
	...
	for {
		rw, e := l.Accept()
		if e != nil {
			select {
			case <-srv.getDoneChan():
				return ErrServerClosed
			default:
			}
			...
```
Finally, `Server` exposes two APIs to either close (immediately) or shut down (gracefully) itself; the comments explain:
```go
// Close immediately closes all active net.Listeners and connections,
// regardless of their state. For a graceful shutdown, use Shutdown.
func (s *Server) Close() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.closeDoneChanLocked()
	err := s.closeListenersLocked()
	for c := range s.activeConn {
		c.rwc.Close()
		delete(s.activeConn, c)
	}
	return err
}
```
```go
// shutdownPollInterval is how often we poll for quiescence
// during Server.Shutdown. This is lower during tests, to
// speed up tests.
// Ideally we could find a solution that doesn't involve polling,
// but which also doesn't have a high runtime cost (and doesn't
// involve any contentious mutexes), but that is left as an
// exercise for the reader.
var shutdownPollInterval = 500 * time.Millisecond

// Shutdown gracefully shuts down the server without interrupting any
// active connections. Shutdown works by first closing all open
// listeners, then closing all idle connections, and then waiting
// indefinitely for connections to return to idle and then shut down.
// If the provided context expires before the shutdown is complete,
// then the context's error is returned.
func (s *Server) Shutdown(ctx context.Context) error {
	atomic.AddInt32(&s.inShutdown, 1)
	defer atomic.AddInt32(&s.inShutdown, -1)

	s.mu.Lock()
	lnerr := s.closeListenersLocked()
	s.closeDoneChanLocked()
	s.mu.Unlock()

	ticker := time.NewTicker(shutdownPollInterval)
	defer ticker.Stop()
	for {
		if s.closeIdleConns() {
			return lnerr
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```
`s.closeDoneChanLocked()` is used to signal an `ErrServerClosed`; `s.closeListenersLocked()` calls `l.Close()` for all `s.listeners`; and `s.closeIdleConns()` periodically scans through the states of all `s.activeConn`:
```go
// closeIdleConns closes all idle connections and reports whether the
// server is quiescent.
func (s *Server) closeIdleConns() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	quiescent := true
	for c := range s.activeConn {
		st, ok := c.curState.Load().(ConnState)
		if !ok || st != StateIdle {
			quiescent = false
			continue
		}
		c.rwc.Close()
		delete(s.activeConn, c)
	}
	return quiescent
}
```
In conclusion, Go 1.8 blocks when you call `srv.Shutdown(ctx)` explicitly and waits for each in-progress connection to complete by polling its state.
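For reference, this is roughly how an application wires it up (the address, timeout, and signal set are my own choices): run `ListenAndServe` in a goroutine, block on a termination signal, then call `Shutdown` with a deadline.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	go func() {
		// ErrServerClosed is the expected error after Shutdown or Close.
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("ListenAndServe: %v", err)
		}
	}()

	// Block the main goroutine until a termination signal arrives.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, os.Interrupt, syscall.SIGTERM)
	<-stop

	// Give in-progress requests up to 30 seconds to finish.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("Shutdown: %v", err)
	}
}
```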
Other Ways to Track & Block
tylerb/graceful solves the problem by hooking `ConnState` and using channels extensively to avoid most mutexes: connections are still tracked in a map, and the process blocks on channel receives (inside `srv.shutdown`):
```go
// Serve is equivalent to http.Server.Serve with graceful shutdown enabled.
func (srv *Server) Serve(listener net.Listener) error {
	...
	srv.Server.ConnState = func(conn net.Conn, state http.ConnState) {
		switch state {
		case http.StateNew:
			add <- conn
		case http.StateActive:
			active <- conn
		case http.StateIdle:
			idle <- conn
		case http.StateClosed, http.StateHijacked:
			remove <- conn
		}

		srv.stopLock.Lock()
		defer srv.stopLock.Unlock()

		if srv.ConnState != nil {
			srv.ConnState(conn, state)
		}
	}
	...
	go srv.handleInterrupt(interrupt, quitting, listener)

	// Serve with graceful listener.
	// Execution blocks here until listener.Close() is called, above.
	err := srv.Server.Serve(listener)
	...
	srv.shutdown(shutdown, kill)
```
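For completeness, application code typically hands everything to the package's `Run` helper; the call below follows the tylerb/graceful README, so double-check it against the version you use.

```go
package main

import (
	"net/http"
	"time"

	"github.com/tylerb/graceful"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello"))
	})

	// Run installs the ConnState hook shown above, listens for SIGINT/SIGTERM,
	// and gives open connections up to 10 seconds to finish once a signal arrives.
	graceful.Run(":8080", 10*time.Second, mux)
}
```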
The other solution is to utilize `sync.WaitGroup`: each accepted `net.Conn` calls `wg.Add(1)` and each call to `c.Close()` triggers `wg.Done()`, which is explained in the blog post above and used by package endless. It requires additional wrappers for `net.Listener` and `net.Conn`, plus contentious mutexes. It can also be a problem when a connection is hijacked (through the `Hijacker` interface, which bypasses all the cleanup, including `c.Close()`).
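A rough sketch of that wrapping (type and field names are illustrative, not copied from endless): after closing the listener, the server would call `wg.Wait()` to block until every tracked connection has been closed.

```go
// Package gracewrap is a hypothetical name for this sketch.
package gracewrap

import (
	"net"
	"sync"
)

// gracefulListener counts every accepted connection on a WaitGroup.
type gracefulListener struct {
	net.Listener
	wg *sync.WaitGroup
}

func (l *gracefulListener) Accept() (net.Conn, error) {
	c, err := l.Listener.Accept()
	if err != nil {
		return nil, err
	}
	l.wg.Add(1)
	return &gracefulConn{Conn: c, wg: l.wg}, nil
}

// gracefulConn decrements the WaitGroup exactly once when closed.
type gracefulConn struct {
	net.Conn
	wg   *sync.WaitGroup
	once sync.Once
}

func (c *gracefulConn) Close() error {
	err := c.Conn.Close()
	c.once.Do(c.wg.Done)
	return err
}
```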