Graceful HTTP Server (in Golang)
If you are building an HTTP service, occasional binary upgrades and configuration changes are almost unavoidable. It can cause serious problems if you do not realize the server should shut down/restart gracefully until the alarms sound!
Simply put, a graceful server/service should be capable of:
- ensuring all in-progress requests are handled properly or timed out;
- restarting itself without closing the listening socket, optionally with an upgraded binary or changed config.
This idea first came to me when one of my colleagues was talking about Nginx hot-reload, and then I found this blog post that explains it quite well. But when I tried to implement a basic version of it (here's my effort, grace), I realized there was still a lot to fill in, including some updates from the coming Go 1.8, so here comes this post to share my experiences :).
TL;DR
If all you need is to close the server regardless of the open connections, you can just kill the process with a standard unix signal. But to handle all requests received before the process exits, the signal has to be caught to trigger certain shutdown logic specified by the server.
A Go HTTP server runs as a forever-blocking goroutine (usually the main one), which internally performs an infinite loop in `func (srv *Server) Serve(l net.Listener) error` until an error occurs. So the shutdown logic should keep track of the completion of all open connections while stopping the main goroutine from exiting, which usually introduces another blocking call.
If a restart is required, just fork a new process that inherits the listening socket (through its file descriptor) and starts accepting connections on it before the shutdown.
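As a rough sketch of that fork step (the helper name, port, and environment variable below are illustrative, not taken from any particular package): the parent duplicates the listener's file descriptor and hands it to the child via `ExtraFiles`, and the child rebuilds the listener from fd 3 instead of calling `net.Listen`.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"os"
	"os/exec"
)

// forkChild re-executes the current binary and passes it the listening
// socket; the duplicated descriptor becomes fd 3 in the child.
func forkChild(ln *net.TCPListener) (*exec.Cmd, error) {
	f, err := ln.File() // dup(2) the listener's file descriptor
	if err != nil {
		return nil, err
	}
	cmd := exec.Command(os.Args[0], os.Args[1:]...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.ExtraFiles = []*os.File{f}                      // fd 3 in the child
	cmd.Env = append(os.Environ(), "GRACEFUL_CHILD=1") // tell the child to inherit
	return cmd, cmd.Start()
}

func main() {
	var ln net.Listener
	var err error
	if os.Getenv("GRACEFUL_CHILD") == "1" {
		// Child: rebuild the listener from the inherited descriptor.
		ln, err = net.FileListener(os.NewFile(3, "inherited-listener"))
	} else {
		// Parent: open the socket the usual way.
		ln, err = net.Listen("tcp", ":8080")
	}
	if err != nil {
		panic(err)
	}
	fmt.Println("serving, pid:", os.Getpid())
	// On an upgrade signal the parent would call forkChild(ln.(*net.TCPListener))
	// and then shut down gracefully; that wiring is omitted here for brevity.
	http.Serve(ln, nil)
}
```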
Server, Listener & Conn
Before diving into the details of the shutdown logic, let's first figure out how the Go HTTP server works (HTTPS works similarly).
Whether you start your server with `http.ListenAndServe` or `srv.ListenAndServe`, it all comes down to `srv.Serve(l)`:
```go
func (srv *Server) Serve(l net.Listener) error {
	defer l.Close()
	...
	for {
		rw, e := l.Accept()
		if e != nil {
			...
			return e
		}
		...
		c := srv.newConn(rw)
		c.setState(c.rwc, StateNew) // before Serve can return
		go c.serve(ctx)
	}
}
```
- `l.Accept()` waits for and returns the next connection (a `net.Conn`) to the listener; an error from it is the only way to break out of the loop;
- `srv.newConn(rw)` converts the `net.Conn` to an internal `conn` which wraps the `*Server` and the `net.Conn`;
- after setting the connection state, `go c.serve(ctx)` dispatches a goroutine to handle the connection.
The listener provides a `Close()` method that breaks the loop:
```go
type Listener interface {
	...
	// Close closes the listener.
	// Any blocked Accept operations will be unblocked and return errors.
	Close() error
}
```
Without other blocking code, the main goroutine returns once `srv.Serve(l)` returns, terminating the process along with all other goroutines, including those still handling open connections. This is the underlying reason why just killing the server is not graceful.
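To see this concretely, here is a tiny sketch (port, path, and timings are made up): once the listener is closed, `Serve` returns, `main` exits, and any handler goroutine still serving a request dies with the process.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		panic(err)
	}

	http.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(5 * time.Second) // pretend this is a long in-progress request
		fmt.Fprintln(w, "done")
	})

	// Close the listener after 2 seconds to break the Accept loop.
	time.AfterFunc(2*time.Second, func() { ln.Close() })

	// Serve returns as soon as the listener is closed; main then exits and
	// any goroutine still inside the /slow handler is killed abruptly.
	err = http.Serve(ln, nil)
	fmt.Println("Serve returned:", err)
}
```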
Graceful Shutdown
So, the problem of graceful shutdown reduces to making the main goroutine wait/block until all connections are properly handled or timed out. To do so, the server needs a way to track all the in-progress connections.
Periodic Polling in Go 1.8
The coming Go 1.8 ships with a graceful shutdown implementation (see this commit), which I think is worth looking at in detail.
First, let's look at the fields added to `Server` and `conn`:
```go
type Server struct {
	...
	inShutdown int32 // accessed atomically (non-zero means we're in Shutdown)

	mu         sync.Mutex
	listeners  map[net.Listener]struct{}
	activeConn map[*conn]struct{}
	doneChan   chan struct{}
}

type conn struct {
	...
	curState atomic.Value // of ConnState
}
```
The `Server` uses two maps to hold the listeners and active connections, and each `conn` now holds its internal state (before 1.8, `Server` only provided a `func(net.Conn, ConnState)` hook, invoked by `func (c *conn) setState(nc net.Conn, state ConnState)`). Every time the `ConnState` changes, the `activeConn` map tracks it:
```go
func (s *Server) trackConn(c *conn, add bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.activeConn == nil {
		s.activeConn = make(map[*conn]struct{})
	}
	if add {
		s.activeConn[c] = struct{}{}
	} else {
		delete(s.activeConn, c)
	}
}

func (c *conn) setState(nc net.Conn, state ConnState) {
	srv := c.server
	switch state {
	case StateNew:
		srv.trackConn(c, true)
	case StateHijacked, StateClosed:
		srv.trackConn(c, false)
	}
	c.curState.Store(connStateInterface[state])
	if hook := srv.ConnState; hook != nil {
		hook(nc, state)
	}
}
```
The `srv.Serve(l)` method now tracks the listeners (similar to `trackConn`, using the `listeners` map) and tries to identify the new `ErrServerClosed`:
```go
func (srv *Server) Serve(l net.Listener) error {
	...
	srv.trackListener(l, true)
	defer srv.trackListener(l, false)
	...
	for {
		rw, e := l.Accept()
		if e != nil {
			select {
			case <-srv.getDoneChan():
				return ErrServerClosed
			default:
			}
			...
```
Finally, `Server` exposes two APIs to either close (immediately) or shut down (gracefully) itself; the comments explain:
```go
// Close immediately closes all active net.Listeners and connections,
// regardless of their state. For a graceful shutdown, use Shutdown.
func (s *Server) Close() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.closeDoneChanLocked()
	err := s.closeListenersLocked()
	for c := range s.activeConn {
		c.rwc.Close()
		delete(s.activeConn, c)
	}
	return err
}
```
```go
// shutdownPollInterval is how often we poll for quiescence
// during Server.Shutdown. This is lower during tests, to
// speed up tests.
// Ideally we could find a solution that doesn't involve polling,
// but which also doesn't have a high runtime cost (and doesn't
// involve any contentious mutexes), but that is left as an
// exercise for the reader.
var shutdownPollInterval = 500 * time.Millisecond

// Shutdown gracefully shuts down the server without interrupting any
// active connections. Shutdown works by first closing all open
// listeners, then closing all idle connections, and then waiting
// indefinitely for connections to return to idle and then shut down.
// If the provided context expires before the shutdown is complete,
// then the context's error is returned.
func (s *Server) Shutdown(ctx context.Context) error {
	atomic.AddInt32(&s.inShutdown, 1)
	defer atomic.AddInt32(&s.inShutdown, -1)

	s.mu.Lock()
	lnerr := s.closeListenersLocked()
	s.closeDoneChanLocked()
	s.mu.Unlock()

	ticker := time.NewTicker(shutdownPollInterval)
	defer ticker.Stop()
	for {
		if s.closeIdleConns() {
			return lnerr
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```
`s.closeDoneChanLocked()` is used to signal an `ErrServerClosed`; `s.closeListenersLocked()` calls `l.Close()` for all `s.listeners`; and `s.closeIdleConns()` periodically scans through the states of all `s.activeConn`:
```go
// closeIdleConns closes all idle connections and reports whether the
// server is quiescent.
func (s *Server) closeIdleConns() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	quiescent := true
	for c := range s.activeConn {
		st, ok := c.curState.Load().(ConnState)
		if !ok || st != StateIdle {
			quiescent = false
			continue
		}
		c.rwc.Close()
		delete(s.activeConn, c)
	}
	return quiescent
}
```
In conclusion, Go 1.8 blocks when you call `srv.Shutdown(ctx)` explicitly and waits for each in-progress connection to complete by polling its state.
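For reference, this is roughly how an application wires it up (the address, timeout, and signal set are my own choices): run `ListenAndServe` in a goroutine, block on a termination signal, then call `Shutdown` with a deadline.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	go func() {
		// ErrServerClosed is the expected error after Shutdown or Close.
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("ListenAndServe: %v", err)
		}
	}()

	// Block the main goroutine until a termination signal arrives.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, os.Interrupt, syscall.SIGTERM)
	<-stop

	// Give in-progress requests up to 30 seconds to finish.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("Shutdown: %v", err)
	}
}
```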
Other Ways to Track & Block
tylerb/graceful solves the problem by hooking `ConnState` and using channels extensively to avoid most mutexes: connections are still tracked in a map, and the process blocks on channel receives (inside `srv.shutdown`):
```go
// Serve is equivalent to http.Server.Serve with graceful shutdown enabled.
func (srv *Server) Serve(listener net.Listener) error {
	...
	srv.Server.ConnState = func(conn net.Conn, state http.ConnState) {
		switch state {
		case http.StateNew:
			add <- conn
		case http.StateActive:
			active <- conn
		case http.StateIdle:
			idle <- conn
		case http.StateClosed, http.StateHijacked:
			remove <- conn
		}

		srv.stopLock.Lock()
		defer srv.stopLock.Unlock()

		if srv.ConnState != nil {
			srv.ConnState(conn, state)
		}
	}
	...
	go srv.handleInterrupt(interrupt, quitting, listener)

	// Serve with graceful listener.
	// Execution blocks here until listener.Close() is called, above.
	err := srv.Server.Serve(listener)
	...
	srv.shutdown(shutdown, kill)
```
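For completeness, application code typically hands everything to the package's `Run` helper; the call below follows the tylerb/graceful README, so double-check it against the version you use.

```go
package main

import (
	"net/http"
	"time"

	"github.com/tylerb/graceful"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello"))
	})

	// Run installs the ConnState hook shown above, listens for SIGINT/SIGTERM,
	// and gives open connections up to 10 seconds to finish once a signal arrives.
	graceful.Run(":8080", 10*time.Second, mux)
}
```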
The other solution is to utilize `sync.WaitGroup`: each accepted `net.Conn` calls `wg.Add(1)` and each call to `c.Close()` triggers `wg.Done()`, which is explained in the blog post above and used by package endless. It requires additional wrappers for `net.Listener` and `net.Conn`, plus contentious mutexes. It can also be a problem when a connection is hijacked (through the `Hijacker` interface, which bypasses all the cleanup, including `c.Close()`).
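A rough sketch of that wrapping (type and field names are illustrative, not copied from endless): after closing the listener, the server would call `wg.Wait()` to block until every tracked connection has been closed.

```go
// Package gracewrap is a hypothetical name for this sketch.
package gracewrap

import (
	"net"
	"sync"
)

// gracefulListener counts every accepted connection on a WaitGroup.
type gracefulListener struct {
	net.Listener
	wg *sync.WaitGroup
}

func (l *gracefulListener) Accept() (net.Conn, error) {
	c, err := l.Listener.Accept()
	if err != nil {
		return nil, err
	}
	l.wg.Add(1)
	return &gracefulConn{Conn: c, wg: l.wg}, nil
}

// gracefulConn decrements the WaitGroup exactly once when closed.
type gracefulConn struct {
	net.Conn
	wg   *sync.WaitGroup
	once sync.Once
}

func (c *gracefulConn) Close() error {
	err := c.Conn.Close()
	c.once.Do(c.wg.Done)
	return err
}
```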