RedisLabs / sentinel_tunnel

A local proxy server for discovering and connecting to Redis via Sentinel

sentinel_tunnel stops working after several random host failures

chukynax opened this issue · comments

infrastructure:
redis-01
redis-sentinel-01

redis-02
redis-sentinel-02

redis-03
redis-sentinel-03

sentinel-tunnel

procedure:

  1. turning off first host
    sentinel-tunnel:
    switched to the host that became master
  2. bringing first host back
    sentinel-tunnel:
    shows all nodes up and running
  3. turning off second host
    sentinel-tunnel:
    switched to the third host, which became master
  4. bringing second host back
    sentinel-tunnel:
    shows all nodes up and running
  5. turning off third host
    sentinel-tunnel:
    doesn't reply at all; a restart of sentinel_tunnel is needed, after which it works normally

Randomly, sentinel_tunnel stops answering after the 6th or 7th round of turning off hosts running Redis.
logs are:

err: failed read line from client
dial tcp 10.1.10.160:26379: connect: connection refused

and hangs forever

There's a clear bug in st_sentinel_connection: it doesn't return any reply to the get_master_address_by_name request after successfully reconnecting to redis-sentinel:

	for db_name := range c.get_master_address_by_name {
		addr, err, is_client_closed := c.getMasterAddrByNameFromSentinel(db_name)
		if err != nil {
			if !is_client_closed {
				c.get_master_address_by_name_reply <- ...
			}
			if !c.reconnectToSentinel() {
				c.get_master_address_by_name_reply <- ...
			}
			continue
		}
		c.get_master_address_by_name_reply <-...
	}

No reply is sent in the err != nil -> c.reconnectToSentinel() == true -> continue case; it just skips to the next read from c.get_master_address_by_name.
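A minimal sketch of the fix for that case (type and field names here are illustrative, not the library's; the lookup and reconnect are stubbed): after a successful reconnect, retry the lookup and send whatever it yields, so every request gets exactly one reply instead of being silently dropped.

```go
package main

import (
	"errors"
	"fmt"
)

// getMasterReply mirrors the shape of the reply sent on the reply channel
// (hypothetical field names for illustration).
type getMasterReply struct {
	addr string
	err  error
}

type sentinelConn struct {
	requests chan string
	replies  chan getMasterReply
	healthy  bool // simulates whether the sentinel link is up
}

// lookup stands in for getMasterAddrByNameFromSentinel: it fails while the
// link is down and succeeds after a "reconnect".
func (c *sentinelConn) lookup(db string) (string, error) {
	if !c.healthy {
		return "", errors.New("connection refused")
	}
	return "10.1.10.160:6379", nil
}

func (c *sentinelConn) reconnect() bool {
	c.healthy = true // assume the sentinel came back
	return true
}

// retrieveLoop is the corrected loop: every request produces exactly one
// reply, including the err != nil -> reconnect-succeeded case, which retries
// the lookup instead of silently continuing.
func (c *sentinelConn) retrieveLoop() {
	for db := range c.requests {
		addr, err := c.lookup(db)
		if err != nil && c.reconnect() {
			addr, err = c.lookup(db) // retry once after reconnecting
		}
		c.replies <- getMasterReply{addr: addr, err: err}
	}
}

func main() {
	c := &sentinelConn{
		requests: make(chan string),
		replies:  make(chan getMasterReply),
		healthy:  false, // start with the sentinel link down
	}
	go c.retrieveLoop()

	c.requests <- "db0"
	r := <-c.replies // with the original code this receive would block forever
	fmt.Println(r.addr, r.err)
}
```

With the original loop, the `<-c.replies` receive would hang exactly as described in this issue; the retry-after-reconnect path unblocks it.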

This will block the handleConnection until the next connection triggers a new lookup, and then either of the two lookups will get the new response. If the second lookup is for a different database, then the original handleConnection may even end up connecting to the wrong database.

I wouldn't use this in production, and definitely not with multiple databases configured, at the risk of applications connecting to / writing to the wrong database.

The <-c.get_master_address_by_name_reply construct in GetAddressByDbName doesn't look very safe anyway: there are no ordering guarantees on the reply channel reads. If two goroutines do concurrent lookups for different databases, they might randomly get either of the two lookup results, which might be for the wrong database:

func (c *Sentinel_connection) GetAddressByDbName(name string) (string, error) {
	c.get_master_address_by_name <- name
	reply := <-c.get_master_address_by_name_reply
	return reply.reply, reply.err
}

The only mitigating aspect here might be the use of unbuffered/blocking channels, which means the second c.get_master_address_by_name send cannot proceed until retrieveAddressByDbName has sent the previous reply. But the two goroutines calling GetAddressByDbName could still race on which one reads from get_master_address_by_name_reply first. Since that race window spans the getMasterAddrByNameFromSentinel network call, it's unlikely in practice, and the blocking c.get_master_address_by_name_reply send might preclude the race entirely.

A better way to implement this kind of request-reply pattern is to have the client create and send a fresh reply chan with each request, and have the handling goroutine send the result on that chan and close it.
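That pattern can be sketched as follows (a minimal illustration, not the library's code; all names are hypothetical). Because each request carries its own reply channel, concurrent callers can never receive each other's results:

```go
package main

import "fmt"

// lookupRequest carries its own reply channel, so each caller receives
// exactly the reply to its own lookup; there is no shared reply channel
// to race on.
type lookupRequest struct {
	dbName string
	reply  chan lookupResult
}

type lookupResult struct {
	addr string
	err  error
}

// serve is the handling goroutine: it answers each request on that
// request's own channel, then closes it (the handler owns its lifetime).
func serve(requests chan lookupRequest) {
	for req := range requests {
		// Stubbed resolution: a real implementation would query Sentinel.
		res := lookupResult{addr: "addr-for-" + req.dbName}
		req.reply <- res
		close(req.reply)
	}
}

// getAddressByDbName creates a fresh reply channel per call; a buffer of 1
// lets the handler send without blocking on the caller.
func getAddressByDbName(requests chan lookupRequest, name string) (string, error) {
	reply := make(chan lookupResult, 1)
	requests <- lookupRequest{dbName: name, reply: reply}
	res := <-reply
	return res.addr, res.err
}

func main() {
	requests := make(chan lookupRequest)
	go serve(requests)

	a, _ := getAddressByDbName(requests, "db0")
	b, _ := getAddressByDbName(requests, "db1")
	fmt.Println(a, b) // addr-for-db0 addr-for-db1
}
```

Closing the per-request channel also gives the caller a clean signal if the handler shuts down without replying: the receive returns the zero value instead of blocking forever, which is exactly the failure mode this issue describes.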