RedisLabs / sentinel_tunnel

infrastructure:
redis-01
redis-sentinel-01

redis-02
redis-sentinel-02

redis-03
redis-sentinel-03

sentinel-tunnel

procedure:

turning off the first host
sentinel-tunnel:
switched to another host, which has become master
bringing the first host back
sentinel-tunnel:
shows all nodes up and running
turning off the second host
sentinel-tunnel:
switched to the third host, which has become master
bringing the second host back
sentinel-tunnel:
shows all nodes up and running
turning off the third host
sentinel-tunnel:
does not reply at all; sentinel_tunnel has to be restarted, after which it works normally

sentinel_tunnel randomly stops answering after the 6th or 7th time a host running redis is turned off.
The logs are:

err: failed read line from client
dial tcp 10.1.10.160:26379: connect: connection refused

and then it hangs forever

There's a clear bug in st_sentinel_connection: it doesn't send any reply to a get_master_address_by_name request after successfully reconnecting to redis-sentinel:

sentinel_tunnel/st_sentinel_connection/st_sentinel_connection.go, line 104 in 20e718c (the continue statement):

	for db_name := range c.get_master_address_by_name {
		addr, err, is_client_closed := c.getMasterAddrByNameFromSentinel(db_name)
		if err != nil {
			if !is_client_closed {
				c.get_master_address_by_name_reply <- ...
			}
			if !c.reconnectToSentinel() {
				c.get_master_address_by_name_reply <- ...
			}
			continue
		}
		c.get_master_address_by_name_reply <-...
	}

No reply is sent in the err != nil -> c.reconnectToSentinel() == true -> continue case; the loop just skips on to the next read from c.get_master_address_by_name.

This blocks handleConnection until the next connection triggers a new lookup, and then either of the two waiting lookups may receive the new response. If the second lookup is for a different database, the original handleConnection may even end up connecting to the wrong database.
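One possible way to close that gap is to give the loop a single reply point, so that every request read from the channel produces exactly one reply. The following is only a sketch and not the project's code: lookupReply, serveLookups, demoRetry, the lookup and reconnect callbacks, and the address 10.0.0.2:6379 are all hypothetical stand-ins, the is_client_closed handling is left out for brevity, and retrying the lookup once after a successful reconnect is an assumption about the desired behaviour.

```go
package main

import (
	"errors"
	"fmt"
)

// lookupReply is a hypothetical stand-in for the reply value (address plus error).
type lookupReply struct {
	addr string
	err  error
}

// serveLookups sends exactly one reply for every name read from requests.
// lookup and reconnect are hypothetical stand-ins for
// getMasterAddrByNameFromSentinel and reconnectToSentinel.
func serveLookups(requests <-chan string, replies chan<- lookupReply,
	lookup func(string) (string, error), reconnect func() bool) {
	for name := range requests {
		addr, err := lookup(name)
		if err != nil && reconnect() {
			// Reconnected: retry once instead of silently dropping the request.
			addr, err = lookup(name)
		}
		// Single reply point: success, retry result, or the original error.
		replies <- lookupReply{addr: addr, err: err}
	}
}

// demoRetry simulates a sentinel that refuses the first lookup and
// answers the second one after a successful reconnect.
func demoRetry() lookupReply {
	requests := make(chan string)
	replies := make(chan lookupReply)
	calls := 0
	lookup := func(name string) (string, error) {
		calls++
		if calls == 1 {
			return "", errors.New("connection refused")
		}
		return "10.0.0.2:6379", nil // made-up master address
	}
	reconnect := func() bool { return true }

	go serveLookups(requests, replies, lookup, reconnect)
	requests <- "db1"
	r := <-replies
	close(requests)
	return r
}

func main() {
	r := demoRetry()
	fmt.Println(r.addr, r.err)
}
```

With a single send at the bottom of the loop body, no branch can fall through to the next request without answering the current one.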

I wouldn't use this in production, and definitely not with multiple databases configured, at the risk of applications connecting to / writing to the wrong database.

The <-c.get_master_address_by_name_reply construct in GetAddressByDbName doesn't look very safe anyway: there are no ordering guarantees on the reply channel reads. If two goroutines do concurrent lookups for different databases, each might randomly get either of the two lookup results, which might be for the wrong database:

sentinel_tunnel/st_sentinel_connection/st_sentinel_connection.go, line 137 in 20e718c:

func (c *Sentinel_connection) GetAddressByDbName(name string) (string, error) {
	c.get_master_address_by_name <- name
	reply := <-c.get_master_address_by_name_reply
	return reply.reply, reply.err
}

The only mitigating aspect here might be the use of unbuffered (blocking) channels, which means that the second c.get_master_address_by_name send might not proceed until retrieveAddressByDbName has sent the reply. The two goroutines calling GetAddressByDbName could still race on which of them reads from get_master_address_by_name_reply first, although the race window includes the getMasterAddrByNameFromSentinel network call, so that's unlikely. The blocking c.get_master_address_by_name_reply send might even preclude that race?

A better way to implement this kind of request-reply pattern is to have the client create and send a new reply chan, and have the handling goroutine close that chan.
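A minimal sketch of that pattern, under stated assumptions, could look as follows. The names lookupRequest, lookupResult, resolver, newResolver and demoConcurrent are hypothetical and not taken from sentinel_tunnel, and the lookup callback merely stands in for the sentinel query. Each request carries its own reply channel, so a result can only reach the goroutine that asked for it, and closing the channel means a caller is released even if the handler never sends a value.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// lookupResult is a hypothetical reply value.
type lookupResult struct {
	addr string
	err  error
}

// lookupRequest carries a private reply channel created by the caller.
type lookupRequest struct {
	name  string
	reply chan lookupResult
}

type resolver struct {
	requests chan lookupRequest
}

// newResolver starts the handling goroutine; lookup stands in for the sentinel query.
func newResolver(lookup func(string) (string, error)) *resolver {
	r := &resolver{requests: make(chan lookupRequest)}
	go func() {
		for req := range r.requests {
			addr, err := lookup(req.name)
			req.reply <- lookupResult{addr: addr, err: err} // buffered, never blocks
			close(req.reply)                                // handler closes the channel
		}
	}()
	return r
}

// GetAddressByDbName sends a request and waits on its own reply channel.
func (r *resolver) GetAddressByDbName(name string) (string, error) {
	req := lookupRequest{name: name, reply: make(chan lookupResult, 1)}
	r.requests <- req
	res, ok := <-req.reply
	if !ok {
		// Channel closed without a value: the handler dropped the request.
		return "", errors.New("lookup dropped without a reply")
	}
	return res.addr, res.err
}

// demoConcurrent runs concurrent lookups for different names and reports
// whether every caller received the result for its own name.
func demoConcurrent() bool {
	r := newResolver(func(name string) (string, error) {
		return "addr-of-" + name, nil
	})
	var wg sync.WaitGroup
	var mu sync.Mutex
	allMatch := true
	for _, n := range []string{"db1", "db2", "db3", "db4"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			addr, err := r.GetAddressByDbName(name)
			if err != nil || addr != "addr-of-"+name {
				mu.Lock()
				allMatch = false
				mu.Unlock()
			}
		}(n)
	}
	wg.Wait()
	close(r.requests) // all senders are done, so the handler can exit
	return allMatch
}

func main() {
	fmt.Println(demoConcurrent())
}
```

Because no reply channel is shared between callers, the question of which goroutine reads the shared reply channel first no longer arises.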

sentinel_tunnel stops working after several random host failures
