sentinel_tunnel stops working after several random host failures
chukynax opened this issue · comments
infrastructure:
redis-01
redis-sentinel-01
redis-02
redis-sentinel-02
redis-03
redis-sentinel-03
sentinel-tunnel
procedure:
- turning off first host
sentinel-tunnel:
changed to another host has become master - returning back first host
sentinel-tunnel:
shows all nodes up and running - turning off second host
sentinel-tunnel:
changed to third host has become master - returning back second host
sentinel-tunnel:
shows all nodes up and running - turning off third host
sentinel-tunnel:
doesn't reply anything and restart of sentinel_tunnel is needed, after works normal
Randomly sentinel_tunnel stops answering after 6th or 7th try of turning off hosts with redis.
logs are:
err: failed read line from client
dial tcp 10.1.10.160:26379: connect: connection refused
and hangs forever
There's a clear bug in the st_sentinel_connection
where it doesn't return any reply to the get_master_address_by_name
request after successfully reconnecting to redis-sentinel:
for db_name := range c.get_master_address_by_name {
addr, err, is_client_closed := c.getMasterAddrByNameFromSentinel(db_name)
if err != nil {
if !is_client_closed {
c.get_master_address_by_name_reply <- ...
}
if !c.reconnectToSentinel() {
c.get_master_address_by_name_reply <- ...
}
continue
}
c.get_master_address_by_name_reply <-...
}
No reply is sent in the err != nill
-> c.reconnectToSentinel() == true
-> continue
case, it just skips on to the next read from c.get_master_address_by_name
.
This will block the handleConnection
until the next connection triggers a new lookup, and then either of the two lookups will get the new response. If the second lookup is for a different database, then the original handleConnection
may even end up connecting to the wrong database.
I wouldn't use this in production, and definitely not with multiple databases configured, at the risk of applications connecting to / writing to the wrong database.
The <-c.get_master_address_by_name_reply
construct in GetAddressByDbName
doesn't look very safe anyways, there's no ordering guarantees on the reply channel reads... if two goroutines do concurrent lookups for different databases, then they might randomly get either of the two lookup results, which might be for the wrong database:
func (c *Sentinel_connection) GetAddressByDbName(name string) (string, error) {
c.get_master_address_by_name <- name
reply := <-c.get_master_address_by_name_reply
return reply.reply, reply.err
}
The only mitigating aspect here might be the use of unbuffered/blocking channels, which mean that the second c.get_master_address_by_name
send might not proceed until the retrieveAddressByDbName
has sent the reply... but the two goroutines calling the blocking GetAddressByDbName
could still race on which gets to the get_master_address_by_name_reply
first, although the race window will include the getMasterAddrByNameFromSentinel
network call, so that's unlikely.c.get_master_address_by_name_reply
send might preclude that race?
A better way to implement this kind of request-reply pattern is to have the client create and send a new reply chan, and have the handling goroutine close that chan.