Too many Cluster redirections when Azure Redis Cache has a failover

Question

Toritos01 opened this issue 3 months ago

Hello, there seems to be an issue with ioredis 5.3.2 when connected to an Azure Redis Cache. With two shards, everything was mostly working fine at first. However, I noticed that when one shard's primary node fails over (this happens periodically in Azure Redis Cache for maintenance), it completely breaks the ioredis connection of any app instance that was connected at the time. The only fix I have found is to use the "Reboot" feature in the Azure portal to make the other shard fail over as well, after which both of them somehow recover.

I can consistently reproduce this failure by using the Azure portal "Reboot" to make one shard's primary node fail over. When this happens, the shard that failed over completely stops receiving commands from my app (I can tell by running the MONITOR command in the Redis Console for that shard). I can also tell that this is not an Azure Redis Cache issue, because when I run a different test app, some of its commands successfully reach the shard that failed over. This also tells me that the shard is operational again after the failover, but ioredis still will not connect to it.

This looks to me like ioredis gets stuck in a redirect loop and cannot find its way back to the shard that failed over. The ioredis errors that I see look like this:
{"name":"Error","message":"Too many Cluster redirections. Last error: ReplyError: MOVED <IP>","stack":"Error: Too many Cluster redirections. Last error: ReplyError: MOVED <IP>"}
or sometimes:
{"name":"Error","message":"Too many Cluster redirections. Last error: Error: Connection is closed.","stack":"Error: Too many Cluster redirections. Last error: Error: Connection is closed."}
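To tell the two variants apart in logs, something like the following could be used. This is only a sketch: classifyRedirectionError is a hypothetical helper, not part of ioredis, and it assumes the wrapper message always starts with the prefix seen in the errors above and that an unredacted MOVED reply has the standard "MOVED <slot> <host>:<port>" form.

```javascript
// Hypothetical helper (not an ioredis API): pulls the "Last error" text out of
// the wrapper message and labels it as a MOVED redirect or a closed connection.
function classifyRedirectionError(message) {
  const prefix = 'Too many Cluster redirections. Last error: ';
  if (typeof message !== 'string' || !message.startsWith(prefix)) {
    return { kind: 'other', detail: null };
  }
  const detail = message.slice(prefix.length);

  // Assumption: a full MOVED reply looks like "MOVED <slot> <host>:<port>".
  const moved = /MOVED\s+(\d+)\s+(\S+):(\d+)/.exec(detail);
  if (moved) {
    return {
      kind: 'moved',
      detail,
      slot: Number(moved[1]),
      host: moved[2],
      port: Number(moved[3]),
    };
  }
  // Redacted or partial MOVED text, as in the log line above.
  if (/MOVED/.test(detail)) {
    return { kind: 'moved', detail };
  }
  if (/Connection is closed/.test(detail)) {
    return { kind: 'connection-closed', detail };
  }
  return { kind: 'unknown', detail };
}
```

Counting how often each kind appears after a failover might show whether the client keeps being redirected to a stale address or simply cannot reconnect.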

Some additional context on my setup:
-Based on the logs, the app I am testing with is mostly sending out "rpop" commands repeatedly
-ioredis 5.3.2
-Node version 18
-Clustering enabled with 2 shards
-Azure Redis Cache v6
-SSL enabled, and using SSL port (6380)
-No multi-key operations across shards are occurring
-The constructor below is what I use to access the cluster:
const redis = require('ioredis');

// <hostname> and <password> are placeholders for the real values.
const redisClient = new redis.Cluster([
  {
    port: 6380,
    host: <hostname>,
  },
], {
  slotsRefreshTimeout: 50000,
  // Pass the address through unchanged instead of resolving it to an IP.
  dnsLookup: (address, callback) => callback(null, address),
  showFriendlyErrorStack: true,
  redisOptions: {
    port: 6380,
    host: <hostname>,
    password: <password>,
    connectTimeout: 20000,
    enableReadyCheck: true,
    maxRetriesPerRequest: 3,
    enableOfflineQueue: true,
    enableAutoPipelining: true,
    autoPipeliningIgnoredCommands: ['ping'],
    tls: {
      servername: <hostname>,
    },
  },
});

I would mainly like to know if anyone knows why this happens, or any possible workarounds.
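One direction that might be worth trying is tuning the cluster-level retry and slot-refresh options. The sketch below is untested against Azure Redis Cache and the values are guesses. The option names (slotsRefreshInterval, retryDelayOnMoved, retryDelayOnFailover, retryDelayOnClusterDown, maxRedirections) are existing ioredis Cluster options, while buildClusterOptions is a hypothetical helper introduced only for illustration.

```javascript
// Sketch only: values are guesses and have not been verified to fix the issue.
// buildClusterOptions is a hypothetical helper, not an ioredis API.
function buildClusterOptions(hostname, password) {
  return {
    // Refresh the slot map on a timer, not only after redirections.
    slotsRefreshInterval: 2000,
    slotsRefreshTimeout: 50000,
    // Wait before retrying after a MOVED reply (ioredis default is 0 ms).
    retryDelayOnMoved: 200,
    // Wait before retrying when the target node's connection is closed.
    retryDelayOnFailover: 500,
    // Wait before retrying when the cluster reports CLUSTERDOWN.
    retryDelayOnClusterDown: 500,
    // Allow more redirections before giving up (ioredis default is 16).
    maxRedirections: 32,
    // Keep the hostname as-is so the TLS servername check still matches.
    dnsLookup: (address, callback) => callback(null, address),
    redisOptions: {
      password,
      connectTimeout: 20000,
      tls: { servername: hostname },
    },
  };
}

// Usage, assuming ioredis is installed:
// const Redis = require('ioredis');
// const client = new Redis.Cluster(
//   [{ host: hostname, port: 6380 }],
//   buildClusterOptions(hostname, password)
// );
```

If a delay on MOVED gives the slot map time to refresh before the next attempt, the redirect budget may no longer be exhausted, but that is an assumption I have not confirmed.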

TL;DR:
Routine Azure Redis Cache failovers cause the "Too many Cluster redirections" error in a service that is running at the time of the failover: ioredis stops sending commands to the shard that failed over, even after that shard recovers.