AlexPikalov / cdrs

Cassandra DB native client written in Rust. Find 1.x versions at https://github.com/AlexPikalov/cdrs/tree/v.1.x. Looking for an async version? Check the WIP at https://github.com/AlexPikalov/cdrs-async.

RoundRobin Load Balancing: when a node is down, 1/N of all requests always fail.

lseelenbinder opened this issue

Due to how RoundRobin(Sync) is configured, whenever one of the backing nodes is down because of outages or maintenance, all requests that would be routed to that R2D2 pool fail (because that pool has no live connections and cannot create any more).
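
To make the failure mode concrete, here is a minimal sketch (not the actual cdrs code) of a health-unaware round-robin over per-node pools:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Minimal sketch, not the actual cdrs implementation: one pool per node,
// handed out in strict rotation with no awareness of node health.
struct RoundRobinSketch<P> {
    pools: Vec<P>, // one r2d2-style pool per configured node
    next: AtomicUsize,
}

impl<P> RoundRobinSketch<P> {
    fn next_pool(&self) -> &P {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.pools.len();
        // If the node behind pools[i] is down, the pool has no live
        // connections and cannot create new ones, so every request that
        // lands on this slot (1/N of the total) fails.
        &self.pools[i]
    }
}
```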

In my opinion, this is a blocking bug for using the RoundRobin load-balancing mechanism, since it removes any possibility of failover to another node without implementing somewhat complex logic in the client.

Was this a known limitation I overlooked, or should we look into adjusting the implementation so that the collection of known nodes is used equally and, when one is down, the others can still be used?
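
For what it's worth, one possible shape of that adjustment, sketched with a hypothetical is_alive health check (not an existing cdrs or r2d2 API):

```rust
// Sketch of the suggested behaviour: rotate over all known nodes, but skip
// the ones that are currently down. `is_alive` is a hypothetical health
// check used only for illustration.
fn next_live_pool<'a, P>(
    pools: &'a [P],
    start: usize,
    is_alive: impl Fn(&P) -> bool,
) -> Option<&'a P> {
    (0..pools.len())
        .map(|offset| &pools[(start + offset) % pools.len()])
        .find(|&pool| is_alive(pool))
}
```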

Hi @lseelenbinder,
Yes, it was overlooked. I will try to come up with a solution for that bug.
Thanks for reporting it.

No problem, @AlexPikalov.

After I realized what was happening, I knew it was a bit of an edge case that wouldn't be easy to replicate accidentally during testing, but would be quite common in production, because machines are always coming and going during maintenance.

We're just going to revert to SingleNode and use HAProxy to load balance the actual instances, so this isn't a blocker for us to go into production.

@lseelenbinder
Good to know that it doesn't block you. However, I think this is a good occasion to implement a feature that was requested almost two years ago in #113. The solution may be based on Cassandra server events, namely the topology change event for a removed node. If the load balancer removes a node in reaction to this event, it should help avoid the situation where the load balancer returns a dead node.
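
Roughly, the idea looks like this (illustrative types only, not the actual cdrs event API):

```rust
use std::net::SocketAddr;
use std::sync::Mutex;

// Illustrative types only; cdrs exposes server events differently.
enum TopologyChange {
    NewNode(SocketAddr),
    RemovedNode(SocketAddr),
}

struct DynamicNodeList {
    nodes: Mutex<Vec<SocketAddr>>, // nodes currently eligible for load balancing
}

impl DynamicNodeList {
    // React to a topology change pushed by the server: drop removed nodes so
    // the balancer never hands out a connection to a dead one, and add new
    // nodes as they join.
    fn on_topology_change(&self, event: TopologyChange) {
        let mut nodes = self.nodes.lock().unwrap();
        match event {
            TopologyChange::RemovedNode(addr) => nodes.retain(|n| *n != addr),
            TopologyChange::NewNode(addr) => {
                if !nodes.contains(&addr) {
                    nodes.push(addr);
                }
            }
        }
    }
}
```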

@AlexPikalov, that's a great idea!

My only concern is keeping the ability to limit which nodes a specific config would ever connect to, regardless of added or removed nodes (even if that means it has no live nodes to talk to).

Hi @lseelenbinder,
I've just completed a draft implementation of dynamic clusters in #313.

These changes remove a dead node from cluster load balancing based on a Topology Change event received from a node. I'm about to test it, but I would really appreciate it if you could check from your end whether it solves your case. Here is the new session factory function that includes this logic: https://github.com/AlexPikalov/cdrs/blob/feat/113/src/cluster/session.rs#L236.

Compared to new, it has an extra argument event_src: NodeTcpConfig<'a, A>, which is the configuration for the node that will be used as the event source.
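
A rough usage sketch based on the above; new_dynamic_session is only a placeholder for the actual factory name from the branch, and the builder calls may differ slightly from the released API:

```rust
// Rough usage sketch only. `new_dynamic_session` is a placeholder for the
// factory added in #313 (the real name may differ); the rest follows cdrs's
// usual cluster-config builders.
use cdrs::authenticators::NoneAuthenticator;
use cdrs::cluster::{ClusterTcpConfig, NodeTcpConfigBuilder};

fn example() {
    let nodes = ClusterTcpConfig(vec![
        NodeTcpConfigBuilder::new("10.0.0.1:9042", NoneAuthenticator {}).build(),
        NodeTcpConfigBuilder::new("10.0.0.2:9042", NoneAuthenticator {}).build(),
    ]);
    // The extra argument compared to `new`: the node used as the event source.
    let event_src = NodeTcpConfigBuilder::new("10.0.0.1:9042", NoneAuthenticator {}).build();
    // let session = new_dynamic_session(&nodes, cdrs::load_balancing::RoundRobin::new(), event_src)
    //     .expect("session should be created");
}
```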

So far I've found some issues with the proposed solution. I'm fixing them now.

Hi @AlexPikalov,

Thanks for fixing this! I won't have a chance to test it for a few days, but one thing about the design is confusing to me.

NodeTcpConfig implies a single node is the source for the events, which, in my mind, doesn't actually help us, since we still have a single point of failure. If that node happens to fail (or go down for maintenance, in the more likely scenario), we're still in the same position as before, where one node failing causes issues across the cluster. Am I missing something in how it's intended to be used or how it works?

Our method of using HAProxy to balance local DC nodes is working quite well, and it looks like this method would probably require us to continue doing that for the event source.