AlexPikalov / cdrs

Cassandra DB native client written in Rust. Find 1.x versions at https://github.com/AlexPikalov/cdrs/tree/v.1.x. Looking for an async version? Check the WIP at https://github.com/AlexPikalov/cdrs-async.

RoundRobin Load Balancing: when a node is down, 1/N of all requests always fail.

lseelenbinder opened this issue

Due to how RoundRobin(Sync) is configured, whenever one of the backing nodes is down because of outages or maintenance, all requests that would be routed to that R2D2 pool fail (because that pool has no live connections and cannot create any more).
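
To make the failure mode concrete, here is a minimal sketch (not the actual cdrs code) of a health-unaware round-robin over per-node pools:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Minimal sketch, not the actual cdrs implementation: one pool per node,
// handed out in strict rotation with no awareness of node health.
struct RoundRobinSketch<P> {
    pools: Vec<P>, // one r2d2-style pool per configured node
    next: AtomicUsize,
}

impl<P> RoundRobinSketch<P> {
    fn next_pool(&self) -> &P {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.pools.len();
        // If the node behind pools[i] is down, the pool has no live
        // connections and cannot create new ones, so every request that
        // lands on this slot (1/N of the total) fails.
        &self.pools[i]
    }
}
```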

In my opinion, this is a blocking bug for using the RoundRobin load-balancing mechanism, since it removes any possibility of failover to another node without implementing somewhat complex logic in the client.

Was this a known limitation I overlooked, or should we look into adjusting the implementation so that the collection of known nodes is used equally and, when one is down, the others can still be used?
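
For what it's worth, one possible shape of that adjustment, sketched with a hypothetical is_alive health check (not an existing cdrs or r2d2 API):

```rust
// Sketch of the suggested behaviour: rotate over all known nodes, but skip
// the ones that are currently down. `is_alive` is a hypothetical health
// check used only for illustration.
fn next_live_pool<'a, P>(
    pools: &'a [P],
    start: usize,
    is_alive: impl Fn(&P) -> bool,
) -> Option<&'a P> {
    (0..pools.len())
        .map(|offset| &pools[(start + offset) % pools.len()])
        .find(|&pool| is_alive(pool))
}
```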

Hi @lseelenbinder,
Yes, it was overlooked. I will try to come up with a solution for that bug.
Thanks for reporting it.

No problem, @AlexPikalov.

After I realized what was happening, I knew it was a bit of an edge case that wouldn't be easy to replicate accidentally during testing, but would be quite common in production, because machines are always coming and going during maintenance.

We're just going to revert to SingleNode and use HAProxy to load balance the actual instances, so this isn't a blocker for us to go into production.

@lseelenbinder
Good to know that it doesn't block you. However, I think this is a good occasion to implement a feature that was requested almost two years ago in #113. The solution may be based on Cassandra server events, namely the topology change event for a removed node. If the load balancer removes a node in reaction to this event, it should help avoid the situation where the load balancer returns a dead node.
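
Roughly, the idea looks like this (illustrative types only, not the actual cdrs event API):

```rust
use std::net::SocketAddr;
use std::sync::Mutex;

// Illustrative types only; cdrs exposes server events differently.
enum TopologyChange {
    NewNode(SocketAddr),
    RemovedNode(SocketAddr),
}

struct DynamicNodeList {
    nodes: Mutex<Vec<SocketAddr>>, // nodes currently eligible for load balancing
}

impl DynamicNodeList {
    // React to a topology change pushed by the server: drop removed nodes so
    // the balancer never hands out a connection to a dead one, and add new
    // nodes as they join.
    fn on_topology_change(&self, event: TopologyChange) {
        let mut nodes = self.nodes.lock().unwrap();
        match event {
            TopologyChange::RemovedNode(addr) => nodes.retain(|n| *n != addr),
            TopologyChange::NewNode(addr) => {
                if !nodes.contains(&addr) {
                    nodes.push(addr);
                }
            }
        }
    }
}
```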

@AlexPikalov, that's a great idea!

My only concern is keeping the ability to limit which nodes a specific config would ever connect to, regardless of added or removed nodes (even if that means it has no live nodes to talk to).

Hi @lseelenbinder,
I've just completed a draft implementation of dynamic clusters in #313.

These changes remove a dead node from cluster load balancing based on a Topology Change event received from a node. I'm about to test it, but I would really appreciate it if you could check from your end whether it solves your case. Here is the new session factory function that includes this logic: https://github.com/AlexPikalov/cdrs/blob/feat/113/src/cluster/session.rs#L236.

Compared to new, it has an extra argument event_src: NodeTcpConfig<'a, A>, which is the configuration for the node that will be used as the event source.
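
A rough usage sketch based on the above; new_dynamic_session is only a placeholder for the actual factory name from the branch, and the builder calls may differ slightly from the released API:

```rust
// Rough usage sketch only. `new_dynamic_session` is a placeholder for the
// factory added in #313 (the real name may differ); the rest follows cdrs's
// usual cluster-config builders.
use cdrs::authenticators::NoneAuthenticator;
use cdrs::cluster::{ClusterTcpConfig, NodeTcpConfigBuilder};

fn example() {
    let nodes = ClusterTcpConfig(vec![
        NodeTcpConfigBuilder::new("10.0.0.1:9042", NoneAuthenticator {}).build(),
        NodeTcpConfigBuilder::new("10.0.0.2:9042", NoneAuthenticator {}).build(),
    ]);
    // The extra argument compared to `new`: the node used as the event source.
    let event_src = NodeTcpConfigBuilder::new("10.0.0.1:9042", NoneAuthenticator {}).build();
    // let session = new_dynamic_session(&nodes, cdrs::load_balancing::RoundRobin::new(), event_src)
    //     .expect("session should be created");
}
```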

So far I've found some issues with the proposed solution. I'm fixing them now.

Hi @AlexPikalov,

Thanks for fixing this! I won't have a chance to test it for a few days, but one thing about the design is confusing to me.

NodeTcpConfig implies a single node is the source for the events, which, in my mind, doesn't actually help us, since we still have a single point of failure. If that node happens to fail (or go down for maintenance, in the more likely scenario), we're still in the same position as before, where one node failing causes issues across the cluster. Am I missing something in how it's intended to be used or how it works?

Our method of using HAProxy to balance local DC nodes is working quite well, and it looks like this method would probably require us to continue doing that for the event source.