caio / foca

mirror of https://caio.co/de/foca/


Tweaking the `Config` for fast broadcast

jeromegn opened this issue · comments

Hey, it's me again!

I've been reading the docs on the Config and trying to figure out how to tweak it. I'm just using Config::simple() for now; it seemed sensible.

I noticed it took a full ~4-5 minutes for all broadcasts to fully propagate on a cluster of 6 nodes geographically far apart. I'm sending foca messages over UDP.

How would you tweak the config to make these broadcasts propagate faster? Perhaps increasing max_transmissions, as the documentation suggests?

I can't help but feel like 4-5 minutes is a very long time even with the default max_transmissions setting. If I send a single update it takes about 1 second to reach everywhere. If I send ~20+ broadcasts from nodes randomly as a test, it takes 4-5 minutes to fully propagate everything once I stop my test.

I'm testing this by updating values in a KV store and comparing the final state of each node. I'm diffing every state with every other state to make sure they're exactly the same. After the 4-5 minute delay, the logs stopped printing my line stating that the node received a gossip broadcast item to process. This coincided with the state being the same everywhere.

commented

It's so cool to be getting issues related to real usage, keep 'em coming! 😄

When I was writing the broadcasting example this popped up in my mind:

The current custom broadcast logic is essentially the same as the cluster update logic, which has a very high bias for recency; i.e. the newer the broadcast, the higher the chance it will be picked up. This works really well for the "one node, one key" pattern (NodeConfig in the example), but it's terrible when every individual broadcast is important.

I think the ideal solution here would be to either allow selecting the bias of the logic (older first vs newer first) or give full control to the user by letting them manage the buffer (this might make it too clunky/boilerplate-y to use).

There's a holiday on the horizon, so I'll be able to tackle this soonish. Meanwhile, I'd say you can do two things to improve your scenario:

  1. Playing around with max_transmissions: I'd try making it smaller. Your cluster is small and all messages were delivered, they just took too long, which to me implies it sent redundant info too many times
  2. Increasing the frequency of dissemination: periodically calling foca.gossip() based on foca.custom_broadcast_backlog() (rough sketch below). This may influence your tweaks on 1
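
For 2, assuming you're on tokio, something along these lines inside whatever task owns your foca instance would do. It's only a rough sketch: tune the period (250ms here is arbitrary) and the backlog threshold to taste:

// Rough sketch: keep gossiping while there's custom broadcast data to drain.
let mut tick = tokio::time::interval(std::time::Duration::from_millis(250));
loop {
    tick.tick().await;
    if foca.custom_broadcast_backlog() > 0 {
        if let Err(err) = foca.gossip(&mut runtime) {
            error!("foca gossip error: {}", err);
        }
    }
}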

Thank you, that helped! I initially tweaked max_transmissions to 4 (from 10) while adding a manual foca.gossip() if foca.custom_broadcast_backlog() > 1, and somehow that didn't quite work out: the cluster ended up out of sync. So instead I set max_transmissions to 8, and that appears to have resolved the situation.

The cluster had a synced state instantly when my load test was done.

I wasn't sure what to use for the backlog value here: foca.custom_broadcast_backlog() > 1. This is probably too aggressive!

Does calling foca.gossip() send all pending broadcasts to num_indirect_probes active members? Or just the oldest pending broadcast?

commented

Yeah a small max_transmissions coupled with a high rate of gossiping can lead to broadcasts (and cluster updates) not fully propagating (foca can cope with it because every message it receives is essentially an update similar to NodeConfig, so it eventually converges to the truth, just takes longer).

memberlist has its config parameters change based on the cluster size (foca should do that too): the formula is ceil(log(N + 1) * Multiplier), where N is the cluster size and Multiplier defaults to 4 (IIRC). ceil(log(6 + 1) * 4) = 8, so I think the max_transmissions = 8 you ended up with is a pretty good number 😄 They also have different parameters for WAN/LAN scenarios; might be worth using their config as a reference when crafting yours.

Whenever foca prepares a message (gossip or otherwise), it tries to stuff as much data as it can into max_packet_size bytes. First it packs the header, then cluster updates, then as many broadcasts as possible, giving priority to the least-transmitted ones (the recency bias I mentioned). So when you call foca.gossip() it does that preparation num_indirect_probes times and sends the results over.

It might be simpler to reason about your own logic and cluster state if you:

  1. Immediately foca.gossip() after foca.add_broadcast(), so that it starts the dissemination fast
  2. Then periodically gossip (I'd start with 500ms or so) regardless of the backlog size

This way the message frequency will be more predictable: you'd see the total byte size of messages increase when you have a lot of stuff to broadcast, but the actual number of messages would remain stable. So it'll always be easy to compute an upper bound on the bandwidth foca can consume.
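
Back-of-the-envelope, with made-up example values rather than your actual config:

// Example values only (not foca defaults): 1400-byte packets, fan-out of 3,
// one gossip round every 500ms.
let max_packet_size = 1_400usize;
let num_indirect_probes = 3usize;
let gossips_per_second = 2usize;

// Upper bound for gossip-originated outbound traffic per node, ignoring
// regular SWIM probe traffic and whatever the node receives.
let upper_bound = max_packet_size * num_indirect_probes * gossips_per_second;
assert_eq!(upper_bound, 8_400); // ~8.4 KB/s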

Thank you. Funny that I landed on a good guess for max_transmissions!

That's good info, I'll give that a shot. As soon as I started testing with 14 nodes, I got a divergent state. According to the formula, I should bring max_transmissions up to 14. Let's see how that goes!

The real cluster where this will be deployed is over 100 nodes (so, 100 identities). Expected to grow to 1000+. The nodes communicate over a private network, but latencies can easily reach 300+ms between the furthest nodes (pretty much at opposite sides of the world).

What other settings should I tweak here? num_indirect_probes should probably scale too, right?

the formula is ceil(log(N + 1) * Multiplier), where N is the cluster size

This turned out to be (I checked memberlist), in Rust:

((N as f64 + 1.0).log(10.0) * 4.0).ceil()

It might be simpler to reason about your own logic and cluster state if you:

  1. Immediately foca.gossip() after foca.add_broadcast(), so that it starts the dissemination fast
  2. Then periodically gossip (I'd start with 500ms or so) regardless of the backlog size

This worked out pretty well for short load tests.

If I leave my test on for 10+ minutes, I start seeing messages being retransmitted endlessly.

This is my very simple "foca runtime loop":

enum FocaInput {
    Announce(Actor),
    Data(Bytes),
    Broadcast(Bytes),
}

fn runtime_loop(
    mut foca: Foca<Actor, MessageCodec, StdRng, MessageReceiver>,
    mut rx_foca: UnboundedReceiver<FocaInput>,
    to_send_tx: UnboundedSender<(Actor, Bytes)>,
    notifications_tx: UnboundedSender<Notification<Actor>>,
) {
    let (to_schedule_tx, mut to_schedule_rx) = unbounded_channel();

    let mut runtime: DispatchRuntime<Actor> = DispatchRuntime::new(to_send_tx, to_schedule_tx, notifications_tx);

    let (timer_tx, mut timer_rx) = unbounded_channel();
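    // Bridge scheduled timers: sleep for each requested duration, then emit the timer event.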
    tokio::spawn(async move {
        while let Some((duration, timer)) = to_schedule_rx.recv().await {
            trace!("handling timer in {duration:?} => {timer:?}");
            let timer_tx = timer_tx.clone();
            tokio::spawn(async move {
                tokio::time::sleep(duration).await;
                timer_tx.send(timer).ok();
            });
        }
    });

    tokio::spawn(async move {
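        // Main event loop: foca inputs, fired timers, and the periodic gossip tick.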
        let mut gossip_interval = tokio::time::interval(Duration::from_millis(500));
        loop {
            tokio::select! {
                input = rx_foca.recv() => match input {
                    Some(input) => {
                        let result = match input {
                            FocaInput::Announce(actor) => foca.announce(actor, &mut runtime),
                            FocaInput::Data(data) => foca.handle_data(&data, &mut runtime),

                            // broadcast _and then_ gossip
                            FocaInput::Broadcast(data) => foca.add_broadcast(&data).and_then(|_| foca.gossip(&mut runtime)),
                        };

                        if let Err(error) = result {
                            error!("foca error: {}", error);
                        }
                    },
                    None => {
                        warn!("no more foca inputs");
                        break;
                    }
                },
                timer = timer_rx.recv() => match timer {
                    Some(timer) => {
                        if let Err(e) = foca.handle_timer(timer, &mut runtime) {
                            error!("foca: error handling timer: {e}");
                        }
                    },
                    None => {
                        warn!("no more foca timers, breaking");
                        break;
                    }
                },
                _ = gossip_interval.tick() => {
                    if let Err(error) = foca.gossip(&mut runtime) {
                        error!("foca gossip error: {}", error);
                    }
                }
            }
        }
    });
}

My load test sends 5 requests concurrently (triggering 5 broadcasts), each to a random server (out of 14); it does that 10x (50 broadcasts total), then waits between 1 and 3 seconds and starts a new iteration of the loop.

It's far from a perfect test, but it does help to figure out how the system behaves.

After I stopped the test, I still saw messages about receiving broadcasts on most nodes up to 7 minutes after the end of the test. I'm not sure why it would keep sending broadcasts for so long after, given I'm using such a tight loop.

I thought maybe it was a resource issue. I'm running the test with pretty resource-constrained nodes (1/8th of a CPU and 256MB of RAM). However, when I observed resources on a node while running the test, it barely broke a sweat: 1-2% CPU and only about 30MB of RSS memory. Network usage is pretty low: 50Kb/s both ways.

Small update: Using Log(10) made it so some gossip messages never reached some nodes. Tweaked it to Log(e) to get the same number we initially discussed. Works better! (I'm still seeing nodes getting messages a few minutes after I've sent everything... but this could just be the network, looks like it was a node in Singapore)
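
For reference, here's a quick check of what each base gives with that formula (multiplier of 4):

fn max_transmissions(n: usize, multiplier: f64, base: f64) -> u32 {
    ((n as f64 + 1.0).log(base) * multiplier).ceil() as u32
}

// base 10 (what memberlist uses): n = 6 -> 4, n = 14 -> 5
// base e:                         n = 6 -> 8, n = 14 -> 11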

commented

That's good info, I'll give that a shot. As soon as I started testing with 14 nodes, I got a divergent state. According to the formula, I should bring max_transmissions up to 14. Let's see how that goes!

I think you just typoed, but just in case: max_transmissions = 14 for 14 nodes sounds wrong; the log in the formula is supposed to dampen the growth heavily (max_tx should never be a very large number; combined with the fan-out to num_indirect_probes, updates should propagate super fast). (EDIT: replied before seeing the new comment, all good now!)

The real cluster where this will be deployed is over 100 nodes (so, 100 identities). Expected to grow to 1000+. The nodes communicate over a private network, but latencies can easily reach 300+ms between the furthest nodes (pretty much at opposite sides of the world).

What other settings should I tweak here? num_indirect_probes should probably scale too, right?

Whoa, super cool! SWIM (thus foca, assuming no outstanding bugs) shouldn't have a problem keeping up with that cluster size, but you should probably keep in mind that a large latency variance between nodes will essentially force you to tune the configuration for the slowest case. Not a terrible thing, but it will hurt the speed at which the protocol can detect a node failure.

If detecting failures is very important for your case, I'd try to architect the network to have a tree-like shape instead of a massively inter-connected graph. Say, clustering the nodes by region (EU, NA, etc) and something else reconciling the state for the whole world. Might not be possible depending on what you're trying to do tho.

Serf does a lot more on top of memberlist, but it's the same protocol and their convergence simulator might help you get a better grasp of how the configuration influences the behaviour: https://www.serf.io/docs/internals/simulator.html

After I stopped the test, I still saw messages about receiving broadcasts on most nodes up to 7 minutes after the end of the test. I'm not sure why it would keep sending broadcasts for so long after, given I'm using such a tight loop.

It's hard to say much about this without more details- are you able to get a working example to share? The runtime loop you shared looks perfectly fine to me so it's either something wrong with the broadcast handler (say receive_item yielding Ok(Some(...)) more often than it should), with the config or, of course, with foca itself.

If sharing is not possible, here's what might help figuring out what's going on:

  • Keep an eye on the custom broadcast backlog per node (if your broadcasts are large, max_packet_size will be the bottleneck for draining this backlog)
  • The count of unique broadcasts seen per node (so you can compare with your external knowledge of how many broadcasts you are sending)
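
For example, you could surface both from the loop you shared by logging in the gossip tick (sketch only; `seen` is a placeholder for whatever set of broadcast keys you track):

// Sketch: piggyback on the existing gossip tick to report both numbers.
// `seen` is a placeholder set of broadcast keys, inserted into whenever
// receive_item accepts a broadcast it hasn't seen before.
_ = gossip_interval.tick() => {
    info!(
        "custom broadcast backlog: {}, unique broadcasts seen: {}",
        foca.custom_broadcast_backlog(),
        seen.len(),
    );
    if let Err(error) = foca.gossip(&mut runtime) {
        error!("foca gossip error: {}", error);
    }
}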

Notice that a node will keep sending broadcasts to its peers until each broadcast has been transmitted max_transmissions times, so even after every node receives every unique broadcast you sent, there will still be some chatter in the network until it has all been flushed.

I thought maybe it was a resource issue. I'm running the test with pretty resource-constrained nodes (1/8th of a CPU and 256MB of RAM). However, when I observed resources on a node while running the test, it barely broke a sweat: 1-2% CPU and only about 30MB of RSS memory. Network usage is pretty low: 50Kb/s both ways.

Nice! That's a good indication that at least some things are working as intended 😁 SWIM is supposed to be super lightweight on the network, and foca was written with keeping cpu/mem usage low in mind.

commented

Small update: Using Log(10) made it so some gossip messages never reached some nodes. Tweaked it to Log(e) to get the same number we initially discussed. Works better! (I'm still seeing nodes getting messages a few minutes after I've sent everything... but this could just be the network, looks like it was a node in Singapore)

Ahhh, that's why the number looked odd! Ignore the beginning of my previous message then, cool!

If detecting failures is very important for your case, I'd try to architect the network to have a tree-like shape instead of a massively inter-connected graph. Say, clustering the nodes by region (EU, NA, etc) and something else reconciling the state for the whole world. Might not be possible depending on what you're trying to do tho.

Node failure detection isn't a very important part of our use case 😄.

Serf does a lot more on top of memberlist, but it's the same protocol and their convergence simulator might help you get a better grasp of how the configuration influences the behaviour: https://www.serf.io/docs/internals/simulator.html

I had seen it before. Pretty cool widget.

How does each of their knobs map to foca's Config? "Gossip interval" might map to my interval where I trigger gossips outside of foca's normal operation.

According to their chart, it should almost never take more than 3s for 99.99% convergence! That's not entirely what I observed. We have nodes with a RTT of 300ms, that's possibly as far as it gets. I don't think we have much packet loss either (if any).

It's hard to say much about this without more details- are you able to get a working example to share? The runtime loop you shared looks perfectly fine to me so it's either something wrong with the broadcast handler (say receive_item yielding Ok(Some(...)) more often than it should), with the config or, of course, with foca itself.

If sharing is not possible, here's what might help figuring out what's going on:

  • Keep an eye on the custom broadcast backlog per node (if your broadcasts are large, max_packet_size will be the bottleneck for draining this backlog)
  • The count of unique broadcasts seen per node (so you can compare with your external knowledge of how many broadcasts you are sending)

I can probably share more soon, but this is very much in a Proof-of-Concept state. The runtime loop I shared is the only thing that interacts with foca.

Here's the broadcast handler:

struct MessageReceiver {
    actor_id: ActorId,
    msg_tx: UnboundedSender<GossipMessage>,

    disseminated: HashSet<(ActorId, Timestamp)>,
    processed: Arc<RwLock<HashSet<(ActorId, Timestamp)>>>,
}

impl MessageReceiver {
    pub fn new(actor_id: ActorId, msg_tx: UnboundedSender<GossipMessage>, processed: Arc<RwLock<HashSet<(ActorId, Timestamp)>>>) -> Self {
        Self {
            actor_id,
            msg_tx,
            disseminated: HashSet::new(),
            processed,
        }
    }
}

const SIZE_OF_U64: usize = std::mem::size_of::<u64>();

impl BroadcastHandler for MessageReceiver {
    type Broadcast = RawGossipMessage;

    type Error = BroadcastError;

    fn receive_item(&mut self, mut data: impl bytes::Buf) -> Result<Option<Self::Broadcast>, Self::Error> {
        trace!("receive_item!");
        let remaining = data.remaining();
        trace!("remaining: {remaining}");
        if remaining < SIZE_OF_U64 {
            return Err(BroadcastError::NotEnoughBytes);
        }

        let len = { data.chunk().get_u64() } as usize; // peek the big-endian length prefix without advancing `data`
        trace!("msg len: {len}");
        let full_len = SIZE_OF_U64 + len;

        if remaining < full_len {
            return Err(BroadcastError::NotEnoughBytes);
        }

        let raw = RawGossipMessage(data.copy_to_bytes(full_len));

        trace!("checking timestamp");
        let timestamp = raw.timestamp().map_err(|e| BroadcastError::Validation(e.to_string()))?;
        trace!("TIMESTAMP: {timestamp:?}");

        let actor_id = raw.actor_id().map_err(|e| BroadcastError::Validation(e.to_string()))?;

        if !self.disseminated.insert((actor_id, timestamp)) {
            trace!("already seen, stop disseminating");
            return Ok(None);
        }

        trace!("never disseminated before, disseminate!");

        // TODO: update clock from somewhere
        // if let Err(_e) = self.clock.update_with_timestamp(&timestamp) {
        //     warn!("unable to update clock");
        // }

        if actor_id != self.actor_id && { self.processed.write().insert((actor_id, timestamp)) } {
            match raw.parse() {
                Ok(msg) => {
                    self.msg_tx.send(msg).ok();
                }
                Err(e) => {
                    error!("could not parse raw message: {e}");
                }
            }
        }

        Ok(Some(raw))
    }
}

This is a pretty naïve implementation 😄 but it works for testing purposes. Eventually I should just assign a message ID and check that instead of the timestamp.

I have to keep an extra HashSet of processed messages because we're trying to optimize gossip dissemination between nodes in the same region. So when a node gossips, it also sends the same message via HTTP to all nodes of the same "group" (region).

I have the two HashSets because there's definitely a distinction between "should disseminate" and "should process", as far as I understand.

Notice that a node will keep sending broadcasts to its peers until each broadcast has been transmitted max_transmissions times, so even after every node receives every unique broadcast you sent, there will still be some chatter in the network until it has all been flushed.

That makes sense to me and I was expecting that. Unfortunately the nodes waiting several minutes for updates needed them. I'm only logging when actual never-seen-before updates arrive.

I'll keep tweaking the config until I get faster dissemination.

Hmm, did I do something wrong? This max_transmissions value seems high: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=4eec5d024a9e78cb0aed959d6c550777

I'm using log(e), as is the default in Ruby and Python, even if memberlist uses log(10).

commented

How does each of their knobs map to foca's Config? "Gossip interval" might map to my interval where I trigger gossips outside of foca's normal operation.

According to their chart, it should almost never take more than 3s for 99.99% convergence! That's not entirely what I observed. We have nodes with a RTT of 300ms, that's possibly as far as it gets. I don't think we have much packet loss either (if any).

Argh, this simulator is actually not very useful for your case: it helps build intuition about tuning the SWIM parts of foca and the speed of propagating cluster updates about failed nodes, which is similar to broadcasting NodeConfig-style. I keep forgetting that for your use case, convergence === delivery of every update; sorry about the wrong cue.

I have to keep an extra HashSet of processed messages because we're trying to optimize gossip dissemination between nodes in the same region. So when a node gossips, it also sends the same message via HTTP to all nodes of the same "group" (region).

Thanks for sharing the code, looks fine to me too! This region-based stream dissemination coupled with not needing to care about node failures (excellent scenario 😁 ) paints a clearer picture of the direction you're going.

So this got me thinking that you might want better selection of who foca sends the custom broadcasts to (only disseminate outside the region, for example). I was going to suggest adding an API for that, but another thought came up:

There are many knobs to turn (everything in the config, how to select the broadcasts to send, maybe which members to talk to, probably more). The benefit of all that happening inside foca is somewhat small:

  1. you get the guarantee that only broadcasts from active cluster members are processed
  2. it manages the retransmission logic for you

So I'm thinking of experimenting with disconnecting the broadcast handler from the foca instance. Usage would end up something like (pseudocode):

# Receiving
data = receive_from_network()
is_valid, bytes_read = foca.handle_data(data)

if is_valid:
      broadcast_handler.receive_item(data[bytes_read..])
      ... 

# For sending you could
# - attach the data you choose to broadcast on `runtime::send_to`, kinda like how it works now; Or
# - pick members however you want from foca and send just your updates

So, pretty similar to typical framed/enveloped network packet handling ([ethernet[ip[udp[data]]]]). It would let you control any knob and instrument at will without having to evolve foca::Config; and foca can provide support by, for example, making Broadcasts public and providing example code.

What do you think? It's gonna entail some largeish changes (Codec will likely need to change), but I find the idea very attractive since it keeps the configuration surface for swim/foca very lean.

Hmm, did I do something wrong? This max_transmissions value seems high: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=4eec5d024a9e78cb0aed959d6c550777

I'm using log(e), as is the default in Ruby and Python, even if memberlist uses log(10).

Good catch! I'd follow in memberlist's footsteps. I'm doing as much guesswork as you on this end 😅 ATM I only operate a tiny foca LAN cluster with no custom broadcasting.

What do you think? It's gonna entail some largeish changes (Codec will likely need to change), but I find the idea very attractive since it keeps the configuration surface for swim/foca very lean.

I enjoy not having to deal with too much logic around that to be honest!

The benefit of all that happening inside foca is somewhat small:

  1. you get the guarantee that only broadcasts from active cluster members are processed
  2. it manages the retransmission logic for you

These aren't that small :)

Perhaps it would go a long way to add a simple Fn<T: Identity>(T) -> bool to make various decisions. A bit like a middleware? Maybe this is scope creep!

Ideally, I wouldn't even need to do the "direct gossip" to same-region-nodes if I could control the gossip "routing" a bit more, but I don't want to deal with retransmission or making sure the messages are sent to active nodes.

Unrelated: looks like setting my gossip interval to 100 or 200ms really makes state converge incredibly fast.

commented

Yeah, you're right- "foca does the broadcasting for you" is a pretty good selling feature. Besides, I can still unlock the pattern I mentioned without needing to drop all the broadcasting features if the need arises in the future.

Perhaps it would go a long way to add a simple Fn<T: Identity>(T) -> bool to make various decisions. A bit like a middleware? Maybe this is scope creep!

Ideally, I wouldn't even need to do the "direct gossip" to same-region-nodes if I could control the gossip "routing" a bit more, but I don't want to deal with retransmission or making sure the messages are sent to active nodes.

I like the idea. Here's what we can do:

  1. A new (optional) method on the BroadcastHandler trait to decide whether or not to include broadcast data in the payload based on the dst identity (I don't like the name, but something like BroadcastHandler::should_broadcast(dst: &Identity) -> bool; sketched below)
  2. This would mean that foca.gossip() may end up not disseminating any broadcast when called (say, every member it picked returns false for should_broadcast), so we'd need a way to guarantee the dissemination of custom broadcasts. So: foca.broadcast(), which works similarly to foca.gossip() but takes should_broadcast into account before selecting the members to send a message to
  3. Introduce a new message type (Message::Broadcast) that only contains the header and the broadcasts, no cluster updates. This is because otherwise it would break the update dissemination math: instead of disseminating updates via round-robin-then-shuffle, it would send most updates to members where should_broadcast == true
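
Roughly, the handler side of 1 could end up looking like this. It's just a sketch of the idea, not the current foca API; the names and exact signatures will very likely change once I start writing it:

use bytes::Buf;

// Sketch only: the handler grows a provided method that foca would consult
// right before attaching custom broadcast data to a message addressed to
// `member`.
pub trait BroadcastHandler<Identity> {
    type Broadcast;
    type Error;

    fn receive_item(
        &mut self,
        data: impl Buf,
    ) -> Result<Option<Self::Broadcast>, Self::Error>;

    // Defaults to true, which keeps today's behaviour of always attaching
    // broadcasts; returning false means the message still goes out, just
    // without any custom broadcast data in it.
    fn should_broadcast(&self, _member: &Identity) -> bool {
        true
    }
}

With 2 in place, the periodic call in your runtime loop would move from foca.gossip() to foca.broadcast() for draining the custom broadcast backlog.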

Would that help? I think it covers your "simple Fn<T: Identity>(T) -> bool" idea; Not sure if powerful enough for arbitrarily fancy routing.

I'm happy to evolve BroadcastHandler as much as necessary to unlock this use-case; The only thing that concerns me is growing Foca::Config because it already has too many knobs, but for this feature there's zero need to change it :)

Unrelated: looks like setting my gossip interval to 100 or 200ms really makes state converge incredibly fast.

Nice! With better broadcast routing you might even be able to slow this rate down a bit

Would that help? I think it covers your "simple Fn<T: Identity>(T) -> bool" idea; Not sure if powerful enough for arbitrarily fancy routing.

I think that's pretty good.

I've had a different thought: what if that same BroadcastHandler::should_broadcast(dst: &Identity) -> bool function let us return an enum instead, like:

enum BroadcastDecision {
  Default,
  Yes,
  No,
}

This is very badly named, but it essentially lets the user decide whether special handling is required for an Identity or whether it should just use the default algorithm (pick 3 random indirect active nodes).

Maybe Option<bool> would do the same. None would trigger the default behaviour.

In the actual code, this would mean "pooling" identities as:

  • Random (pick 3, broadcast)
  • Don't send
  • Do send

Maybe this is dumb, I'm not sure! I don't really want to handle the randomness of broadcast decisions (even if it's not that hard).

I'm happy to evolve BroadcastHandler as much as necessary to unlock this use-case; The only thing that concerns me is growing Foca::Config because it already has too many knobs, but for this feature there's zero need to change it :)

Makes sense! I appreciate a simple config.

commented

Maybe this is dumb, I'm not sure! I don't really want to handle the randomness of broadcast decisions (even if it's not that hard).

It's not a dumb thought at all, but if I got it correctly, just the bool type will do what you want; I think the confusion comes from my bad choice of name for the method (it should read something along the lines of "should I attach custom broadcast data to the message I'm about to send to this member" instead of just should_broadcast).

I think seeing code will help make the flow clearer, but the idea is that foca will call this method every time it sends a message to any member during its normal operation. By default it always appends the broadcasts to the message; this would let you decide when not to do that. The messages would still be sent either way.

The "problem" with doing just that is that the assumption that foca.gossip() will always help disseminate broadcasts no longer holds, because it may pick only members you're returning false for; the gossip messages would then only contain cluster updates, not your broadcasts. If we stopped here, I'd bet it would hurt your convergence tests.

That's why I propose (in 2) the new foca.broadcast() API: it guarantees that it picks members that CAN receive this data. So instead of periodically calling foca.gossip() every few milliseconds, you'd be calling foca.broadcast(). In 3 I explain why I'd need to introduce a new message type instead of reusing an existing one, but I think that only added to the confusion 😅

I'll start hacking on it; hopefully I'll have some code to show soon to make this clearer.

commented

The PR is ready!

If you take a look at the foca.broadcast() impl, you'll see that it's eerily similar to foca.gossip(). The significant change is just here: https://github.com/caio/foca/blob/broadcast_routing/src/lib.rs#L392=

Gonna wait a little bit before merging just to give you a chance to interrupt me in case this doesn't help at all. It's quite possible we'll have to iterate a bit to get it right, so don't worry about reviewing it properly; a simple "it looks like I'll be able to use it" will be very appreciated. And if you're too busy to do it today, I'll ship it anyway EOD and we'll deal with required changes as they come :)

commented

v0.2.0 released, thanks again for all the input! Looking forward to hearing how this goes 😄

Sure thing! Thank you!

I'm going to give this a shot.