caio / foca

mirror of https://caio.co/de/foca/


`updates_backlog` grows to a certain number and stays there

jeromegn opened this issue

I haven't been looking at it, but foca.updates_backlog() appears to be reporting 10832. Restarting my program just makes it grow back up to almost the same number of backlog items.

Is this because we're trying to retransmit too much?

[image: updates_backlog graph for a single node]

(That's a single node, it restarted at the time of that dip)

All nodes:

[image: updates_backlog graphs for all nodes]

commented

😱 that's not good

It could be caused by an unreasonably large Config::max_transmissions; But I don't think that's the case otherwise you'd see the backlog decreasing every now and then (a healthy cluster should have a small backlog- even going down to zero sometimes if things are super stable- with the occasional increase when a node rejoins / there's a mass restart)

Your effective members count (your Members hashmap, keyed by ActorId) is still reasonable and you're using a plain Config::new_wan() (or something very similar), right?

If that's the case, this means that the updates backlog is full of repeated ActorIds

Apart from manually crafted packets, this is highly indicative of two or more members declaring each other as down non-stop. There's only one scenario I can think of that would lead to such a large volume of updates tho:

Two or more members "competing" for the same prefix (as in Identity::has_same_prefix), which would trigger this code. Is there any chance you have an ActorId clash between two nodes? Or maybe an address clash, with nil actorid?

It's weird that the backlog size seems to stabilize, but I think (with very little confidence, I lack the math chops) that the growth to infinity is prevented by the frequent gossiping and the natural delay to propagate cluster updates within the cluster... I'll try to think of more scenarios where this can happen

Your effective members count (your Members hashmap, keyed by ActorId) is still reasonable and you're using a plain Config::new_wan() (or something very similar), right?

Yes, we're using this:

fn make_foca_config(cluster_size: NonZeroU32) -> foca::Config {
    let mut config = foca::Config::new_wan(cluster_size);

    // max payload size for udp over ipv6 wg - 1 for payload type
    config.max_packet_size = EFFECTIVE_CAP.try_into().unwrap();

    config
}

Our cluster size is ~327 nodes.

I just looked at each actor ID and they don't conflict (I would've been surprised if they did).

Would it be possible to add a statistics function to dump counters for various interesting stuff? We're using prometheus and I could set a few gauges per node. Maybe the count for each kind of message in the backlog? I doubt it's observable like that in its current state.

It's weird that the backlog size seems to stabilize, but I think (with very little confidence, I lack the math chops) that the growth to infinity is prevented by the frequent gossiping and the natural delay to propagate cluster updates within the cluster... I'll try to think of more scenarios where this can happen

That's possible! It does grow a little bit over time.

Small update: I triple checked to make sure there were no nodes with older foca versions or the same prefix. That's likely not the problem.

I forked foca and added a log line near the one you linked to see how often that branch was called. It's definitely called, but it does appear to only be due to the node registering as down from other nodes. It's also not happening nearly enough to reach the huge updates backlog we're seeing.

I still haven't tracked down #18, which may be related?

I fear this backlog is slowing down update propagation quite a bit.

commented

Would it be possible to add a statistics function to dump counters for various interesting stuff? We're using prometheus and I could set a few gauges per node. Maybe the count for each kind of message in the backlog? I doubt it's observable like that in its current state.

That's definitely a great feature to have! I wonder if collecting the stats inside foca is the best approach- maybe it's better to add support for a metrics collector so that every aggregation/rollout it might do ends up with the same behaviour? Or maybe a prometheus-specific impl for starters? I'm not familiar with any metrics stack for rust so I'm open to ideas how to approach this

That said, I'm not sure it will help with figuring out this particular issue since every message foca sends will be emitting the same sort of updates and the updates backlog doesn't contain a reference to the first-id+message-that-sent-this-update.

What I think would help figuring this out is inspecting what's inside the backlog- I can write up a patch in a separate branch to expose a foca.dump_updates_backlog() then you can pick it and dump it from a single node (doesn't matter which, the data should be roughly the same in the cluster) and see if there's a pattern when grouping by actor id.

Any chance you have a member in a sort of crash restart loop state? Say, it joins the cluster, crashes, then immediately restarts with a new id (a new bump I reckon)

That's definitely a great feature to have! I wonder if collecting the stats inside foca is the best approach- maybe it's better to add support for a metrics collector so that every aggregation/rollout it might do ends up with the same behaviour? Or maybe a prometheus-specific impl for starters? I'm not familiar with any metrics stack for rust so I'm open to ideas how to approach this

I believe this is what a lot of people use: https://github.com/metrics-rs/metrics (we do).
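
For what it's worth, all we'd need on our side is something like this - metric names made up, and I'm assuming the newer metrics crate API where gauge!(name) returns a handle you .set() (older releases take the value directly in the macro):

// Metric names made up; we'd call this from wherever we already drive
// foca, e.g. right after handling a timer tick.
fn report_foca_gauges(updates_backlog: usize, num_members: usize) {
    metrics::gauge!("foca_updates_backlog").set(updates_backlog as f64);
    metrics::gauge!("foca_num_members").set(num_members as f64);
}

// usage: report_foca_gauges(foca.updates_backlog(), foca.num_members());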

I can write up a patch in a separate branch to expose a foca.dump_updates_backlog() then you can pick it and dump it from a single node (doesn't matter which, the data should be roughly the same in the cluster) and see if there's a pattern when grouping by actor id.

That sounds good and easy to do for us. We have special testing nodes that are part of the production cluster.

Any chance you have a member in a sort of crash restart loop state? Say, it joins the cluster, crashes, then immediately restarts with a new id (a new bump I reckon)

I don't believe so. We have alerts for frequently-restarting services and I haven't seen that for corrosion (the thing that uses foca). It often runs for days / weeks at a time without a restart.

It'd be fine if a function only returned the counters or whatever interesting data that people can periodically call and then send into the metrics store of their choosing.

I've added a function to dump all entries in the updates field. Of course they're all ClusterUpdate, but they appear to be for various nodes. Not any specific one.

Could it be that we're announcing way too often? I'm not making any changes to identities or renewing.

Is there something periodically announcing nodes, even if no changes were made? I'm looking through the code and it does look like things are added in there at various times. Can this even happen on Ping? Sorry if this is a dumb question, but this is hard to grok (I haven't spent much time on it).

I started dumping members every 10 seconds from our test node and bump never changes or any other information about the node (I'm using diff here to compare the previous 10 seconds with the latest).

I noticed: It takes a long time for a node to be aware of all other nodes. Around 20-30 minutes! I don't know why that is. I feel like gaining knowledge of the whole cluster should be a priority.

commented

i've sent a PR to your fork with what i think would be useful jeromegn#1

(feel free to scrub the dump if you're uncomfortable sharing it. happy to receive it on my email too if you don't want it public)

Could it be that we're announcing way too often? I'm not making any changes to identities or renewing.

Is there something periodically announcing nodes, even if no changes were made? I'm looking through the code and it does look like things are added in there at various times. Can this even happen on Ping? Sorry if this is a dumb question, but this is hard to grok (I haven't spent much time on it).

Not a dumb question at all :) it isn't super straightforward because there are two states being managed at once (the members list and cluster updates) and change in members leads to change in updates, but change in updates doesn't necessarily lead to change in members

An instance can announce every microsecond and this shouldn't cause issues (apart from terrible bandwidth cost). Lemme try to give an outline:

every message ends up hitting this code, which leads to mutating the updates buffer here if there's any change detected (either the identity changed OR the Incarnation/State (from swim) did)

So what happens is that for every received (and accepted) message we may mutate the updates backlog multiple times (rough sketch after this list):

  • Once for the sender: try to apply a state update with (msg.src_identity, msg.src_incarnation, State::Alive). This succeeds if it's either a new, unknown cluster member or a known one that has incremented its incarnation (the failure detection machinery from swim)
  • N times, based on the number of updates that came along with the message (same logic, but the state is not always Alive here - it's whatever the update says)
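
This is not foca's actual code, just the sequence described above with stub types:

// Not foca's real implementation - just the flow described above.
enum State { Alive, Suspect, Down }

struct Update<I> { identity: I, incarnation: u16, state: State }

struct Message<I> { src: I, src_incarnation: u16, updates: Vec<Update<I>> }

// `apply` is whatever mutates the member list; every call that actually
// learns something new also pushes an entry onto the updates backlog.
fn on_message<I>(msg: Message<I>, apply: &mut impl FnMut(I, u16, State)) {
    // 1) the sender itself is evidence of being alive
    apply(msg.src, msg.src_incarnation, State::Alive);

    // 2) every piggybacked update, with whatever state it carries
    for u in msg.updates {
        apply(u.identity, u.incarnation, u.state);
    }
}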

I noticed: It takes a long time for a node to be aware of all other nodes. Around 20-30 minutes! I don't know why that is. I feel like gaining knowledge of the whole cluster should be a priority.

SWIM's priority is disseminating the freshest updates (i.e.: the most recent cluster changes). This means that it may take a long while for a member to learn about every node in the cluster: if those nodes are super stable, the only way to learn about them is when they decide to ping this new member

This is where periodic announce is useful: with it, we start actually learning about every active member in the cluster, even if it's super stable and never changes anything in their state.

The reason it's taking so long for your nodes to learn about the full cluster is directly related to the massive backlog length here. If my understanding of the scenario of this bug is correct, foca thinks that there are ~10k members in the cluster (i.e.: foca.num_members is very high, similar order of magnitude of your updates_backlog)

The debug info PR I sent you should help pinpoint where to start looking because it will tell us precisely what sort of info is being disseminated


We can definitely minimise this problem by making identity a lot more strict - it's a somewhat large breaking change that I've been meaning to do, but not super difficult. I think it's better to diagnose this before making any change there because it might be a logical bug inside foca that we would be hiding by making ids strict

i've sent a PR to your fork with what i think would be useful jeromegn#1

Deploying this in a bit, will report back!

We can definitely minimise this problem by making identity a lot more strict

What does "strict" mean here?

This got me thinking that our identity is somewhat big. I could split it off and only keep the SocketAddr + bump, then put the rest of the data as an "operation" (which is the actual data we use the cluster to propagate). I think it'd make more sense.

It also got me thinking: maybe this issue is just because I'm not prioritizing foca "payloads" in my code. I'm mixing in a bunch of other types of messages in there (like custom broadcasts), which likely happens so often that there could be hundreds / thousands of broadcasts before we get to process / send foca payloads. It makes a lot of sense with some of the alerts I've set up where a node would "stop gossiping" for a minute, likely because it has a large backlog of broadcasts to process.

Ok, I have a JSON dump. It's a lot of work to censor it. It's not entirely sensitive, but better safe than sorry. I'll send you an email with the whole thing!

{
  "current_identity": {
    "id": "809865b6-2519-487d-9a69-7efd023000a2",
    "name": "dev-hostname",
    "addr": "[fc01:a7b:ac7::]:7878",
    "region": "dev",
    "datacenter": "hil1",
    "write_mode": "read-write",
    "instance_id": "8bd7b174-dca7-496b-bd62-3f165c0e284e",
    "bump": 43034
  },
  "total_members": 6390,
  "active_members": 347,
  "backlog_len": 12347
}

(This was taken during a mass restart)

What's immediately catching my eye is the "total_members": 6390. Is that right? Sounds wrong!

I have put all foca-related operations in its own tokio task / loop and the effect seems to be that nodes register all other nodes faster. Unencumbered by all the custom broadcasts.

commented

Sorry, this is going to be an even larger message than usual πŸ˜… I know it's hard to follow these things, so I put some bold text near where I think it needs close attention.

Looks like my conjecturing in the previous wall-of-text was correct:

The reason it's taking so long for your nodes to learn about the full cluster is directly related to the massive backlog length here. If my understanding of the scenario of this bug is correct, foca thinks that there are ~10k members in the cluster (i.e.: foca.num_members is very high, similar order of magnitude of your updates_backlog)

Notice how it looks like there's a correlation between total_members and backlog_len.

Thanks for the data! I hacked a little script to convert the backlog output into a csv:

import sys
import json
import csv

writer = csv.writer(sys.stdout)
data = json.load(open("data.json", "r"))

id_keys = ['id', 'name', 'addr', 'region', 'datacenter', 'write_mode', 'instance_id', 'bump']
header = id_keys + ["incarnation", "state"]

writer.writerow(header)

for entry in data["backlog"]:
    d = entry["decoded_data"]
    id = d["id"]

    row = [id[k] for k in id_keys]
    row.extend([d["incarnation"], d["state"]])

    writer.writerow(row)

Then loaded it into sqlite to make it easy to explore

sqlite> .import loadme.csv foca --csv
sqlite> .schema foca
CREATE TABLE IF NOT EXISTS "foca"(
  "id" TEXT,
  "name" TEXT,
  "addr" TEXT,
  "region" TEXT,
  "datacenter" TEXT,
  "write_mode" TEXT,
  "instance_id" TEXT,
  "bump" TEXT,
  "incarnation" TEXT,
  "state" TEXT
);

And some things popped right up (not dumping the output so I don't expose your data).

When grouping by addresses (select addr,region, count(1) from foca group by 1 order by 3 desc limit 10;) the top 2 have a way higher count than the rest and they are both in the same region (Chile, I think?).

QUESTION Is the connectivity between regions working fine? i.e.: can nodes from any region talk to members in every other region?

Looking at the bump from the address with the largest count we see that it is (mostly) contiguous:

sqlite> select cast(bump as integer) from foca where addr = 'THE-TOP-ADDR' order by 1 desc limit 5;
64661
64109
64108
64107
64106

Assuming your bump is initialised randomly, this implies that there's definitely something declaring that node down and it detects it, triggers Identity::renew() to increment the bump and rejoins.

Now, the fact that this is mostly happening for every identity in your cluster (just more on those top two) is quite interesting too.

It may be that there's a combination of multiple factors:

  1. Do you call foca.leave_cluster() before restarting? Not doing it means that a random member of the cluster will have to try pinging the old identity to discover it has left, slowing down the overall knowledge propagation (since it's wasting ping cycles and updates with an offline id)
  2. Since this is still on the same PayloadKind plane it means that the updates will persist in the cluster until it gets fully disseminated and forgotten
  3. With this incredibly large backlog it may be that by the time one node receives an update about a member being down, it has already seen it but forgotten about it (i.e.: Config::remove_down_after elapses, the node forgets about it, then learns about it again because of a retransmission and the update goes back to backlog)

We need to flush this backlog from the cluster otherwise we'll be chasing ghosts.

Recommendations:

  1. Introduce a new PayloadKind so we can start with a clean cluster-wide updates backlog
  2. Make Config::remove_down_after very high. 2+ hours, for example. It's fine to do so: since you use a bump your members can still rejoin no problem and remembering a dead node for very long will still cost less RAM than the current massive persistent backlog :D (a minimal config sketch follows this list)
  3. Add a log line here, where foca declares a node down - so that we can know which node is declaring others down (I'll have some real free time next week, hopefully I'll get to address #14)
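
For item 2, a minimal sketch on top of the make_foca_config you already posted (remove_down_after is a plain Duration field on the config, same as max_packet_size):

use std::time::Duration;

fn make_foca_config(cluster_size: NonZeroU32) -> foca::Config {
    let mut config = foca::Config::new_wan(cluster_size);

    // max payload size for udp over ipv6 wg - 1 for payload type
    config.max_packet_size = EFFECTIVE_CAP.try_into().unwrap();

    // Remember down members for much longer so the cluster stops
    // re-learning (and re-gossiping) the same dead identities.
    config.remove_down_after = Duration::from_secs(2 * 60 * 60);

    config
}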

Once that's in place, either the problem might completely disappear (i.e. foca was indeed chasing ghosts and kept forgetting they had been found in the past) or it will slowly creep back up to massive numbers (i.e.: there really is a problem with the cluster connectivity OR a bug in foca). If the backlog does increase again, then we do the foca.debug_info() dance again, now with a clean slate that will hopefully give a better direction on where to look.


This got me thinking that our identity is somewhat big. I could split it off and only keep the SocketAddr + bump, then put the rest of the data as an "operation" (which is the actual data we use the cluster to propagate). I think it'd make more sense.

I agree. A small identity would definitely help disseminating knowledge faster and increase the speed at which a node discovers every member of the cluster.

It's how I envisioned custom broadcasts being used originally: you have a tiny id and send a custom broadcast with the extra metadata (like the NodeConfig thing in the broadcasting example)

Could even go wild and trim the ipv6 prefix from your addresses as they're all the same; the port number too. Not really necessary, but the smaller the identities, the faster knowledge propagates.

What does "strict" mean here?

I'm not sure what exactly it would entail, but what I mean by strict is forcing total ordering: given two identities with the same addr/prefix we'd need a way to define which is the valid one (something similar to the invalidates() concept from custom broadcasts- or, more generally, identities could be forced to implement core::cmp::Ord and foca would prefer the Ordering::Greater and discard all knowledge of the other ones)

Behaving like this would essentially make the problem we're looking at invisible. What we're looking at is multiple identities with the same address and different bump. If foca knew which is the right identity+bump to keep, it wouldn't need to talk about any of the "older" bumps.
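
Very roughly, the shape of the idea (made-up identity type; none of this is foca API today):

use std::net::SocketAddr;

// Made-up identity: derive a total order so that a conflict always has
// an unambiguous winner (address first, then the highest bump).
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Id {
    addr: SocketAddr,
    bump: u16,
}

// Given two identities claiming the same address, keep the greater one
// and drop all knowledge about the loser.
fn pick_winner(a: Id, b: Id) -> Id {
    debug_assert_eq!(a.addr, b.addr);
    if a >= b { a } else { b }
}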

Thanks for looking into the data!

QUESTION Is the connectivity between regions working fine? i.e.: can nodes from any region talk to members in every other region?

They can, but they're at the whims of the internet since nodes are globally distributed. As you noticed, the situation is a lot worse in some regions compared to others. However, it shouldn't be that bad; maybe there's something to look at here. I can check with our network provider.

Assuming your bump is initialised randomly, this implies that there's definitely something declaring that node down and it detects it, triggers Identity::renew() to increment the bump and rejoins.

Yes, it's initialized randomly and then incremented on rejoin. Like in your simple example in this repo.

  1. Do you call foca.leave_cluster() before restarting? Not doing it means that a random member of the cluster will have to try pinging the old identity to discover it has left, slowing down the overall knowledge propagation (since it's wasting ping cycles and updates with an offline id)

I'm not. I can add that!

  1. Introduce a new PayloadKind so we can start with a clean cluster-wide updates backlog

Ah, I see. The backlog won't ever clear itself? Even on restart?

  1. Make Config::remove_down_after very high. 2+hours, for example. It's fine to do so: since you use a bump your members can still rejoin no problem and remembering a dead node for very long will still cost less RAM than the current massive persistent backlog :D

I can try this first if it can make a difference. But if the backlog won't clear itself, I guess it's not worth it on its own.

This won't affect "detecting" a node as down, right? We're thinking reliable detection of liveness is going to be useful very soon (it is already, but we're not using it yet).

I agree. A small identity would definitely help disseminating knowledge faster and increase the speed at which a node discovers every member of the cluster.

It's how I envisioned custom broadcasts being used originally: you have a tiny id and send a custom broadcast with the extra metadata (like the NodeConfig thing in the broadcasting example)

While propagating that information via custom broadcasts works, it doesn't guarantee any kind of consistency. That's why we have all kinds of fallback mechanisms in our project (periodic synchronization) to ensure eventual consistency.

Makes sense though. I'll modify the identities!


Whew, that's going to be a lot of changes all at once πŸ˜„

commented

Ah, I see. The backlog won't ever clear itself? Even on restart?

The instance starts with no backlog, but as soon as it joins the cluster it starts receiving the updates from the other members; And I believe that a relatively small Config::remove_down_after coupled with this massive backlog is making nodes forget the knowledge too soon:

sqlite> select state, count(1) from foca group by 1;
Alive|310
Down|12076
Suspect|6

The backlog contains 12k identities that are down whilst total_members - active_members is ~6k. I think they are getting forgotten before it finishes propagating and then the node learns about them being down again as if it were new information.

I can try this first if it can make a difference. But if the backlog won't clear itself, I guess it's not worth it on its own.

I think it will help clearing the backlog, just not instantly - maybe several minutes until you start seeing the updates backlog going down.

If you don't wanna introduce a new PayloadKind yet (you will need it if you change the ids, else you'll see a storm of decode errors), maybe use something even larger like 24h, just to guarantee we can flush all the junk from the backlog. Then you can set it back to a smaller value if you want, but really, it can be +infinity with no noticeable impact besides a little more RAM usage that goes away as soon as you restart foca.

This won't affect "detecting" a node as down, right? We're thinking reliable detection of liveness is going to be useful very soon (it is already, but we're not using it yet).

Ah! The name of the configuration is pretty confusing, but no- remove_down_after has no impact on the ability to detect when a member goes down reliably.

It's more like a garbage collection thing. What it does is govern how long until foca allows a member with the same identity (including the bump) to rejoin the cluster after it was declared down. The only problem with a very high value is that, since the bump is initialised randomly, if you get unlucky and generate a bump that was used previously, it will fail to join.

One simple way to prevent that is to init with something that increases over time, like seconds since the unix epoch (it's 32 bits, but you can use a different epoch, shift it, etc if you wanna squeeze this into fewer bits 😁). A random u16 is still fine, just gotta keep in mind that this case might happen.
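
For example (plain std; this returns the full 32-bit value, shrink it however you like as mentioned above):

use std::time::{SystemTime, UNIX_EPOCH};

// Seed the bump from wall-clock time so a restart always picks a value
// that hasn't been used before by this address.
fn initial_bump() -> u32 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is before the unix epoch")
        .as_secs() as u32
}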

Do you think a different name would help make this clear? Maybe something verbose like Config::allow_down_rejoin_after?

I think it will help clearing the backlog, just not instantly - maybe several minutes until you start seeing the updates backlog going down.

Ok, I'll give that a shot first. I assume I'll have to deploy this everywhere? Or should I be able to see the effect on a single node's updates backlog?

If you don't wanna introduce a new PayloadKind yet (you will need it if you change the ids, else you'll see a storm of decode errors), maybe use something even larger like 24h, just to guarantee we can flush all the junk from the backlog. Then you can set it back to a smaller value if you want, but really, it can be +infinity with no noticeable impact besides a little more RAM usage that goes away as soon as you restart foca.

I can do a new PayloadKind without much trouble. The harder / unrelated part is reducing the identity. A lot of stuff relies on knowing the extra metadata on the identity. It's a very large code change and then I have to figure out how to reliably propagate the extra metadata. This is likely going to be fine in the end, just more time than I can spend on it right now.

Do you think a different name would help make this clear? Maybe something verbose like Config::allow_down_rejoin_after?

It's fine, I think the current name is good πŸ˜„ I was just making sure I got how it worked right.

I will post an update soon.

That's just with the remove_down_after change to 2 days.

[image: updates_backlog graph after the change]

commented

Haha amazing! Zero with occasional tiny spikes is what we wanna see all the time.

I was way off on how long it would take, but I think it makes sense it sorted out this fast: since its backlog filled very fast, it also learned that all those nodes were really down immediately so once it sent these updates max_transmissions times, it never re-learned them. What I forgot to account for when thinking about this was the periodic gossip, it makes shipping cluster updates significantly faster. Pretty cool being wrong this time πŸ˜€

I'll definitely change the default remove_down_after values again

Regarding leaving the cluster on shutdown: what's the best way to do that? When we restart our program, there's no blue/green deploy strategy. It just shuts down gracefully and then a new instance comes up. They don't share the gossip address via SO_REUSEPORT or anything like that.

Basically, I'm wondering:

  • If the node leaves the cluster, does it have to wait a little bit until it has transmitted that fact to the rest of the cluster?
  • Should I overlap the 2 instances while the old one is telling the cluster it is leaving?

This is what happened to our "gossip bandwidth" with the new remove_down_after value (I have now deployed the change everywhere):

[image: gossip bandwidth graph]

Even with a mostly empty updates backlog, a restart causes a very slow discovery of the whole cluster. I imagine smaller identities can help there, but how much will it really help? I could see fitting ~4-5x more messages in a single payload if everything is done right.

[image]

commented

I think there's no need to do anything fancy when leaving the cluster- The node will try to gossip to a few members the moment you call leave_cluster and reaching one is more than enough to spread the knowledge. It's also fine to overlap them, the only problem would be if they somehow end up with the exact same identity including the bump, because then the Down state wins.

Thinking it through: Given that there will be a new node with the same addr and you have periodic announce enabled, the new node will likely learn about the old identity and declare it down faster than a ping cycle would (and from a membership discovery perspective, the nodes are technically the same since it's the same addr so no real harm done). It's more useful for a shutdown scenario, where you need to take the member out of the roster permanently.

This gossip bandwidth is another sweet looking graph! 3MB/s is probably worse than a simple system where every node pings the whole cluster all the time πŸ˜… Now it's pretty clear why memberlist never removes their down nodes by default- a lot safer to default to strict and let users opt in on relaxing the constraint.

Even with a mostly empty updates backlog, a restart causes a very slow discovery of the whole cluster. I imagine smaller identities can help there, but how much will it really help? I could see fitting ~4-5x more messages in a single payload if everything is done right.

Right now I think the feed messages are fitting about 10 members each; I'd say even doubling that would get nodes close to the total number a lot faster. The members that are sent are random though, so converging to the full cluster size may take a while.

Another thing you can try is increasing the frequency/num_members for Config::periodic_announce.
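
It's just a config tweak on top of new_wan - I'm assuming the params struct shape here (frequency + how many members to announce to per tick), and the numbers are purely illustrative:

use std::num::NonZeroUsize;
use std::time::Duration;

// Illustrative numbers only: announce to a couple of random members
// every 15s instead of the new_wan() defaults.
fn tune_announce(config: &mut foca::Config) {
    config.periodic_announce = Some(foca::PeriodicParams {
        frequency: Duration::from_secs(15),
        num_members: NonZeroUsize::new(2).unwrap(),
    });
}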

If you wanna discover all members super fast, the best approach is doing it externally: poke another member (via tcp maybe, so you can send everything at once without worry), get its members and foca.apply_many into the new instance. I think for this it would be best to use the correct values for Incarnation and State, but foca doesn't expose that - I'm happy changing it to expose the full Member struct so you can feed it directly to apply_many.

If you wanna discover all members super fast, the best approach is doing it externally: poke another member (via tcp maybe, so you can send everything at once without worry), get its members and foca.apply_many into the new instance. I think for this it would be best to use the correct values for Incarnation and State, but foca doesn't expose that - I'm happy changing it to expose the full Member struct so you can feed it directly to apply_many.

Sounds good. We definitely know about all nodes from our local, persistent, DB. We could pre-seed that list for sure! We do not have incarnation or any other foca-specific data about them though. Exposing this would be great. We could instantly learn about all other nodes.

commented

I'm baking a new release with an increased remove_down_after default (24h) and making foca expose State / Incarnation via iter_members. Will close this issue when done

As always, feel free to reopen 😁

foca expose State / Incarnation via iter_members.

Am I understanding correctly that we should be tracking the incarnation and state of members in our internal store so that when we start we can set them as we last remembered them?

commented

I see two possible issues with using old data:

  1. It may increase the false positives (declaring a node as Down when it isn't), because the member being suspected may ignore the signal. It's unlikely to happen given that incarnation/state are always being kept in sync within the cluster, and not too harmful since members can auto-rejoin.
  2. It may lead to an incorrect member count: your cold storage says member A is active; the cluster decides it's down; when B starts up with the cold data, it will be the only member in the cluster thinking A is active and will only remove it after pinging it, suspecting it and then finally declaring it down 1

So my answer is yes, probably πŸ˜…

The best way to use this would be in a synchronous scenario: you pick a random node from the cluster, ask it directly via tcp for this info. Then the issues above can't happen.

If you wanna rely on the internal store and wanna play it safe, I'd say only load members that are State::Alive so the risk of the issues above happening is minimised.
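
Something along these lines - PersistedMember is whatever record you keep in your db, not a foca type:

// Made-up persisted record; `I` stands in for your identity type. Drop
// everything that wasn't Alive last time you saw it before seeding.
struct PersistedMember<I> {
    identity: I,
    incarnation: u16,
    state: String, // "Alive" / "Suspect" / "Down", as stored
}

fn seedable<I>(members: Vec<PersistedMember<I>>) -> Vec<PersistedMember<I>> {
    members
        .into_iter()
        .filter(|m| m.state == "Alive")
        .collect()
}

// ...then hand the result to foca.apply_many() on startup.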

Footnotes

  1. I think we could implement something similar to what was done here, but notifying the sender instead to ensure this state gets fixed super fast, but it may increase the cluster chatter when nodes go down; Not as simple to implement, but not too difficult either. Not so sure how that would play out in a disaster scenario. I quite like the idea tho, will give it a think! ↩

Ok so we're now doing this, basically storing the member and their state in a persistent data store.

However, it looks like when restarting a cluster, it's possible to detect a node as down for a while. This is likely for the same reason propagating members takes a long time.

Example of this happening on startup when restarting all nodes in close temporal proximity:

  • Load all members from persistent store
  • apply_many all identities
  • This triggers all the MemberUp notifications
  • Announce to the cluster
  • If you get a MemberDown notification right away for a node, the counter goes down to 0 and therefore the node is "removed" from the in-memory list of members
  • It might take a long time for the new node's identity to reach the current node and therefore it stays marked as "down".

I'm not sure how to solve this exactly. My ideas are:

  • Add an extra grace period before marking the node as down, but this defeats the purpose a little bit
  • Make a smaller Identity so that information propagates faster
  • Tweak the config to gossip identities a lot more

I feel like the last option would be the easiest since we have bandwidth to spare.

commented

Interesting! I wonder if this is indeed a "lost" member update (they changed identities, we got the previous one going down, never got the new identity going up) or something else I'm failing to see. Say you're saving a partial snapshot and restoring it: does the counter-based logic work well at all with this? i.e.: is this members HashMap<SocketAddr, usize> actually the same thing as constructing the same map via foca.iter_members()? I think so, but maybe I'm missing something...
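
Just so we're talking about the same thing, this is the kind of counting I'm picturing on your end (purely illustrative):

use std::collections::HashMap;
use std::net::SocketAddr;

// Count how many identities we currently consider up per address; an
// address is an "effective member" while its count is above zero.
#[derive(Default)]
struct EffectiveMembers {
    counts: HashMap<SocketAddr, usize>,
}

impl EffectiveMembers {
    // called on Notification::MemberUp
    fn member_up(&mut self, addr: SocketAddr) {
        *self.counts.entry(addr).or_insert(0) += 1;
    }

    // called on Notification::MemberDown
    fn member_down(&mut self, addr: SocketAddr) {
        if let Some(c) = self.counts.get_mut(&addr) {
            *c = c.saturating_sub(1);
            if *c == 0 {
                self.counts.remove(&addr);
            }
        }
    }

    fn len(&self) -> usize {
        self.counts.len()
    }
}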

Not sure about changing suspect_to_down_after: maybe it would make it happen less when you're staring at it, but if there's a problem with propagating information, I'd go for bumping max_transmissions instead of relaxing the down detection timer.

Smaller identities and more frequent announcing will definitely help with propagation, but as I write this I just realized one thing:

When crafting a Feed message (the reply to Announce), since there's some shuffle + round-robin dance happening during probe, foca relies on the current state of the members storage to pick which members to include in the reply. That works alright for simple scenarios, but your case has many nodes and large identities which means that the Feed message doesn't change very often (assuming a stable cluster, it will only change after num_members * probe_period)

I'll address that soon (tomorrow, likely) by making foca pick random members instead of relying on the internal state. Will definitely help discovering the whole cluster.

commented

released v0.10.1, making feed messages always randomize the chosen members ^

if the problem is indeed just discovering the last few members of the cluster, this should help speed up the discovery of the "lost" member(s)

It might've helped a bit, but it's hard to tell.

Getting a false down detection is pretty bad for what I've been cooking up and that's why I've been wondering about all of this.

In my latest attempt, I'm using an exponential backoff (no limits on retries) from 10ms to 2s (max) so when a node comes up it gossips a lot more and then less and less, eventually settling down, calling foca.gossip() every 2 seconds.
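
Roughly this, running in its own tokio task - do_gossip stands in for however we already lock our foca handle, call foca.gossip() and ship the resulting packets:

use std::time::Duration;

// Simplified: gossip aggressively right after startup, then back off
// exponentially until we settle at one gossip every 2 seconds.
async fn gossip_loop(mut do_gossip: impl FnMut()) {
    let mut delay = Duration::from_millis(10);
    let max = Duration::from_secs(2);

    loop {
        do_gossip();
        tokio::time::sleep(delay).await;
        delay = (delay * 2).min(max);
    }
}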

Unfortunately, that still sometimes means the cluster will get a Down event and then it'll take a few minutes before getting an Up from the same node, therefore detecting the node as alive again. In practice, these restarts are near instantaneous. I'm not calling leave_cluster btw, not sure if that matters. Maybe leaving does something special I'm not aware of.

Maybe I should try a different deploy strategy? I'm fully shutting down my program and then starting it again when I deploy. This takes ~5s in the worst case, but usually 1s at most.

We're using the default new_wan config, except we're setting a max_packet_size and a very long remove_down_after (2 days). We have ~417 nodes right now.

Looks like with a cluster of 417 and the wan config, we get a suspect_to_down_after value of 15.72s. However, when I restart 30 nodes at a time (during a deploy), I get many "down" nodes on various nodes. Not all of them. Some of them get "lucky" (I assume) and the message ordering / randomness is better for them.

Nodes are not down for 15 seconds; they're down at most 5s, if at all. I get that this is based on SWIM messages, so even if a node isn't actually down for that long, not getting a certain message within 15s means it will be detected as down.

Should I just try a longer suspect_to_down_after? Perhaps 120s (this is the value we use for our host monitoring to determine if hosts are really down, but it doesn't involve anything like SWIM - an agent is just "pinging" the host).

commented

Interesting! Keep in mind that any cluster member can declare another down and messages may arrive unordered, doing something destructive based on that may lead to interesting scenarios. Right now you get Notification::MemberDown whenever foca learns that an identity is down, but maybe for this case you'd be more interested in the actual member(s) that initiated the whole process? i.e. everyone learns about a member being down, but it's just a few (usually just one) that actually declare them down.

I think the behavior you're seeing is caused by the effects of an unclean cluster exit: if you restart without foca.leave_cluster() you end up increasing the cluster size by one until someone decides to ping and detects that the old identity is down.

So when you restart a machine that knows about ID{old} but not ID{new} it leads to this situation of the counter going to zero and the effective cluster size going down; Then it takes a long while to discover ID{new}. My guess is that this is due to multiple things: the Feed bug from above + large identities + very fast rate of gossiping, removing the "knowledge" from the backlog too quickly, but it's hard to say.

Loose ideas:

  • If you can guarantee fast restarts, maybe instead of restarting with a new identity you reuse the current one (bump included). This way there should be one less down event happening.
  • Implement a tcp-based sync where you can send the full member set instead of just whatever can fit in a udp packet and use it for joining (instead of persisting to a db)
  • Try increasing Config::num_indirect_probes: false positives are bound to happen, but if it is related to flaky network, asking more peers to ping a suspect member for you should help
  • See how the cluster behaves with just Config::new_wan, no runtime config changes, no ad-hoc gossiping: so many little things changed over the past releases that I wonder if anything in place still stands on its own

RE down detection, 15s deadline, etc: it's sort of unavoidable given the nature of how the information propagates: if a node fails to reply to a ping or an indirect ping, they'll need to learn about their id being suspected via the random gossip propagation, then they'll have to refute it and this refutation will need to reach the one that originated the suspicion... the tail here is looooooong. memberlist has a mechanism (lifeguard) that sets this deadline to very high values and shrinks it based on confirmations from other peers - we could have that too, with some effort.

I think the behavior you're seeing is caused by the effects of an unclean cluster exit: if you restart without foca.leave_cluster() you end up increasing the cluster size by one until someone decides to ping and detects that the old identity is down.

Do you think this would help us significantly? I'm happy to add it! Does the cluster expect the node to rejoin shortly if it leaves and thus does not declare it as down?

  • If you can guarantee fast restarts, maybe instead of restarting with a new identity you reuse the current one (bump included). This way there should be one less down event happening.

I can try that.

How fast would it have to be? We're gracefully shutting down, but during that time we keep gossiping. When we start up, there are various steps that can occasionally take minutes (like restoring from a backup). In these cases I assume it would be best to renew the membership instead of reusing.

  • Try increasing Config::num_indirect_probes: false positives are bound to happen, but if it is related to flaky network, asking more peers to ping a suspect member for you should help

We definitely have many regions where networking isn't as solid and is prone to breaking more often. I can give that a shot too.

RE down detection, 15s deadline, etc: it's sort of unavoidable given the nature of how the information propagates: if a node fails to reply to a ping or an indirect ping, they'll need to learn about their id being suspected via the random gossip propagation, then they'll have to refute it and this refutation will need to reach the one that originated the suspicion... the tail here is looooooong. memberlist has a mechanism (lifeguard) that sets this deadline to very high values and shrinks it based on confirmations from other peers - we could have that too, with some effort.

That's interesting. I'm not saying you should do it though :)

The more I think about it, the more I think what we really want is to know which nodes are down from the point of view of the current node, not from the whole cluster. We're already pinging every node outside of foca to determine round-trip time and packet loss %. I think I'll probably add down detection there instead. The idea would be: some functionality needs to directly reach other nodes, and if they're down from the perspective of the current node we should use other "routes" to the same resources.

commented

I think the behavior you're seeing is caused by the effects of an unclean cluster exit: if you restart without foca.leave_cluster() you end up increasing the cluster size by one until someone decides to ping and detects that the old identity is down.

Do you think this would help us significantly? I'm happy to add it! Does the cluster expect the node to rejoin shortly if it leaves and thus does not declare it as down?

I think so. Not because there's any sort of logic expecting such a pattern, but because if you restart 30 nodes without leaving, these 30 old identities will still be probed, will still receive updates, will still be possible targets for indirect probing, etc. You can think of it in terms of packet loss: every unclean exit increases the loss rate - if you restart the whole cluster, until every old identity is reaped you essentially have a 50% virtual packet loss (not counting the real network packet loss)

How fast would it have to be? We're gracefully shutting down, but during that time we keep gossiping. When we start up, there are various steps that can occasionally take minutes (like restoring from a backup). In these cases I assume it would be best to renew the membership instead of reusing.

If Config::notify_down_members is true (it is, for new_wan()) and identities are renewable (the bump) you should be able to always reuse the identity, regardless of time. Then if a node reappears with an identity that's down, the cluster will notify them and foca will automatically change the id and join correctly.

Perhaps always reusing the previous id is the best strategy for your scenario since you know before startup every address that should be in the cluster. This way, you'd only see a down event on restart if it takes too long (hard to be precise here, the sweet spot to avoid a false down event is likely somewhere below probe_period + probe_rtt - network_rtt) and even so, it would recover quickly.
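
i.e. something like this on startup - the persistence here is made up (a string stand-in for your serialized identity); only the reuse-or-mint decision matters:

use std::fs;
use std::path::Path;

// Reuse the exact identity (bump included) from the previous run; only
// mint a fresh one on the very first boot.
fn identity_for_startup(path: &Path) -> std::io::Result<String> {
    match fs::read_to_string(path) {
        Ok(saved) => Ok(saved), // previous id, bump included
        Err(_) => {
            let fresh = mint_fresh_identity(); // first boot only
            fs::write(path, &fresh)?;
            Ok(fresh)
        }
    }
}

// Stand-in for however you build addr + initial bump today.
fn mint_fresh_identity() -> String {
    "addr+bump".to_string()
}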

The more I think about it, the more I think what we really want is to know which nodes are down from the point of view of the current node, not from the whole cluster. We're already pinging every node outside of foca to determine round-trip time and packet loss %. I think I'll probably add down detection there instead. The idea would be: some functionality needs to directly reach other nodes, and if they're down from the perspective of the current node we should use other "routes" to the same resources.

Agreed RE routing/proxying. I think we should be able to rely on swim for actual down detection for your scenario if we switch to reusing the previous identities. Then the node responsible for alerting/(re)provisioning/etc could be just another cluster member with more powers - I think it would even be possible to lay out some node-to-node rtt estimation on top of the custom broadcast functionality. But, evidently, we're not there yet :D If you already have a system for healthchecks and whatnot, relying on it for down detection sounds best