Timeouts from remote nodes don't have any consequence

Question

ephe-meral opened this issue 8 years ago · comments

I'm using Syn in an environment where hosts are deployed dynamically, i.e. they join and leave the cluster without manual configuration and without control from within the node. Sometimes this leads to inconsistencies, where processes that were registered on a remote node are still in the local table but do not respond because the remote node doesn't respond.
This leads to timeouts, even when trying to unregister the name:

** (stop) exited in: :gen_server.call({:syn_registry, :"api@172.30.31.190"}, {:unregister_on_node, {:my_proc, "some-name"}})

Now I would assume that these timeouts eventually lead to a check of the node and to it being removed from the known cluster, but apparently it just stays there and the entries remain broken. In my case this leads to an endless loop of start -> try to unregister the name (to be able to register oneself) -> timeout -> crash -> restart -> ...
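To make the loop concrete, here is a minimal sketch of how the caller could turn the exit into a return value instead of crashing. `UnregisterGuard` is a hypothetical helper, not part of Syn, and it assumes the exit is raised in the calling process by the `:gen_server.call` shown above. It only breaks the crash/restart cycle; the stale entry itself is still there.

```elixir
defmodule UnregisterGuard do
  # Hypothetical helper, not part of Syn.
  # Runs `fun` and converts the exits that a remote :gen_server.call
  # can raise into tagged tuples, so the caller can decide what to do.
  def guard(fun) when is_function(fun, 0) do
    {:ok, fun.()}
  catch
    # The remote process did not answer within the call timeout.
    :exit, {:timeout, _call} -> {:error, :timeout}
    # The connection to the remote node went down during the call.
    :exit, {{:nodedown, down_node}, _call} -> {:error, {:nodedown, down_node}}
    # The remote process does not exist.
    :exit, {:noproc, _call} -> {:error, :noproc}
  end
end
```

Assuming the unregister call is the one that times out, it would be wrapped as `UnregisterGuard.guard(fn -> :syn.unregister({:my_proc, "some-name"}) end)`, and an `{:error, :timeout}` result could then be handled (retry later, log, escalate) without taking the process down.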

So the question is, is there already some function I can call from the point of the node that detects the timeouts, or can we maybe improve syn so it can handle these situations more automatically?

Edit: Some more info on this: the node in question seems to be alive (i.e. running in a way that it's not kicked from the Erlang cluster), but at least the app running our business logic has crashed internally, hangs, or something similar, which results in the timeouts.

Roberto Ostinelli · Answer 1 · Sat Nov 05 2016 01:13:51 GMT+0800 (China Standard Time)

Thank you for your report.

This should take care of your case. If it doesn't, then I will need a way to reproduce what you're seeing (in a test) so that we can fix this. Can you provide a way to reproduce these errors?

Johanna Appel · Answer 2 · Sat Nov 05 2016 02:05:31 GMT+0800 (China Standard Time)

Thanks for the hint, I'll need to evaluate this and report back :)
I also don't understand the full detail of why the node was still in the cluster but processes were not reachable, so I'll have to investigate that as well... but it seems it can happen when e.g. messages queue up in one process and memory runs out.

Roberto Ostinelli · Answer 3 · Tue Nov 08 2016 22:48:20 GMT+0800 (China Standard Time)

Ok, let me know.

Johanna Appel · Answer 4 · Fri Nov 11 2016 02:01:34 GMT+0800 (China Standard Time)

@ostinelli I tried to find out what the exact state of the stuck VM was, but it seems it's impossible to reconstruct now. Since the node doesn't get kicked from the cluster, there is probably also no single best way to fix this. The only thing we get is timeouts... we could escalate and at some point kick the node manually, but I guess that's not something Syn can do generically. So I guess this issue cannot be solved generically.

Roberto Ostinelli · Answer 5 · Fri Nov 11 2016 02:38:41 GMT+0800 (China Standard Time)

@ephe-meral not sure if this can help you, but I use Syn jointly with Cowbell which handles node timeouts and reconnections.

What kind of timeouts are you receiving? Maybe it would be interesting to upgrade Cowbell to kick a node and reconnect if some timeout is received. That would force a Mnesia down event, and Syn would then do the proper thing.
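A rough sketch of what such an escalation could look like on the application side, under stated assumptions: `TimeoutEscalator`, the threshold of 3 and the injectable `disconnect` function are all hypothetical and are not part of Cowbell or Syn. The only real call used is Elixir's `Node.disconnect/1`, which forces a disconnection from the given node.

```elixir
defmodule TimeoutEscalator do
  # Hypothetical sketch; neither Cowbell nor Syn provides this module.
  # `counts` maps a node name to its number of consecutive timeouts.
  @threshold 3

  # Records one timeout for `target`. Once the threshold is reached,
  # the node is disconnected and its counter is reset.
  def record_timeout(counts, target, disconnect \\ &Node.disconnect/1) do
    count = Map.get(counts, target, 0) + 1

    if count >= @threshold do
      disconnect.(target)
      {:kicked, Map.delete(counts, target)}
    else
      {:ok, Map.put(counts, target, count)}
    end
  end

  # A successful call clears the counter for `target`.
  def record_success(counts, target), do: Map.delete(counts, target)
end
```

The counter map would live in whatever process observes the timeouts. Reconnecting afterwards is left out here, since that is the part Cowbell is described as handling.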

Roberto Ostinelli · Answer 6 · Fri Nov 11 2016 02:42:47 GMT+0800 (China Standard Time)

Closing for now, please reopen if you have more info.

Johanna Appel · Answer 7 · Fri Nov 11 2016 04:31:35 GMT+0800 (China Standard Time)

Ah thanks, so I'll see if Cowbell fits that scenario then :)