Names are not shared between nodes after a conflict resolution

Question

lud opened this issue 3 years ago

Testing with two nodes, after a conflict resolution, syn:whereis_name will sometimes return undefined on one of the two nodes. It can be the node where the process was shut down (the general case) or the other node. Sometimes it also works as intended, and the call returns the same pid on both nodes (in its local or remote form, depending on the node where the function is called).

Please see this commit where I tried to implement a test: lud@8c29218.

Roberto Ostinelli · Answer 1 · Tue Nov 02 2021 21:30:20 GMT+0800 (China Standard Time)

You have an issue in your resolution method. You've defined it like this:

-module(syn_test_event_handler_resolution_stop).
-behaviour(syn_event_handler).

-export([resolve_registry_conflict/4]).

resolve_registry_conflict(_Scope, _Name, {Pid1, _, Time1}, {Pid2, _, Time2}) when Time1 > Time2 ->
    ok = proc_lib:stop(Pid1),
    Pid2;

resolve_registry_conflict(_Scope, _Name, {Pid1, _, Time1}, {Pid2, _, Time2}) when Time1 < Time2 ->
    ok = proc_lib:stop(Pid2),
    Pid1.

You need to think concurrently. With your definition, proc_lib:stop/1 will be called twice, once on node1 and once on node2. However, the call only succeeds the first time: the second call will raise an error because the process has already been stopped, see here.

You may look at the logs, you should see things like:

SYN[syn_slave_1@rob] Error exit in custom handler resolve_registry_conflict: noproc

or

SYN[syn_slave_1@rob] Error exit in custom handler resolve_registry_conflict: {normal,
                                                                              {sys,
                                                                               terminate,
                                                                               [<8335.111.0>,
                                                                                normal,
                                                                                infinity]}}

As per the docs:

Important Note: the conflict resolution method will be called on the two nodes where the conflicting processes are running on. Therefore, this method MUST be defined in the same way across all nodes of the cluster and have the same effect regardless of the node it is run on, or you will experience unexpected results.

https://hexdocs.pm/syn/syn_event_handler.html

Therefore, the issue is that your method does not return the same thing on every node, because it crashes on one of them. So, you get inconsistent results.

You can solve it by doing something like:

resolve_registry_conflict(_Scope, _Name, {Pid1, _, Time1}, {Pid2, _, Time2}) when Time1 >= Time2 ->
    %% Parentheses around catch keep this valid on OTP releases before 24.
    _ = (catch proc_lib:stop(Pid1)),
    Pid2;

resolve_registry_conflict(_Scope, _Name, {Pid1, _, Time1}, {Pid2, _, Time2}) when Time1 < Time2 ->
    _ = (catch proc_lib:stop(Pid2)),
    Pid1.

or, better yet:

resolve_registry_conflict(_Scope, _Name, {Pid1, _, Time1}, {Pid2, _, Time2}) when Time1 >= Time2 ->
    case node(Pid1) =:= node() of
        true -> ok = proc_lib:stop(Pid1);
        false -> ok
    end,
    Pid2;

resolve_registry_conflict(_Scope, _Name, {Pid1, _, Time1}, {Pid2, _, Time2}) when Time1 < Time2 ->
    case node(Pid2) =:= node() of
        true -> ok = proc_lib:stop(Pid2);
        false -> ok
    end,
    Pid1.

I appreciate your input, but please consider my recommendation: if you need to do some cleanup that doesn't have to happen in the process itself, consider using on_process_unregistered/5 instead. That way you don't have to handle registry resolution yourself, which is harder to get right.
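As a rough illustration of that recommendation, a handler built on on_process_unregistered/5 might look like the sketch below. The module name my_cleanup_handler and the logging call are hypothetical, and the argument order (scope, name, pid, metadata, reason) is an assumption that should be checked against the syn_event_handler docs linked above. The module would still need to be configured as syn's event handler as those docs describe.

```erlang
-module(my_cleanup_handler).  % hypothetical module name
-behaviour(syn_event_handler).

-export([on_process_unregistered/5]).

%% Assumed to be invoked by syn once a name is no longer registered,
%% for example after a conflict resolution removed it.
%% Cleanup that lives outside the process itself would go here.
on_process_unregistered(Scope, Name, Pid, _Meta, Reason) ->
    logger:info("~p unregistered from scope ~p (pid ~p, reason ~p)",
                [Name, Scope, Pid, Reason]),
    ok.
```

Because this callback does not decide which process survives, it cannot leave the registry in an inconsistent state the way a crashing resolve_registry_conflict/4 can.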

Ludovic Dem · Answer 2 · Tue Nov 02 2021 21:57:38 GMT+0800 (China Standard Time)

I have absolutely no error reports with make test unfortunately. So alright, it seems to work fine, although a badly implemented handler will leave syn in an inconsistent state.

My process will have to clean up itself (finish monitoring stuff and exit; it just needs to avoid exiting too quickly so that some other linked processes can finish. There is no cleanup outside of the process.)

Sorry for the hassle.

Roberto Ostinelli · Answer 3 · Tue Nov 02 2021 22:03:00 GMT+0800 (China Standard Time)

I have absolutely no error reports with make test unfortunately.

You are using slave nodes, so of course you need to look into their logs. One way to do so is by calling send_error_logger_to_disk/0 in the main syn application and then opening the corresponding log files.

So alright, it seems to work fine, although a badly implemented handler will leave syn in an inconsistent state.

Nobody ever said that distributed programming is easy :) This is a constraint of the resolution mechanism; the alternative is to use global locks, which is what syn v2 was using, and that doesn't scale well.

My process will have to clean up itself (finish monitoring stuff and exit; it just needs to avoid exiting too quickly so that some other linked processes can finish. There is no cleanup outside of the process.)

Sounds good, you have a solution for that posted above.

Ludovic Dem · Answer 4 · Tue Nov 02 2021 22:07:04 GMT+0800 (China Standard Time)

Haha, I did not know much about slave nodes before yesterday. I also discovered that Ctrl+C'ing the tests leaves nodes alive and that epmd -stop does not stop them, so I had to kill the OS processes.

Also, I did not use the >= operator to check time ordering because in my manual tests yesterday I could not be sure that the arguments had the same order on both nodes. Although there is metadata in such cases.
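One way to make the choice independent of argument order could be to compare a composite key instead of the timestamps alone. The sketch below is only an illustration: the module name my_conflict_handler and the helper pick/2 are hypothetical, and it assumes, like the handlers above, that the older registration should survive. Ties on the timestamp are broken by node name and then by pid, so both nodes reach the same decision even if they receive the two entries in opposite order.

```erlang
-module(my_conflict_handler).  % hypothetical module name
-behaviour(syn_event_handler).

-export([resolve_registry_conflict/4, pick/2]).

%% Returns {Keep, Stop}. The entry with the smaller {Time, Node, Pid}
%% key is kept, so swapping the two arguments gives the same answer.
pick({Pid1, Time1}, {Pid2, Time2}) ->
    Key1 = {Time1, node(Pid1), Pid1},
    Key2 = {Time2, node(Pid2), Pid2},
    case Key1 =< Key2 of
        true -> {Pid1, Pid2};
        false -> {Pid2, Pid1}
    end.

resolve_registry_conflict(_Scope, _Name, {Pid1, _, Time1}, {Pid2, _, Time2}) ->
    {Keep, Stop} = pick({Pid1, Time1}, {Pid2, Time2}),
    %% Only the node that hosts the losing process stops it.
    case node(Stop) =:= node() of
        true -> ok = proc_lib:stop(Stop);
        false -> ok
    end,
    Keep.
```

With a single clause and a symmetric comparison there is no need to choose between > and >=, and the case where both timestamps are equal is covered.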