ostinelli / syn

A scalable global Process Registry and Process Group manager for Erlang and Elixir.

Mixed-Cluster Versioning

DmitryKakurin opened this issue

We have a cluster of nodes running our product that is compiled with syn 2.0.1.
I'm trying to update the nodes one by one (using a rolling update) to a newer version of our product compiled with syn 2.1.1.
But new nodes fail to start because of the following syn failure:

2020-06-17 00:36:26.923 [info ] Syn(:"router@10.244.1.54"): Terminating with reason: {:function_clause, [{:syn_registry, :"-registry_automerge/2-fun-0-", [{"GlobalSelfScaler", #PID<37561.2151.0>, :undefined}], [file: '/opt/app/deps/prod/syn/src/syn_registry.erl', line: 648]}, {:lists, :foreach, 2, [file: 'lists.erl', line: 1338]}, {:syn_registry, :"-registry_automerge/2-fun-1-", 2, [file: '/opt/app/deps/prod/syn/src/syn_registry.erl', line: 652]}, {:global, :trans, 4, [file: 'global.erl', line: 425]}, {:syn_registry, :handle_info, 2, [file: '/opt/app/deps/prod/syn/src/syn_registry.erl', line: 406]}, {:gen_server, :try_dispatch, 4, [file: 'gen_server.erl', line: 637]}, {:gen_server, :handle_msg, 6, [file: 'gen_server.erl', line: 711]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 249]}]}
	| (pid=<0.1528.0> )
2020-06-17 00:36:26.923 [error] GenServer :syn_registry terminating
** (FunctionClauseError) no function clause matching in anonymous fn/1 in :syn_registry.registry_automerge/2
    (syn 2.1.1) /opt/app/deps/prod/syn/src/syn_registry.erl:648: anonymous fn({"GlobalSelfScaler", #PID<37561.2151.0>, :undefined}) in :syn_registry.registry_automerge/2
    (stdlib 3.12.1) lists.erl:1338: :lists.foreach/2
    (syn 2.1.1) /opt/app/deps/prod/syn/src/syn_registry.erl:652: anonymous fn/2 in :syn_registry.registry_automerge/2
    (kernel 6.5.2) global.erl:425: :global.trans/4
    (syn 2.1.1) /opt/app/deps/prod/syn/src/syn_registry.erl:406: :syn_registry.handle_info/2
    (stdlib 3.12.1) gen_server.erl:637: :gen_server.try_dispatch/4
    (stdlib 3.12.1) gen_server.erl:711: :gen_server.handle_msg/6
    (stdlib 3.12.1) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:nodeup, :"router@10.244.4.31"}
State: {:state, :syn_event_handler, :undefined, :undefined, #PID<0.1530.0>}	| (pid=<0.1528.0> module=gen_server function=error_info/7 line=889 )
2020-06-17 00:36:26.924 [error] Process :syn_registry (#PID<0.1528.0>) terminating
** (FunctionClauseError) no function clause matching in anonymous fn/1 in :syn_registry.registry_automerge/2
    (syn 2.1.1) /opt/app/deps/prod/syn/src/syn_registry.erl:648: anonymous fn({"GlobalSelfScaler", #PID<37561.2151.0>, :undefined}) in :syn_registry.registry_automerge/2
    (stdlib 3.12.1) lists.erl:1338: :lists.foreach/2
    (syn 2.1.1) /opt/app/deps/prod/syn/src/syn_registry.erl:652: anonymous fn/2 in :syn_registry.registry_automerge/2
    (kernel 6.5.2) global.erl:425: :global.trans/4
    (syn 2.1.1) /opt/app/deps/prod/syn/src/syn_registry.erl:406: :syn_registry.handle_info/2
    (stdlib 3.12.1) gen_server.erl:637: :gen_server.try_dispatch/4
    (stdlib 3.12.1) gen_server.erl:711: :gen_server.handle_msg/6
    (stdlib 3.12.1) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Initial Call: :syn_registry.init/1

That is correct, it's a feature added in 2.1.

@ostinelli what is the recommended way of migrating always-running systems that cannot be stopped from 2.0 to 2.1, then?

For a distributed and gradual hot code upgrade (what you are asking for), I need to provide a patch to syn. I need to evaluate the impact and see if I can find the time to add it.

ATM syn must have the same version running on all nodes.

Thank you for considering a fix Roberto!
To possibly reduce the scope for your change, we don't use Erlang's hot code swapping. We have a cluster of multiple nodes, where the cluster as a whole must always be up, but individual nodes can go down at any moment. So we just do a rolling update to release new versions, where we stop nodes one by one and restart them running a newer version of our software.
Another thought: we would be OK doing a two-step upgrade, first to another 2.0.x version that is forward-compatible with 2.1, and then to 2.1.

I've looked at the internal differences between syn 2.0 and 2.1, and their implications, to evaluate what you're asking for. Syn has evolved considerably between the two versions. Mainly: mnesia was dropped in favor of ETS, and syn now uses a better syncing mechanism (which has different internals). You can read the announcement here.

Mixed-cluster versioning (what you are asking for) is a feature that needs to be built in from the beginning, and unfortunately it cannot be achieved with a patch at this time. Upgrading the distributed functions of a cluster gradually is an understandable request, but up to now I've never needed to support it, so it is not baked into syn v2.

The way clusters have been upgraded in the production deployments I've used (and seen used by third parties) has been to deploy a parallel cluster running all the new versions (including syn), and then switch from the old cluster to the new one (switch proxy configurations / change DNS resolution / ...). Once everything is up, the old cluster is shut down. This also allows upgrading everything that does not support mixed-cluster versioning, for instance even the Erlang version itself (though AFAIK it is doable, I've hit quite a few quirks when running different versions in the same cluster).

I'm currently working on syn v3, and I am now versioning messages sent between nodes, like so. This should allow mixed-cluster versioning and hopefully cover this case.
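To illustrate the general idea of versioning inter-node messages, here is a minimal sketch. It is not syn's actual wire format or code: the module name `versioned_sync`, the `sync` tag, the version numbers and the 4-tuple entry shape with a trailing timestamp are all hypothetical. Only the 3-tuple `{Name, Pid, Meta}` shape is taken from the crash log above.

```erlang
%% Hypothetical sketch of versioned inter-node messages; not syn's real protocol.
-module(versioned_sync).
-export([encode/1, decode/1]).

%% Assumed version of the wire format spoken by this node.
-define(CURRENT_VERSION, 2).

%% Tag an outgoing payload with the sender's format version.
encode(Entries) ->
    {sync, ?CURRENT_VERSION, Entries}.

%% Version 1 (assumed): entries are {Name, Pid, Meta}.
%% Upgrade each one to the newer shape with a default timestamp of 0.
%% Entries that are not 3-tuples are silently skipped by the generator pattern.
decode({sync, 1, Entries}) ->
    {ok, [{Name, Pid, Meta, 0} || {Name, Pid, Meta} <- Entries]};
%% Version 2 (assumed): entries already carry a timestamp.
decode({sync, 2, Entries}) ->
    {ok, Entries};
%% Unknown versions are reported instead of raising function_clause.
decode({sync, Version, _Entries}) ->
    {error, {unsupported_version, Version}};
decode(_Other) ->
    {error, malformed}.
```

With an explicit version in every message, a newer node can translate or reject data from an older node instead of terminating on the first tuple it does not recognise.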

Thank you for your input and patience; it will still take me a little while to complete.