rabbitmq / ra

A Raft implementation for Erlang and Elixir that strives to be efficient and make it easier to use multiple Raft clusters in a single system.

Handling of ra machine failures

mbj4668 opened this issue

I have noticed that the system ends up in what looks like a bad state if the apply callback crashes (erlang:error or erlang:exit).
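For illustration, a minimal sketch of the kind of machine that triggers this. The module name store_machine and its commands are made up for the example, not the actual code involved:

```erlang
-module(store_machine).
-behaviour(ra_machine).

-export([init/1, apply/3]).

init(_Config) ->
    #{}.

%% Handled commands work as expected ...
apply(_Meta, {put, Key, Value}, State) ->
    {maps:put(Key, Value, State), ok};
%% ... but any command hitting this clause makes apply/3 crash,
%% which takes down the ra_server_proc process for the server.
apply(_Meta, Unknown, _State) ->
    erlang:error({unknown_command, Unknown}).
```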

If apply crashes, the ra_server_proc process exits and is restarted by its supervisor (ra_server_sup). It then fails again, the supervisor reaches its maximum restart intensity, and its own supervisor (ra_server_sup_sup) detects this. However, the ra_server_sup child has restart type temporary, so the failure is simply ignored. Here's an attempt to illustrate the supervision tree:

The system is called store.

ra_sup [one_for_one, max 1 restarts in 5 secs]
  +-- PERM ra_systems_sup [one_for_one, max 1 restarts in 5 secs]
  |          +-- PERM <0.195.0>/ra_system_sup [one_for_all, max 1 restarts in 5 secs]
  |                     +-- PERM ra_store_server_sup_sup/ra_server_sup_sup [simple_one_for_one, max 1 restarts in 5 secs]
  |                     |          +-- TEMP <0.241.0>/ra_server_sup [one_for_one, max 2 restarts in 5 secs]
  |                     |                     +-- TRAN store_ra/ra_server_proc
  |                     +-- PERM ra_store_log_sup/ra_log_sup [one_for_all, max 5 restarts in 5 secs]
  |                     |          +-- PERM <0.205.0>/ra_log_wal_sup [one_for_one, max 1 restarts in 5 secs]
  |                     |          |          +-- PERM ra_store_log_wal/ra_log_wal
  |                     |          +-- PERM ra_store_segment_writer/ra_log_segment_writer
  |                     |          +-- PERM ra_store_log_meta/ra_log_meta
  |                     |          +-- PERM <0.200.0>/ra_log_pre_init
  |                     +-- PERM ra_store_log_ets/ra_log_ets
  +-- PERM ra_file_handle
  +-- PERM ra_metrics_ets
  +-- PERM ra_machine_ets

I expected the error to propagate and eventually terminate the application. What is the intended way to handle these kinds of errors? My current workaround is to find the temporary supervisor (<0.241.0> above) and monitor it from another process, but this requires peeking into the internal state of ra, which doesn't seem quite right.
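For completeness, a sketch of that workaround, assuming the registered names from the tree above (ra_store_server_sup_sup is an ra-internal name that could change between releases):

```erlang
%% Monitor the temporary ra_server_sup children under the store
%% system's ra_server_sup_sup, so the calling process is notified
%% once a supervisor gives up restarting its ra_server_proc.
monitor_ra_server_sups() ->
    [erlang:monitor(process, Pid)
     || {_Id, Pid, supervisor, _Mods}
            <- supervisor:which_children(ra_store_server_sup_sup),
        is_pid(Pid)].
```

The calling process then receives a {'DOWN', Ref, process, Pid, Reason} message when one of those supervisors terminates, and can react to the crash loop from there.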