Handling of ra machine failures
mbj4668 opened this issue · comments
I have noticed that the system ends up in a bad state (?) if the apply callback crashes (erlang:error or erlang:exit).
If apply crashes, the ra_server_proc process exits and is restarted by its supervisor (ra_server_sup). It then fails again, the supervisor reaches its maximum restart intensity and terminates, and its own supervisor (ra_server_sup_sup) detects this. However, the child ra_server_sup has restart type temporary, so ra_server_sup_sup just ignores this error. Here's an attempt to illustrate the supervision tree:
The system is called store.
ra_sup [one_for_one, max 1 restarts in 5 secs]
+-- PERM ra_systems_sup [one_for_one, max 1 restarts in 5 secs]
|   +-- PERM <0.195.0>/ra_system_sup [one_for_all, max 1 restarts in 5 secs]
|       +-- PERM ra_store_server_sup_sup/ra_server_sup_sup [simple_one_for_one, max 1 restarts in 5 secs]
|       |   +-- TEMP <0.241.0>/ra_server_sup [one_for_one, max 2 restarts in 5 secs]
|       |       +-- TRAN store_ra/ra_server_proc
|       +-- PERM ra_store_log_sup/ra_log_sup [one_for_all, max 5 restarts in 5 secs]
|       |   +-- PERM <0.205.0>/ra_log_wal_sup [one_for_one, max 1 restarts in 5 secs]
|       |   |   +-- PERM ra_store_log_wal/ra_log_wal
|       |   +-- PERM ra_store_segment_writer/ra_log_segment_writer
|       |   +-- PERM ra_store_log_meta/ra_log_meta
|       |   +-- PERM <0.200.0>/ra_log_pre_init
|       +-- PERM ra_store_log_ets/ra_log_ets
+-- PERM ra_file_handle
+-- PERM ra_metrics_ets
+-- PERM ra_machine_ets
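The same behaviour can probably be reproduced without ra, using only plain OTP supervisors. The sketch below is not ra's code: the module name temp_sup_demo, the child ids and the crashing worker are all made up for illustration, and the parent uses one_for_one rather than simple_one_for_one to keep it short. It starts a parent supervisor with a temporary child supervisor whose transient worker always crashes, then reports what is left once the child supervisor has given up.

```erlang
-module(temp_sup_demo).
-behaviour(supervisor).
-export([run/0, start_worker/0, init/1]).

%% Hypothetical stand-in for ra_server_proc: starts fine, then crashes.
start_worker() ->
    Pid = spawn_link(fun() -> timer:sleep(50), exit(apply_crashed) end),
    {ok, Pid}.

%% Plays the role of ra_server_sup_sup: its only child is a temporary supervisor.
init(parent) ->
    Child = #{id => child_sup,
              start => {supervisor, start_link, [?MODULE, child]},
              restart => temporary,
              type => supervisor},
    {ok, {#{strategy => one_for_one, intensity => 1, period => 5}, [Child]}};
%% Plays the role of ra_server_sup: a transient worker, max 2 restarts in 5 secs.
init(child) ->
    Worker = #{id => worker,
               start => {?MODULE, start_worker, []},
               restart => transient},
    {ok, {#{strategy => one_for_one, intensity => 2, period => 5}, [Worker]}}.

%% Returns {ExitReasonOfChildSup, IsParentAlive, ChildrenLeftUnderParent}.
run() ->
    {ok, Parent} = supervisor:start_link(?MODULE, parent),
    [{child_sup, ChildSup, supervisor, _}] = supervisor:which_children(Parent),
    Ref = erlang:monitor(process, ChildSup),
    receive
        {'DOWN', Ref, process, ChildSup, Reason} ->
            %% Give the parent a moment to handle the exit of its child.
            timer:sleep(100),
            {Reason, is_process_alive(Parent), supervisor:which_children(Parent)}
    after 5000 ->
        timeout
    end.
```

Calling temp_sup_demo:run() from a shell should show the child supervisor terminating while the parent stays alive with no children left, which is the silent failure described above.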
I expected the error to propagate and eventually terminate the application. What is the intended way to handle these kinds of errors? My current workaround is to find the temporary supervisor (<0.241.0> above) and monitor it from another process, but this requires peeking into the internal state of ra, which doesn't seem quite right.
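For reference, the workaround looks roughly like the sketch below. It uses only standard OTP calls (supervisor:which_children/1 and erlang:monitor/2). The module name sup_watch and both function names are hypothetical, and it assumes that ra_store_server_sup_sup and store_ra from the tree above are locally registered names, which is exactly the kind of internal detail I would rather not depend on.

```erlang
-module(sup_watch).
-export([find_parent_sup/2, watch/2]).

%% Return the child supervisors of SupSup that currently supervise the
%% process registered as ServerName (an empty list if none is found).
find_parent_sup(SupSup, ServerName) ->
    ServerPid = whereis(ServerName),
    [Sup || {_Id, Sup, supervisor, _Mods} <- supervisor:which_children(SupSup),
            is_pid(Sup),
            is_pid(ServerPid),
            lists:keymember(ServerPid, 2, supervisor:which_children(Sup))].

%% Spawn a watcher that monitors Pid and calls OnDown(Reason) once it exits.
watch(Pid, OnDown) when is_pid(Pid), is_function(OnDown, 1) ->
    spawn(fun() ->
                  Ref = erlang:monitor(process, Pid),
                  receive
                      {'DOWN', Ref, process, Pid, Reason} -> OnDown(Reason)
                  end
          end).
```

Usage would be something like `[Sup] = sup_watch:find_parent_sup(ra_store_server_sup_sup, store_ra), sup_watch:watch(Sup, fun(_Reason) -> init:stop() end).` Stopping the node with init:stop/0 is only one possible reaction; the callback could just as well notify another process.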