[BUG] - SIGHUP during ledger replay in node 8.2.0 pre-release is treated as SIGTERM

johnalotoski opened this issue a year ago

Internal/External
Internal

Area
Other -- signal handling

Summary
In node versions <= 8.1.2, SIGHUP triggers a topology reload, even during ledger replay. For example:

# During ledger replay on node 8.1.2:
...
[nixos-p7:cardano.node.ChainDB:Info:5] [2023-07-28 17:16:23.12 UTC] Replayed block: slot 1706373 out of 23897971. Progress: 7.14%
[nixos-p7:cardano.node.ChainDB:Info:5] [2023-07-28 17:16:23.21 UTC] Replayed block: slot 1710719 out of 23897971. Progress: 7.16%
[nixos-p7:cardano.node.ChainDB:Info:5] [2023-07-28 17:16:23.34 UTC] Replayed block: slot 1715006 out of 23897971. Progress: 7.18%
[nixos-p7:cardano.node.startup:Notice:53] [2023-07-28 17:16:23.41 UTC] Performing topology configuration update
[nixos-p7:cardano.node.startup:Info:53] [2023-07-28 17:16:23.41 UTC] 
Local Root Groups:
  (1,[])
Public Roots:
  (RelayAccessDomain "preview-node.world.dev.cardano.org" 30002,DoNotAdvertisePeer)
Get root peers from the ledger after slot 322000
[nixos-p7:cardano.node.ChainDB:Info:5] [2023-07-28 17:16:23.45 UTC] Replayed block: slot 1719305 out of 23897971. Progress: 7.19%
[nixos-p7:cardano.node.ChainDB:Info:5] [2023-07-28 17:16:23.54 UTC] Replayed block: slot 1723675 out of 23897971. Progress: 7.21%
[nixos-p7:cardano.node.ChainDB:Info:5] [2023-07-28 17:16:23.63 UTC] Replayed block: slot 1727975 out of 23897971. Progress: 7.23%
...

On node 8.2.0, SIGHUP now acts as a kill signal during ledger replay:

# Example on node 8.2.0
[nixos-p7:cardano.node.ChainDB:Info:5] [2023-07-28 17:26:57.08 UTC] Replayed block: slot 881219 out of 23898365. Progress: 3.69%
[nixos-p7:cardano.node.ChainDB:Info:5] [2023-07-28 17:26:57.12 UTC] Replayed block: slot 885591 out of 23898365. Progress: 3.71%
[nixos-p7:cardano.node.ChainDB:Info:5] [2023-07-28 17:26:57.15 UTC] Replayed block: slot 889907 out of 23898365. Progress: 3.72%
<END -- no further output after SIGHUP is issued>

# Strace shows:
1594166 17:08:11.468441 pread64(23, "\202\6\205\202"..., 13408, 889993) = 13408
1594166 17:08:11.469167 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=1, si_uid=65534} ---
1594316 17:08:11.469342 <... futex resumed>) = ?
1594180 17:08:11.469374 <... futex resumed>) = ?
1594179 17:08:11.469587 <... futex resumed>) = ?
1594178 17:08:11.469611 <... futex resumed>) = ?
1594177 17:08:11.469629 <... poll resumed> <unfinished ...>) = ?
1594176 17:08:11.469645 <... epoll_wait resumed> <unfinished ...>) = ?
1594175 17:08:11.469695 <... epoll_wait resumed> <unfinished ...>) = ?
1594316 17:08:11.470375 +++ killed by SIGHUP +++
1594180 17:08:11.470408 +++ killed by SIGHUP +++
1594179 17:08:11.470420 +++ killed by SIGHUP +++
1594178 17:08:11.470431 +++ killed by SIGHUP +++
1594177 17:08:11.470442 +++ killed by SIGHUP +++
1594176 17:08:11.470453 +++ killed by SIGHUP +++
1594175 17:08:11.470464 +++ killed by SIGHUP +++
1594174 17:08:11.470475 <... read resumed> <unfinished ...>) = ?
1594174 17:08:11.496745 +++ killed by SIGHUP +++
1594166 17:08:11.496832 +++ killed by SIGHUP +++

Steps to reproduce
One way to reproduce the behavior:

mkdir -p repro/db
cd repro

# Obtain 8.1.2 chain state, for example on preview.
# These cmds should work for those on x86_64-linux with nix installed with flake support.
# If you already have preview chain state you can copy it to the repro/db dir to avoid sync time.
# You just need to sync enough chain state so that when you do a ledger replay with this state for version 8.2,
# there is enough time to issue a SIGHUP before the ledger replay is done.
# Syncing to slot ~3,000,000 will give about 1 minute of ledger replay time to issue a signal for testing.
# CTRL-C the syncing when you have enough 8.1.2 chain state synced.
ENV=preview
ENVIRONMENT=$ENV \
DATA_DIR=./db \
SOCKET_PATH=$(pwd)/node.socket \
nix run github:input-output-hk/cardano-world#x86_64-linux.cardano.entrypoints.cardano-node \
  | tee -a node-$ENV-8.1.2.log

# From the same directory, run essentially the same command but with a branch using node 8.2.0:
ENV=preview
ENVIRONMENT=$ENV \
DATA_DIR=./db \
SOCKET_PATH=$(pwd)/node.socket \
nix run github:input-output-hk/cardano-world/sl/cardano-cli-input#x86_64-linux.cardano.entrypoints.cardano-node \
  | tee -a node-$ENV-8.2.0.log

# While the command above has node 8.2.0 run a ledger replay, from another shell,
# issue a SIGHUP and observe the outcome
pkill cardano-node --signal SIGHUP

# To verify on 8.1.2 this is not the behavior, allow the 8.2.0 node command above
# to complete the ledger replay, then CTRL-C to stop it.
# From the same directory, re-run the first command above so that node 8.1.2 replays the 8.2.0 chain state.
# Send a SIGHUP signal during node 8.1.2 ledger replay and observe.

Expected behavior

SIGHUP should trigger a topology reload, including during ledger replay, or at least should not act like SIGTERM and cause the node to exit during ledger replay.

System info (please complete the following information):

OS Name: NixOS
OS Version: 23.05
Node version (output of cardano-node --version):

❯ cardano-node --version
cardano-node 8.2.0 - linux-x86_64 - ghc-8.10
git rev 408d8ae10a2792ace3a822e312433960e47de4e9

CLI version (output of cardano-cli --version):

❯ cardano-cli --version
cardano-cli 8.4.0.0 - linux-x86_64 - ghc-9.2
git rev 0000000000000000000000000000000000000000

Additional context

This new behavior persists both with and without the new --non-producing-node flag, and with or without the associated block producer secrets.
In some clusters, we use service discovery for peers and automatically issue a SIGHUP whenever the hash of the discovered topology file changes. This happens frequently during the first few minutes of a cluster deployment, as nodes change their IP, port, etc. while new peers are deployed.
With this new behavior, any node that is still in a ledger replay gets killed before it can complete the replay, because peer updates happening in parallel on the network cause the topology watcher process to issue SIGHUPs.
One workaround is to obtain 8.2.0 chain state that has already finished the ledger replay and copy it manually to the jobs so they avoid being killed, which is rather tedious. Another is to query the node to discover whether it is in a ledger replay, and add logic so that the topology watcher does not issue SIGHUPs during ledger replay.
Preferably, we would return to the prior behavior, where SIGHUP was not terminal during ledger replay.
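
For illustration only, the second workaround (a topology watcher that holds back SIGHUP during ledger replay) could look roughly like the sketch below. Everything in it is an assumption rather than an existing tool: the function names `node_ready`, `send_sighup` and `maybe_reload`, the `TOPOLOGY_FILE` variable, and in particular the idea of using the presence of the node's local socket as a rough sign that replay has finished. A stale socket file left over from a previous run would defeat that check.

```shell
#!/usr/bin/env bash
# Hypothetical topology watcher helper; all names and paths are illustrative.

TOPOLOGY_FILE="${TOPOLOGY_FILE:-./topology.json}"
SOCKET_PATH="${SOCKET_PATH:-./node.socket}"
LAST_HASH=""

# Assumption: the node's local socket only exists once ledger replay is done,
# so its presence is treated as "safe to signal".
node_ready() {
  [ -S "$SOCKET_PATH" ]
}

# Deliver the reload signal to the node process.
send_sighup() {
  pkill --signal SIGHUP cardano-node
}

# Send SIGHUP only when the topology file hash changed AND the node looks ready.
# Returns 0 if a signal was sent, 1 otherwise. When the node is not ready the
# stored hash is left untouched, so the reload is retried on the next call.
maybe_reload() {
  local sum new_hash
  sum=$(sha256sum "$TOPOLOGY_FILE") || return 1
  new_hash=${sum%% *}
  if [ "$new_hash" = "$LAST_HASH" ]; then return 1; fi
  if ! node_ready; then return 1; fi
  send_sighup || return 1
  LAST_HASH=$new_hash
  return 0
}
```

A watcher loop would then call `maybe_reload` every few seconds instead of signalling the node directly.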

John A. Lotoski commented a year ago

Thanks!

Marcin Szamotulski · Answer 1 · Sat Jul 29 2023 03:17:49 GMT+0800 (China Standard Time)

@johnalotoski do you have a log from ~~8.1.2~~ 8.2.0 which includes the startup tracer?

John A. Lotoski · Answer 2 · Sat Jul 29 2023 04:31:15 GMT+0800 (China Standard Time)

Which config (or other) setting is the startup tracer? So far I have tried with the default environment config which is pulled from iohk-nix. Happy to enable the startup trace and get some more helpful logging.

John A. Lotoski · Answer 3 · Sat Jul 29 2023 04:45:04 GMT+0800 (China Standard Time)

Hi @coot: I'm attaching the default log output for 8.2.0 with a SIGHUP issued, which is where the log stops, in case the startup tracer you mention is already enabled by default:
node-preview-8.2.0-test-sighup-std-config.log

Marcin Szamotulski · Answer 4 · Mon Jul 31 2023 17:26:12 GMT+0800 (China Standard Time)

There are cardano.node.startup messages. It seems like the handler is not installed at all, since nothing is logged when SIGHUP is received.

Armando Santos · Answer 5 · Mon Jul 31 2023 17:57:02 GMT+0800 (China Standard Time)

@coot pinpointed the issue: the signal handler was previously set on node startup, but now it is set when nodeKernelHook runs (i.e. after chain replay). We can set the SIGHUP handler twice, first when we start the node and then again when we run the hook. We just need to warn in the first one that block forging will not be changed.

However, I don't think this is worth working around. It seems a very minor issue, and the end user can just wait; there are a bunch of other things that aren't available before chain replay, like local sockets, because for some reason they are only initialized afterwards.
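
As a rough analogue of the two-stage handler idea mentioned above, here is a small shell sketch using `trap`. It is not cardano-node code (the node is written in Haskell and does not use `trap`); the handler names and messages are made up for illustration. It only shows the general pattern: install a limited handler before the long-running phase so the default SIGHUP action (process termination) never applies, then swap in the full handler later.

```shell
#!/usr/bin/env bash
# Illustration only: shell analogue of installing a SIGHUP handler in two stages.

RELOADS=""

# Stage 1 handler: installed at startup, before the long replay phase.
early_hup_handler() {
  RELOADS="${RELOADS}early "
  echo "SIGHUP: topology update only; block forging will not be changed" >&2
}

# Stage 2 handler: installed once the later hook has run.
full_hup_handler() {
  RELOADS="${RELOADS}full "
  echo "SIGHUP: full reload" >&2
}

# Send SIGHUP to this shell from a child process, as an external watcher would.
signal_self() {
  sh -c 'kill -HUP "$PPID"'
}

trap early_hup_handler HUP   # without this, the first SIGHUP would kill the script
signal_self                  # handled by the stage 1 handler
trap full_hup_handler HUP    # replace the handler after "replay"
signal_self                  # handled by the stage 2 handler
echo "handlers invoked: $RELOADS"
```

Removing the first `trap` line reproduces the reported symptom in miniature: the script is terminated by the first SIGHUP before it reaches the second `trap`.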

Marcin Szamotulski · Answer 6 · Mon Jul 31 2023 21:09:36 GMT+0800 (China Standard Time)

We decided the issue will be fixed, hence #5421.

Armando Santos · Answer 7 · Tue Aug 01 2023 15:03:18 GMT+0800 (China Standard Time)

@johnalotoski has reported that #5421 fixed the issue.