Watchdog timeout incorrectly set to infinity

nifoc opened this issue 4 years ago

For some reason the watchdog timeout keeps being set to infinity in my application. Best I can tell, this can only happen if the WATCHDOG_USEC environment variable is unset or does not contain a valid integer.

When I try to debug this (by doing some rpc calls into the application) I can see the correct values for the environment variable.

I use distillery for deployment, but I don't think that's the cause.

systemd Unit:

[Unit]
Description=xxx
Wants=network-online.target
After=network.target network-online.target

[Service]
Type=notify
ExecStart=/etc/init.d/xxx systemd_foreground
User=userxxx
Group=users
WorkingDirectory=/home/userxxx
WatchdogSec=15
NotifyAccess=all
Restart=on-failure

[Install]
WantedBy=multi-user.target

watchdog state:

$ app.sh rpc ":systemd.watchdog(:state)"
:infinity

config.exs:

config :systemd,
  unset_env: false

WATCHDOG_USEC environment variable:

$ app.sh rpc ":os.getenv('WATCHDOG_USEC')"
'15000000'

WATCHDOG_USEC to integer conversion:

$ app.sh rpc ":os.getenv('WATCHDOG_USEC') |> :string.to_integer()"
{15000000, []}

systemd_sup:init/1 return value:

$ app.sh rpc ":systemd_sup.init([])"
{:ok,
 {%{strategy: :one_for_one},
  [
    %{
      id: :socket,
      start: {:systemd_socket, :start_link, [local: '/run/systemd/notify']}
    },
    %{id: :watchdog, start: {:systemd_watchdog, :start_link, [:infinity]}}
  ]}}

WATCHDOG_USEC and systemd_sup:init/1 in one call (to exclude any timing issues):

$ app.sh rpc "[:os.getenv('WATCHDOG_USEC'), :systemd_sup.init([])]"
[
  '15000000',
  {:ok,
   {%{strategy: :one_for_one},
    [
      %{
        id: :socket,
        start: {:systemd_socket, :start_link, [local: '/run/systemd/notify']}
      },
      %{id: :watchdog, start: {:systemd_watchdog, :start_link, [:infinity]}}
    ]}}
]

Łukasz Jan Niemier · Answer 1 · Tue Feb 04 2020 18:47:56 GMT+0800 (China Standard Time)

What are the :os.getpid() and :os.getenv('WATCHDOG_PID') values?

Daniel Kempkens · Answer 2 · Tue Feb 04 2020 21:51:15 GMT+0800 (China Standard Time)

:os.getpid() and :os.getenv('WATCHDOG_PID'):

$ app.sh rpc "[:os.getpid(), :os.getenv('WATCHDOG_PID')]"
['24553', '24545']

$ ps faux
ubnt     24545  0.0  0.0   3376  2432 ?        Ss   13:42   0:00 /bin/sh /etc/init.d/xxx systemd_foreground
ubnt     24548  0.0  0.0   5032  2932 ?        S    13:42   0:00  \_ bash -c ...
ubnt     24549  0.0  0.0   5036  2924 ?        S    13:42   0:00      \_ /bin/sh ...
ubnt     24553 88.7  0.4 224944 77152 ?        Sl   13:42   0:06          \_ /home/ubnt/app/releases/1.7.1/app.sh ...
[...]

(I think this should be fine, because the .service sets NotifyAccess=all)

Other things that may be useful:

$ app.sh rpc ":os.getenv('NOTIFY_SOCKET')"
'/run/systemd/notify'

$ app.sh rpc ":systemd_watchdog |> Process.whereis() |> :sys.get_state()"
{:state, :infinity, true}

$ app.sh rpc ":systemd_watchdog |> Process.whereis() |> :sys.get_status()"
{:status, #PID<12066.2358.0>, {:module, :gen_server},
 [
   [
     "$ancestors": [:systemd_sup, #PID<12066.2355.0>],
     "$initial_call": {:systemd_watchdog, :init, 1}
   ],
   :running,
   #PID<12066.2356.0>,
   [],
   [
     header: 'Status for generic server systemd_watchdog',
     data: [
       {'Status', :running},
       {'Parent', #PID<12066.2356.0>},
       {'Logged events', []}
     ],
     data: [{'State', {:state, :infinity, true}}]
   ]
 ]}

Łukasz Jan Niemier · Answer 3 · Tue Feb 04 2020 22:16:49 GMT+0800 (China Standard Time)

Ok, so I see the problem there. The PID of the BEAM VM is different from the watched PID:

$ app.sh rpc "[:os.getpid(), :os.getenv('WATCHDOG_PID')]"
['24553', '24545']

And man 3 sd_watchdog_enabled explicitly says:

If the $WATCHDOG_USEC environment variable is set, and the $WATCHDOG_PID variable is unset or set to the PID of the current process, the service manager expects notifications from this process.

Because of that, the interval defaults to infinity and the watchdog is marked as disabled. I think it could instead be marked as disabled while still setting the interval to the passed value, but that would differ from the behaviour of the systemd implementation.

From what I see, the main problem is that you are using a SysV-like init file, so that script's PID is the one being watched, not the BEAM VM's:
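
To make the rule quoted from the man page concrete, here is a minimal sketch in Python. The function name `watchdog_usec` and the use of `None` for "no watchdog" are assumptions made for illustration; this is not the code of the library or of libsystemd, and the real sd_watchdog_enabled has additional error handling.

```python
def watchdog_usec(env, pid):
    """Return the watchdog timeout in microseconds, or None when this
    process is not the one expected to send keep-alive notifications.

    Hypothetical helper that mirrors the man page rule only.
    """
    raw = env.get("WATCHDOG_USEC")
    if raw is None:
        return None  # variable unset: no watchdog requested
    try:
        usec = int(raw)
    except ValueError:
        return None  # not a valid integer
    watched = env.get("WATCHDOG_PID")
    # WATCHDOG_PID must be unset or equal to the PID of the current process.
    if watched is not None and watched != str(pid):
        return None
    return usec


# Values taken from this thread: the BEAM runs as PID 24553,
# but WATCHDOG_PID names the init script, PID 24545.
env = {"WATCHDOG_USEC": "15000000", "WATCHDOG_PID": "24545"}
print(watchdog_usec(env, 24553))  # None
print(watchdog_usec(env, 24545))  # 15000000
```

With the values from this issue the check fails for the BEAM process, which matches the :infinity interval seen in the watchdog state.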

$ ps faux
ubnt     24545  0.0  0.0   3376  2432 ?        Ss   13:42   0:00 /bin/sh /etc/init.d/xxx systemd_foreground
ubnt     24548  0.0  0.0   5032  2932 ?        S    13:42   0:00  \_ bash -c ...
ubnt     24549  0.0  0.0   5036  2924 ?        S    13:42   0:00      \_ /bin/sh ...
ubnt     24553 88.7  0.4 224944 77152 ?        Sl   13:42   0:06          \_ /home/ubnt/app/releases/1.7.1/app.sh ...

I have tested the application using mix release. I can also try to set up tests with Distillery and Relx, but as I do not currently use them, that will probably be left as "help wanted".

Daniel Kempkens · Answer 4 · Tue Feb 04 2020 22:41:10 GMT+0800 (China Standard Time)

And man 3 sd_watchdog_enabled explicitly says:
[…]

Thank you for that! I must've misremembered what exactly NotifyAccess does.

[…] I can also try to set up tests with Distillery and Relx, but as I do not currently use them, that will probably be left as "help wanted".

For stupid reasons the app I'm working on has a bunch of extra bash invocations between ExecStart= and the actual BEAM. However, I'm fairly certain that at least one of them (PID 24553 above) is always "added" by distillery.

Anyway, thanks again for your help! I think this can be resolved as user error 🙂

Łukasz Jan Niemier · Answer 5 · Wed Feb 05 2020 02:14:43 GMT+0800 (China Standard Time)

I have created #14, which always sets the interval and simply does not enable the watchdog if the PID does not match. That way the user can still force it if needed.
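
As a rough illustration of the behaviour described for #14, the sketch below keeps the interval whenever WATCHDOG_USEC parses and only decides separately whether the watchdog starts enabled. It is written in Python with a hypothetical function `watchdog_state`, and it uses `None` where the library uses :infinity. It is an assumption about the shape of the change, not the actual patch.

```python
def watchdog_state(env, pid):
    """Return (interval_usec, enabled).

    Hypothetical sketch: the interval is kept whenever WATCHDOG_USEC is a
    valid integer, and the watchdog is enabled only if WATCHDOG_PID is
    unset or matches the current process.
    """
    raw = env.get("WATCHDOG_USEC")
    try:
        interval = int(raw) if raw is not None else None
    except ValueError:
        interval = None  # stands in for :infinity
    watched = env.get("WATCHDOG_PID")
    pid_matches = watched is None or watched == str(pid)
    enabled = interval is not None and pid_matches
    return interval, enabled


# Values from this thread: BEAM PID 24553, watched PID 24545.
env = {"WATCHDOG_USEC": "15000000", "WATCHDOG_PID": "24545"}
print(watchdog_state(env, 24553))  # (15000000, False)
print(watchdog_state(env, 24545))  # (15000000, True)
```

In the mismatched case the interval is still available, so an application that knows its wrapper scripts change the PID could choose to enable the watchdog itself.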