Watchdog timeout incorrectly set to infinity
nifoc opened this issue · comments
For some reason the watchdog timeout keeps being set to infinity in my application. Best I can tell, this can only happen if the WATCHDOG_USEC environment variable is unset or does not contain a valid integer.
When I try to debug this (by doing some rpc calls into the application) I can see the correct values for the environment variable.
I use distillery for deployment, but I don't think that's the cause.
systemd Unit:
[Unit]
Description=xxx
Wants=network-online.target
After=network.target network-online.target
[Service]
Type=notify
ExecStart=/etc/init.d/xxx systemd_foreground
User=userxxx
Group=users
WorkingDirectory=/home/userxxx
WatchdogSec=15
NotifyAccess=all
Restart=on-failure
[Install]
WantedBy=multi-user.target
watchdog state:
$ app.sh rpc ":systemd.watchdog(:state)"
:infinity
config.exs:
config :systemd,
  unset_env: false
WATCHDOG_USEC environment variable:
$ app.sh rpc ":os.getenv('WATCHDOG_USEC')"
'15000000'
WATCHDOG_USEC to integer conversion:
$ app.sh rpc ":os.getenv('WATCHDOG_USEC') |> :string.to_integer()"
{15000000, []}
systemd_sup:init/1 return value:
$ app.sh rpc ":systemd_sup.init([])"
{:ok,
{%{strategy: :one_for_one},
[
%{
id: :socket,
start: {:systemd_socket, :start_link, [local: '/run/systemd/notify']}
},
%{id: :watchdog, start: {:systemd_watchdog, :start_link, [:infinity]}}
]}}
WATCHDOG_USEC and systemd_sup:init/1 in one call (to exclude any timing issues):
$ app.sh rpc "[:os.getenv('WATCHDOG_USEC'), :systemd_sup.init([])]"
[
'15000000',
{:ok,
{%{strategy: :one_for_one},
[
%{
id: :socket,
start: {:systemd_socket, :start_link, [local: '/run/systemd/notify']}
},
%{id: :watchdog, start: {:systemd_watchdog, :start_link, [:infinity]}}
]}}
]
What are the :os.getpid() and :os.getenv('WATCHDOG_PID') values?
:os.getpid() and :os.getenv('WATCHDOG_PID'):
$ app.sh rpc "[:os.getpid(), :os.getenv('WATCHDOG_PID')]"
['24553', '24545']
$ ps faux
ubnt 24545 0.0 0.0 3376 2432 ? Ss 13:42 0:00 /bin/sh /etc/init.d/xxx systemd_foreground
ubnt 24548 0.0 0.0 5032 2932 ? S 13:42 0:00 \_ bash -c ...
ubnt 24549 0.0 0.0 5036 2924 ? S 13:42 0:00 \_ /bin/sh ...
ubnt 24553 88.7 0.4 224944 77152 ? Sl 13:42 0:06 \_ /home/ubnt/app/releases/1.7.1/app.sh ...
[...]
(I think this should be fine, because the .service sets NotifyAccess=all)
Other things that may be useful:
$ app.sh rpc ":os.getenv('NOTIFY_SOCKET')"
'/run/systemd/notify'
$ app.sh rpc ":systemd_watchdog |> Process.whereis() |> :sys.get_state()"
{:state, :infinity, true}
$ app.sh rpc ":systemd_watchdog |> Process.whereis() |> :sys.get_status()"
{:status, #PID<12066.2358.0>, {:module, :gen_server},
[
[
"$ancestors": [:systemd_sup, #PID<12066.2355.0>],
"$initial_call": {:systemd_watchdog, :init, 1}
],
:running,
#PID<12066.2356.0>,
[],
[
header: 'Status for generic server systemd_watchdog',
data: [
{'Status', :running},
{'Parent', #PID<12066.2356.0>},
{'Logged events', []}
],
data: [{'State', {:state, :infinity, true}}]
]
]}
Ok, so I see the problem there. The PID of the BEAM VM is different from the watched PID:
$ app.sh rpc "[:os.getpid(), :os.getenv('WATCHDOG_PID')]"
['24553', '24545']
And man 3 sd_watchdog_enabled explicitly says that:
If the $WATCHDOG_USEC environment variable is set, and the $WATCHDOG_PID variable is unset or set to the PID of the current process, the service manager expects notifications from this process.
Because of that, the interval defaults to infinity and the watchdog is marked as disabled. I think it could instead be marked as disabled while the interval is still set to the passed value, but that would behave differently from the systemd implementation.
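To make the quoted rule concrete, here is a minimal Python sketch of the check as the man page describes it. It is an illustration only, not the library's or systemd's actual code: the function name `watchdog_enabled` and the `(enabled, timeout_usec)` return shape are assumptions made for this example, and `None` stands in for "no timeout" (`:infinity` in the output above).

```python
def watchdog_enabled(env, current_pid):
    """Hypothetical helper: apply the rule quoted from sd_watchdog_enabled(3).

    env         -- mapping of environment variables (e.g. os.environ)
    current_pid -- PID of the process asking (e.g. os.getpid())
    Returns (enabled, timeout_usec); timeout_usec is None when disabled.
    """
    raw_usec = env.get("WATCHDOG_USEC")
    if raw_usec is None:
        return (False, None)  # variable unset: no watchdog requested
    try:
        usec = int(raw_usec)
    except ValueError:
        return (False, None)  # not a valid integer
    if usec <= 0:
        return (False, None)

    raw_pid = env.get("WATCHDOG_PID")
    if raw_pid is not None and raw_pid != str(current_pid):
        # The watchdog is meant for a different process (here: the init script).
        return (False, None)
    return (True, usec)


# Values taken from the rpc output in this thread.
env = {"WATCHDOG_USEC": "15000000", "WATCHDOG_PID": "24545"}
print(watchdog_enabled(env, 24553))  # BEAM VM PID, does not match
print(watchdog_enabled(env, 24545))  # PID of the init script, matches
```

With the values from this thread, the BEAM VM (PID 24553) is not the watched process, so the check reports the watchdog as disabled even though WATCHDOG_USEC is valid.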
From what I see, the main problem is that you are using a SysV-like init script, and that script's PID is the one being watched, not the BEAM VM's:
$ ps faux
ubnt 24545 0.0 0.0 3376 2432 ? Ss 13:42 0:00 /bin/sh /etc/init.d/xxx systemd_foreground
ubnt 24548 0.0 0.0 5032 2932 ? S 13:42 0:00 \_ bash -c ...
ubnt 24549 0.0 0.0 5036 2924 ? S 13:42 0:00 \_ /bin/sh ...
ubnt 24553 88.7 0.4 224944 77152 ? Sl 13:42 0:06 \_ /home/ubnt/app/releases/1.7.1/app.sh ...
I have tested the application using mix release, but I can also try to set up tests with Distillery and Relx; however, as I do not use them currently, that will probably be left as "help wanted".
And man 3 sd_watchdog_enabled explicitly says that:
[…]
Thank you for that! I must've misremembered what exactly NotifyAccess does.
[…] but I can also try to set up tests with Distillery and Relx; however, as I do not use them currently, that will probably be left as "help wanted".
For stupid reasons the app I'm working on has a bunch of extra bash invocations between ExecStart= and the actual BEAM. However, I'm fairly certain that at least one of them (PID 24553 above) is always "added" by distillery.
Anyway, thanks again for your help! I think this can be resolved as user error 🙂
I have created #14, which always sets the interval and simply does not enable the watchdog if the PID is not the same. That way the user will still be able to force it if needed.
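The behaviour described for #14 could look roughly like the following Python sketch. This is an assumption-laden illustration rather than the code from that change: `watchdog_settings` is a hypothetical name, the `(interval_usec, enabled)` tuple is an invented return shape, and `None` stands in for `:infinity`.

```python
def watchdog_settings(env, current_pid):
    """Hypothetical helper: always keep the interval, gate only the enabled flag.

    Returns (interval_usec, enabled). interval_usec is None only when
    WATCHDOG_USEC is unset or not a positive integer.
    """
    try:
        interval = int(env.get("WATCHDOG_USEC", ""))
    except ValueError:
        return (None, False)  # unset or invalid: nothing to configure
    if interval <= 0:
        return (None, False)

    raw_pid = env.get("WATCHDOG_PID")
    # Enabled only when no PID is given or it names this very process.
    enabled = raw_pid is None or raw_pid == str(current_pid)
    return (interval, enabled)


# Values taken from the rpc output in this thread.
env = {"WATCHDOG_USEC": "15000000", "WATCHDOG_PID": "24545"}
print(watchdog_settings(env, 24553))
```

Under these assumptions, the setup from this thread would yield an interval of 15000000 µs with the watchdog left disabled, so the value is available if the user decides to turn the watchdog on manually.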