pdjr-skplugin-interfacewatchdog

Interface activity watchdog for Signal K.

Description

pdjr-skplugin-interfacewatchdog implements one or more watchdogs on one or more Signal K interfaces, triggering an exception if throughput on an interface falls below some specified threshold rate.

A sequence of contiguous exceptions is characterised as a problem, and a problem may result in the watchdog taking, and perhaps repeating, some action.

The sensitivity of a watchdog to exceptions can be configured and the appearance of a problem can be handled in a number of ways: it can be ignored (in which case monitoring continues), or the watchdog can be disabled, or the host Signal K server can be restarted in the hope that the problem can be corrected by a hard reset of the associated interface.

If server restarting is configured, the maximum number of allowed restarts can be limited to prevent a persistent loss of service resulting from runaway reboots on a dead interface.

Each watchdog logs key events to the server log and issues Signal K notifications on its own notification key.

The plugin exposes an HTTP API and contributes OpenAPI documentation of its interface to the Signal K OpenAPI service.

Configuration

The plugin configuration consists of a Watchdogs array containing zero or more Watchdog items each of which configures monitoring of a specified Signal K interface against a specified throughput threshold.

Interface name interface: Required string property specifying the Signal K interface that should be monitored. This must match one of the IDs displayed in the Signal K dashboard under Server -> Data Connections.
Watchdog name name: Optional string property giving a unique name that will be used in log and notification paths to identify this watchdog. Defaults to interface-n where n is an integer automatically assigned to ensure uniqueness.
Throughput threshold in deltas/s threshold: Optional integer data rate (in deltas per second) at or below which a problem should be logged. Defaults to 0 which will only identify interfaces that are completely dead.
Start taking action after this many problems startActionThreshold: Optional integer specifying the number of problems that can be logged on *interface* before triggering the configured action (see below). A value of 0 says wait indefinitely and so disables the watchdog function on this interface. Defaults to 3.
Stop taking action after this many problems stopActionThreshold: If the number of problems logged on interface reaches this value then stop performing the configured action and stop watching this interface. The supplied value must be greater than *startActionThreshold*. Defaults to startActionThreshold + 3.
Action to take? action: The action to take on each problem event between startActionThreshold and stopActionThreshold. Must be one of 'none', 'suspend-watchdog' or 'restart-server'. Defaults to 'suspend-watchdog'.
Notification path notificationPath: Optional path under 'vessels.self.' on which the plugin should issue status notifications. If omitted, then the path 'notifications.plugins.interfacewatchdog.name' will be used.

There is no restriction on the number of times an interface can occur in the Watchdogs array so long as each Watchdog has a unique name (although it only makes sense if one Watchdog on a shared interface specifies a 'restart-server' action).
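
The defaults listed above can be illustrated with a minimal sketch. The function name applyWatchdogDefaults is hypothetical and not part of the plugin, and the way the default name is built (interface name plus array index) is an assumption about how "interface-n" is assigned.

```javascript
// Hypothetical helper (not the plugin's own code): fills in the documented
// defaults for each Watchdog item and checks the documented constraints.
const ACTIONS = ['none', 'suspend-watchdog', 'restart-server'];

function applyWatchdogDefaults(watchdogs) {
  return watchdogs.map((w, index) => {
    if (typeof w.interface !== 'string' || w.interface === '') {
      throw new Error(`watchdog ${index}: 'interface' is required`);
    }
    // Assumption: the default name is the interface name plus the array index.
    const name = w.name ?? `${w.interface}-${index}`;
    const startActionThreshold = w.startActionThreshold ?? 3;
    const stopActionThreshold = w.stopActionThreshold ?? startActionThreshold + 3;
    const action = w.action ?? 'suspend-watchdog';
    if (stopActionThreshold <= startActionThreshold) {
      throw new Error(`watchdog ${name}: stopActionThreshold must exceed startActionThreshold`);
    }
    if (!ACTIONS.includes(action)) {
      throw new Error(`watchdog ${name}: unknown action '${action}'`);
    }
    return {
      interface: w.interface,
      name,
      threshold: w.threshold ?? 0,
      startActionThreshold,
      stopActionThreshold,
      action,
      notificationPath: w.notificationPath ?? `notifications.plugins.interfacewatchdog.${name}`
    };
  });
}
```

Called with [{ interface: 'ngt-1' }], this sketch yields a single watchdog with a threshold of 0, thresholds of 3 and 6 for starting and stopping action, and the 'suspend-watchdog' action.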

Example configuration

My ship has two NMEA busses bridged to a single Actisense interface called 'ngt-1'.

Bus0 is my 'domestic' NMEA bus and is expected to be available 24/7. Typical throughput on 'ngt-1' with just this bus enabled is around 20 deltas per second.

Bus1 is my 'navigation' bus and is expected to be available when navigating. Typical throughput on 'ngt-1' with both busses enabled is around 60 deltas per second.

Setting appropriate threshold values on two Watchdog configurations allows me to monitor the health of both data streams, issue notifications about them, and take crude remedial action if the 'ngt-1' interface dies.

{
  "configuration": {
    "watchdogs": [
      {
        "name": "Bus0",
        "interface": "ngt-1",
        "threshold": 10,
        "action": "restart-server"
      },
      {
        "name": "Bus1",
        "interface": "ngt-1",
        "threshold": 30,
        "action": "suspend-watchdog"
      }
    ]
  },
  "enabled": true,
  "enableDebug": false,
  "enableLogging": false
}
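
To see how the two thresholds separate the two busses, here is a small sketch that reports which watchdogs would raise an exception at a given throughput. The function watchdogsInException is hypothetical, and it assumes an exception is raised when the rate is at or below the threshold, as the threshold property is described.

```javascript
// Hypothetical illustration: which watchdogs on an interface raise an
// exception for a given throughput (in deltas per second)?
const watchdogs = [
  { name: 'Bus0', interface: 'ngt-1', threshold: 10, action: 'restart-server' },
  { name: 'Bus1', interface: 'ngt-1', threshold: 30, action: 'suspend-watchdog' }
];

function watchdogsInException(watchdogList, interfaceName, deltaRate) {
  return watchdogList
    .filter((w) => w.interface === interfaceName)
    // Assumption: "at or below" the threshold counts as an exception.
    .filter((w) => deltaRate <= w.threshold)
    .map((w) => w.name);
}

console.log(watchdogsInException(watchdogs, 'ngt-1', 60)); // both busses up: []
console.log(watchdogsInException(watchdogs, 'ngt-1', 20)); // Bus1 down: [ 'Bus1' ]
console.log(watchdogsInException(watchdogs, 'ngt-1', 0));  // interface dead: [ 'Bus0', 'Bus1' ]
```

With around 20 deltas per second only the Bus1 watchdog is affected, so the server is restarted only when throughput collapses to the level that suggests the interface itself has failed.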

Notifications

Each defined Watchdog writes notifications to its configured notificationPath or, if none is configured, to the default path.

Waiting for interface to become active: ALERT notification issued as soon as the watchdog begins watching interface throughput.
Started normal operation: NORMAL notification issued as soon as interface throughput rises above the specified watchdog threshold.
Server restart n of m: ALARM notification issued each time an exceptional throughput triggers a server restart.
Terminating watchdog: WARN notification issued when the watchdog stops monitoring its particular interface/threshold combination.
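
As a rough illustration, the notifications above could be packaged as Signal K deltas along the following lines. The event keys, the function makeNotificationDelta and the empty method array are assumptions made for this sketch; only the messages and states come from the list above.

```javascript
// Hypothetical mapping from watchdog events to the notification states
// and messages listed above.
const WATCHDOG_NOTIFICATIONS = {
  waiting: { state: 'alert', message: () => 'Waiting for interface to become active' },
  started: { state: 'normal', message: () => 'Started normal operation' },
  restart: { state: 'alarm', message: ({ n, m }) => `Server restart ${n} of ${m}` },
  terminating: { state: 'warn', message: () => 'Terminating watchdog' }
};

// Builds a Signal K delta carrying one notification value.
function makeNotificationDelta(notificationPath, event, details = {}) {
  const entry = WATCHDOG_NOTIFICATIONS[event];
  if (!entry) throw new Error(`unknown watchdog event '${event}'`);
  return {
    updates: [{
      values: [{
        path: notificationPath,
        // Assumption: no annunciation method is requested.
        value: { state: entry.state, method: [], message: entry.message(details) }
      }]
    }]
  };
}

const delta = makeNotificationDelta(
  'notifications.plugins.interfacewatchdog.Bus0', 'restart', { n: 1, m: 3 }
);
console.log(JSON.stringify(delta, null, 2));
```

Inside a plugin, a delta like this would typically be handed to the server with app.handleMessage(plugin.id, delta).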

Operation

The plugin uses the Signal K SERVERINFO event mechanism as its basic processing heartbeat and its source of information on interface throughput. Typically, Signal K generates SERVERINFO events every four or five seconds.

An interface will be monitored by the plugin if its waitForActivity configuration property has a non-zero value.

If throughput on a monitored interface falls below and remains below the configured threshold value for waitForActivity heartbeats then the interface is considered to be in a problem state. The plugin will handle this condition in one of two ways dependent upon the value of the rebootLimit configuration property.

If rebootLimit is zero, then a 'warn' notification is issued on the configured notificationPath and the interface is removed from further monitoring.

If rebootLimit is non-zero, then the plugin will commence a sequence of server restarts up to the maximum configured by rebootLimit. After each restart the interface is monitored in the way described above to determine whether or not the interface has been restored to a working state. A second or two before a restart sequence commences, an 'alert' notification is issued on notificationPath: this advance warning aims to allow an alarm handler or annunciator to detect the 'alert' condition and do its thing before the host server is restarted.

If the restart sequence fails to restore interface throughput above threshold, then a 'warn' notification is issued and the interface is removed from further monitoring.

If throughput on a problem interface recovers above threshold then a 'normal' notification is issued and monitoring proceeds as usual.
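
The behaviour described in this section can be summarised as a small per-heartbeat state machine. This is only a sketch under stated assumptions, not the plugin's own code: the functions newWatchdogState and onHeartbeat and the shape of the returned object are hypothetical, a rate at or below threshold is treated as an exception, and the restart count is kept in memory whereas a real plugin would have to persist it across server restarts.

```javascript
// Hypothetical per-interface state for the sketch.
function newWatchdogState() {
  return { belowCount: 0, restarts: 0, inProblem: false, monitoring: true };
}

// Processes one heartbeat and says what to notify and what to do.
function onHeartbeat(state, config, deltaRate) {
  const { threshold, waitForActivity, rebootLimit } = config;
  // A zero waitForActivity means the interface is not monitored at all.
  if (!state.monitoring || waitForActivity === 0) return { notify: null, action: 'none' };

  if (deltaRate > threshold) {
    // Throughput is healthy: announce recovery if we were in a problem state.
    const recovered = state.inProblem;
    state.belowCount = 0;
    state.inProblem = false;
    return { notify: recovered ? 'normal' : null, action: 'none' };
  }

  // Assumption: a rate at or below threshold counts as an exception.
  state.belowCount += 1;
  if (state.belowCount < waitForActivity) return { notify: null, action: 'none' };

  // waitForActivity consecutive exceptions: the interface has a problem.
  state.inProblem = true;
  state.belowCount = 0;
  if (rebootLimit === 0 || state.restarts >= rebootLimit) {
    state.monitoring = false; // give up on this interface
    return { notify: 'warn', action: 'stop-monitoring' };
  }
  state.restarts += 1;
  return { notify: 'alert', action: 'restart-server' };
}
```

With waitForActivity set to 2 and rebootLimit set to 1, two low heartbeats produce an 'alert' and a restart request, and two further low heartbeats produce a 'warn' and end monitoring of the interface.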

Be aware that a server restart is initiated by killing the parent Node process: Signal K will only restart automatically if, as will be the case after a normal installation, it is configured to be started by the host operating system's process manager.

Significant actions taken by the plugin are written to the server log.

Author

Paul Reeve <preeve_at_pdjr_dot_eu>
