sfu-mial / SlurmMonitor.jl

SLURM Monitor to detect node health and notify Slack in case of events. Mirror of https://github.com/bencardoen/SlurmMonitor.jl

Home Page:https://github.com/bencardoen/SlurmMonitor.jl

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

SlurmMonitor

DOI

SlurmMonitor monitors SLURM (an HPC scheduler) based clusters for status, records the data over time, and if configured can act on predefined conditions.

Linking to Slack

** You need admin rights to do this, and do not create public endpoints without realizing what they (can) do**

  • Login to Slack
  • Settings and Admin
  • "Manage Apps"
  • "Build"
  • Create a new App
  • Activate new webhook

Test if link works

curl -X POST -H 'Content-type: application/json' --data '{"text":"Hello, World!"}' $URL

Installation

You install the monitor on a login node, and this assumes HPC admins are ok with you doing this.

git clone <thisrepo>
cd SlurmMonitor.jl

Then start julia

julia
julia> using Pkg; Pkg.add(".")

or

julia

then

julia>using Pkg; Pkg.activate() # Activate env in current dir, optional
julia>using Pkg; Pkg.add(url=<thisrepo>)

Test integration with slack

julia --project=.  # assuming you're in the cloned directory

Then

using SlurmMonitor
endpoint=readendpoint("endpoint.txt")
posttoslack("42 is the answer", endpoint)

That either posts the message, or tells you why it couldn't. make sure the format of the url is /services/.../.../.. See slack app configuration page on how to fix this if invalid.

Usage

The monitor polls at intervals i, repeating r times, with minimum acceptable latency l and saving to output dir o. Triggers (node going down, latency spikes), trigger optional messages to Slack e. It needs and endpoint file (1 line), with a endpoint (see earlier). You'd use this within a tmux/screen session to keep it in the background.

Example

Every minute, for 1e4 minutes, run the monitor, and call Solar Slack if issues arise.

julia --project=. src/monitor.jl -i 60 -r 10000 -o . -e endpoint_solar.txt -l 40

This will save a csv file, every z seconds, for k iterations, where 1 line represents the state of each node in the cluster, recording total/free CPU/RAM/GPU and node status (IDLE, ALLOC, ...).

On specified conditions (IDLE->DOWN) will send messages to a linked Slackbot, configured with the right endpoint.

If a node is not responsive (by network), a similar trigger is fired. Define the mininum average latency you consider as not-reachable in CLI.

Output

Saved to observed_state.csv. Do Not move the csv file, it's continuously read/written to See src/SlurmMonitor.jl, e.g. summarizestate($DATAFRAME, $ENDPOINT).

using Pkg
Pkg.activate(".")
using DataFrames
using CSV
df = CSV.read("where.csv", DataFrame)
endpoint = readendpoint("whereendpointis.txt")
summarizestate(df, endpoint) ## Sends to slack
plotstats(df)  ## Plots in svg

Dependencies

Warning

If you run this on a cluster, make sure you're authorized to do so. Calling scontrol and sinfo are RPC calls that cause a non-trivial load on the scheduler, if the cluster has 1000s of nodes, and you set the interval to 1s, that means 2000 RPC calls/1. Note that it takes several seconds, if not more, for a node to change state anyway. Do not do this unless you're a cluster admin. Sane intervals are ~ 60-120 or more seconds.

Extra functionality

  • Triggers can be anything, currently node state and latency are used
  • Diskusage, nvidia drivers, etc are all implemented, not active (can trigger ssh lockout)
  • Contact me if you need those active

Troublehshooting

Times seem wrong

Times are recorded in UTC. If you want this differently, it's not hard, I'd happily accept a properly documented PR.

Cite

If you find this useful, please cite

@software{ben_cardoen_2022_7106106,
  author       = {Ben Cardoen},
  title        = {{SlurmMonitor.jl: A Slurm monitoring tool that
                   notifies slack on adverse SLURM HPC state changes
                   and records temporal statistics on utilization.}},
  month        = sep,
  year         = 2022,
  note         = {https://github.com/bencardoen/SlurmMonitor.jl},
  publisher    = {Zenodo},
  version      = {0.1.0},
  doi          = {10.5281/zenodo.7106106},
  url          = {https://doi.org/10.5281/zenodo.7106106}
}

About

SLURM Monitor to detect node health and notify Slack in case of events. Mirror of https://github.com/bencardoen/SlurmMonitor.jl

https://github.com/bencardoen/SlurmMonitor.jl

License:GNU Affero General Public License v3.0


Languages

Language:Julia 98.5%Language:Shell 1.5%