T-Atlas / gpu-cluster-monitor

HTML interface to display GPU statuses of multiple nodes.

GPU Cluster Status Monitor

Introduction

This project provides an HTML interface that displays the GPU usage of multiple nodes. As shown in Figure 1, the main node polls each GPU node to collect its status and makes the results publicly available via an API.


Figure 1. Framework

Quick Start

Install packages (tested with python=3.9).

pip install -r requirments.txt

Assume we have a cluster with two nodes (node_1: 192.168.0.1; node_2: 192.168.0.2) and node_1 is the main node to display cluster information.

First, run the script node_info.py on each node to start a Flask process. Each node's status can then be obtained through its API on port 7080 (the default):

# on node_1, node_2
python node_info.py

Before starting the cluster monitor, save all node IP addresses in a text file:

> hosts.txt # clear hosts.txt
echo 192.168.0.1 >> hosts.txt
echo 192.168.0.2 >> hosts.txt
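Because a malformed hosts file is an easy mistake to make, it can help to check it before starting the monitor. The snippet below is a minimal sketch of such a check, not the parsing code used by api.py (which is not shown here). The function name load_hosts is hypothetical, and the choice to skip blank lines and repeated entries is an assumption.

```python
from pathlib import Path


def load_hosts(path):
    """Return the hosts listed in a file, one per line, in order.

    Blank lines and repeated entries are skipped (an assumption of this
    sketch; api.py may treat them differently).
    """
    hosts = []
    for line in Path(path).read_text().splitlines():
        host = line.strip()
        if host and host not in hosts:
            hosts.append(host)
    return hosts


if __name__ == "__main__":
    # Print the hosts the monitor would be asked to poll.
    hosts_file = Path("hosts.txt")
    if hosts_file.exists():
        for host in load_hosts(hosts_file):
            print(host)
```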

Second, run the interface API on node_1.

python api.py -c hosts.txt --port 7070

Then visit http://192.168.0.1:7070 in Chrome.

Customize the port

# on each node
python node_info.py --port <node_api_port> --disable_log

# on main node
python api.py -c hosts.txt --port <main_api_port> --node_port <node_api_port>

# visit http://<main_node_ip>:<main_api_port>
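To illustrate the polling step that the main node performs, here is a minimal sketch that queries the /get-status endpoint of every node and tolerates nodes that are down. It is not the code in api.py. The names fetch_status and poll_nodes, the 3-second timeout, and the thread pool are all assumptions, and it uses the standard library's urllib instead of requests so that it has no dependencies.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_status(host, node_port=7080, passwd="8888", timeout=3.0):
    """POST the password to one node's /get-status endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        f"http://{host}:{node_port}/get-status",
        data=json.dumps({"passwd": passwd}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def poll_nodes(hosts, fetch=fetch_status):
    """Query every host concurrently; map each host to its status, or to None on failure."""

    def safe_fetch(host):
        try:
            return fetch(host)
        except (OSError, ValueError):
            # OSError covers unreachable nodes, timeouts, and HTTP errors;
            # ValueError covers a response body that is not valid JSON.
            return None

    if not hosts:
        return {}
    with ThreadPoolExecutor(max_workers=min(8, len(hosts))) as pool:
        return dict(zip(hosts, pool.map(safe_fetch, hosts)))
```

Calling poll_nodes(["192.168.0.1", "192.168.0.2"]) would return a dictionary keyed by IP address. A custom port can be supplied by passing a wrapped fetch function, for example fetch=lambda host: fetch_status(host, node_port=7081).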

Specifications

Password

A password is required to get node status; the default is '8888'. To change the password, modify the global variable PASSWD in both node_info.py and api.py.

Get node status (json)

With node_info.py running on 192.168.0.3:7080, node status data can be retrieved in Python:

import json
import requests

# Request the node status; the password goes in the JSON body.
res = requests.post('http://192.168.0.3:7080/get-status', json={'passwd': '8888'})

print(json.dumps(res.json(), indent=4))

The structure of node status:

{
    "hostname": (`str`)
    "last_update": (`str`) isoformat, e.g., "2023-04-29T21:17:41.419592"
    "ips": List[Tuple[interface, ip]], e.g., [["eno1", "192.168.0.3"]]
    "gpus": [
        {
            "index": (`int`)
            "name": (`str`) gpu brand
            "use_mem": (`int`) used memory in MiB
            "tot_mem": (`int`) total memory in MiB
            "utilize": (`int`) utilization percent
            "temp": (`int`) temperature
            "users": [{"pid": 123, "username": xx, "mem(MiB)": 1024, "command": xx}, ...]
        },
        ...
    ]
}
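As an example of consuming this structure, the sketch below reduces a node status to one summary row per GPU. It relies only on the field names listed above. The function name summarize_node and the derived mem_percent field are hypothetical, and the sample values are made up for illustration.

```python
def summarize_node(status):
    """Return one summary dict per GPU found in a node status payload."""
    rows = []
    for gpu in status.get("gpus", []):
        total = gpu["tot_mem"]
        rows.append({
            "index": gpu["index"],
            "name": gpu["name"],
            # Share of GPU memory in use, as a percentage with one decimal place.
            "mem_percent": round(100 * gpu["use_mem"] / total, 1) if total else 0.0,
            "utilize": gpu["utilize"],
            # Distinct user names with a process on this GPU, sorted alphabetically.
            "users": sorted({user["username"] for user in gpu.get("users", [])}),
        })
    return rows


if __name__ == "__main__":
    # Made-up sample payload that follows the structure described above.
    sample = {
        "hostname": "node_2",
        "gpus": [{
            "index": 0, "name": "example-gpu", "use_mem": 1024, "tot_mem": 8192,
            "utilize": 40, "temp": 55,
            "users": [{"pid": 123, "username": "alice", "mem(MiB)": 1024, "command": "python"}],
        }],
    }
    for row in summarize_node(sample):
        print(row)
```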

Alternatively, print the node status data on the command line by running:

python node_info.py --debug
