rom1504 / gpu-tester

gpu tester detects broken and slow gpus in a cluster

gpu_tester

Gpu tester finds all your bad gpus.

Works on slurm.

Features:

  • does a forward pass on each gpu
  • checks for gpus returning incorrect results
  • checks for gpus failing due to ECC errors

Roadmap:

  • sanity check forward speed
  • sanity check broadcast speed

Install

Create a venv:

python3 -m venv .env
source .env/bin/activate
pip install -U pip

Then:

pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu116
pip install gpu_tester

Python examples

Check out the examples in the repository to call this as a lib.
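Another way to script runs is to shell out to the CLI. A minimal sketch, assuming the gpu_tester entry point is on your PATH; the flag names are taken from the commands further down, and the actual submission is left commented out because it needs a slurm cluster:

```python
import subprocess  # needed only for the commented-out submission below

def build_cmd(nodes=2, parallel_tests=50, partition="gpu",
              test_kind="ddp", job_timeout=45):
    # Build the gpu_tester invocation as an argument list.
    return [
        "gpu_tester",
        "--nodes", str(nodes),
        "--parallel-tests", str(parallel_tests),
        "--partition", partition,
        "--test_kind", test_kind,
        "--job_timeout", str(job_timeout),
    ]

cmd = build_cmd()
# subprocess.run(cmd, check=True)  # submits the jobs; requires slurm
print(" ".join(cmd))
# → gpu_tester --nodes 2 --parallel-tests 50 --partition gpu --test_kind ddp --job_timeout 45
```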

Output

Output looks like this:

job succeeded
0 have incorrect results, 1 have gpu errors and 319 succeeded
incorrect results:
[]
gpu errors:
[['gpu_error', 'compute-od-gpu-st-p4d-24xlarge-156', '3']]
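The error entries are plain lists, which appear to be [reason, hostname, gpu_index], so post-processing is straightforward. A small sketch (the data is copied from the sample output above) that turns failing hostnames into a comma-separated hostlist you can pass to --exclude on the next run:

```python
# Entries copied from the sample output above: [reason, hostname, gpu_index].
gpu_errors = [["gpu_error", "compute-od-gpu-st-p4d-24xlarge-156", "3"]]
incorrect_results = []

# Collect the distinct hostnames that reported any problem.
bad_hosts = sorted({host for _, host, _ in gpu_errors + incorrect_results})

# Comma-separated hostnames work as a slurm-style exclude list.
exclude = ",".join(bad_hosts)
print(exclude)  # → compute-od-gpu-st-p4d-24xlarge-156
```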

Recommended testing strategy

Pair based strategy

The easiest way to quickly spot broken nodes is the pair-based strategy. It runs many 2-node jobs in parallel and finds which nodes can talk to each other. Here is one example:

gpu_tester --nodes 2 --parallel-tests 50 --job_comment laion --partition "gpu" --test_kind "ddp" --job_timeout 45 --exclude 'gpu-st-p4d-24xlarge-[66]'
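Why pairs? With 2-node jobs, a node that only ever appears in failing pairs is the likely culprit, while a node with at least one passing pair is probably fine. A toy sketch of that reasoning (the pair outcomes below are made up for illustration; gpu_tester does the scheduling for you):

```python
# Hypothetical pair outcomes: (node_a, node_b) -> did the job succeed?
results = {
    ("node-1", "node-2"): True,
    ("node-1", "node-3"): False,
    ("node-2", "node-3"): False,
}

# A node is suspect if every pair containing it failed.
nodes = {n for pair in results for n in pair}
suspects = [
    n for n in sorted(nodes)
    if not any(ok for pair, ok in results.items() if n in pair)
]
print(suspects)  # → ['node-3']
```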

All at once strategy

Once you have validated that this works, you may want to run the DDP strategy over all nodes at once, e.g.:

gpu_tester --nodes 100 --parallel-tests 1 --job_comment laion --partition "gpu" --test_kind "ddp" --job_timeout 300 --exclude 'gpu-st-p4d-24xlarge-[66]'

Simple forward

If you want to validate only the forward functionality of the gpus, not the communication, you may use:

gpu_tester --nodes 100 --parallel-tests 1 --job_comment laion --partition "gpu" --test_kind "simple_forward" --job_timeout 50 --exclude 'gpu-st-p4d-24xlarge-[66]'

API

This module exposes a single function gpu_tester which takes the same arguments as the command line tool:

  • cluster the cluster. (default slurm)
  • job_name slurm job name. (default gpu_tester)
  • partition slurm partition. (default compute-od-gpu)
  • gpu_per_node number of gpus per node. (default 8)
  • nodes number of gpu nodes. (default 1)
  • output_folder the output folder. (default None which means current folder / results)
  • job_timeout job timeout (default 150 seconds)
  • job_comment optional comment arg given to slurm (default None)
  • job_account optional account arg given to slurm (default None)
  • test_kind simple_forward or ddp. simple_forward is a quick forward test; ddp uses pytorch DDP to check the gpu interconnect (default simple_forward)
  • parallel_tests number of tests to run in parallel. Recommended with nodes == 2 to test pair by pair (default 1)
  • nodelist node whitelist, example 'gpu-st-p4d-24xlarge-[66-67]' (default None)
  • exclude node blacklist, example 'gpu-st-p4d-24xlarge-[66-67]' (default None)
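Gathered as a dict, the defaults above look like this. The helper below is hypothetical, not part of the package; it is just a convenient way to build keyword arguments for the gpu_tester function programmatically:

```python
# Defaults copied from the parameter list above (assumption: these
# mirror the gpu_tester function's keyword arguments).
DEFAULTS = {
    "cluster": "slurm",
    "job_name": "gpu_tester",
    "partition": "compute-od-gpu",
    "gpu_per_node": 8,
    "nodes": 1,
    "output_folder": None,
    "job_timeout": 150,
    "job_comment": None,
    "job_account": None,
    "test_kind": "simple_forward",
    "parallel_tests": 1,
    "nodelist": None,
    "exclude": None,
}

def make_args(**overrides):
    """Merge overrides into the defaults, rejecting unknown names."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown arguments: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}

args = make_args(nodes=2, parallel_tests=50, test_kind="ddp")
print(args["nodes"], args["test_kind"])  # → 2 ddp
```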

For development

Either locally or in gitpod (there, run export PIP_USER=false first).

Set up a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

To run tests:

pip install -r requirements-test.txt

Then:

make lint
make test

You can use make black to reformat the code.

To run a specific test: python -m pytest -x -s -v tests -k "dummy"


License: MIT

