V0.8.0 Test Plan
yukirora opened this issue
Test Cases
single-node test
Machine Type | #Node * #GPU * GPU Type | PyTorch Version | Accelerated Computing Toolkit | Status |
---|---|---|---|---|
NDv5 SXM | 1 * 8 * H100 | PyTorch 1.x | CUDA 11.8 | Done |
ND A100 v4/NDm A100 v4 | 1 * 8 * A100 80GB SXM | PyTorch 1.x | CUDA 11.8 | |
ND A100 v4/NDm A100 v4 | 1 * 8 * A100 40GB SXM | PyTorch 1.8 | CUDA 11.1 | |
Hopper GPU and FP8 related benchmarks
- microbenchmark
- e2e benchmark
SuperBench existing benchmark improvement
- microbenchmark improvement
- e2e benchmark improvement
- Fix torch.dist init issue with multiple models (#495)
CPU benchmark
SuperBench Improvement
- install pipeline
- monitor
- Support cgroup v2 when reading system metrics in Monitor
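For testers verifying the cgroup item above, the sketch below shows one way a monitor could tell cgroup v1 from v2 and read container memory usage accordingly. It is not SuperBench's actual implementation: the function names `is_cgroup_v2` and `read_memory_usage_bytes` are hypothetical, and the code assumes the cgroup filesystem is mounted at `/sys/fs/cgroup`. The file names (`cgroup.controllers`, `memory.current`, `memory/memory.usage_in_bytes`) are the standard Linux kernel interface files.

```python
from pathlib import Path
from typing import Optional


def is_cgroup_v2(root: Path = Path("/sys/fs/cgroup")) -> bool:
    """Return True if `root` looks like a cgroup v2 (unified) hierarchy.

    Assumption: the presence of `cgroup.controllers` at the mount root
    is used as the v2 indicator.
    """
    return (root / "cgroup.controllers").is_file()


def read_memory_usage_bytes(root: Path = Path("/sys/fs/cgroup")) -> Optional[int]:
    """Read current memory usage in bytes, or None if it cannot be read.

    Hypothetical helper for illustration only.
    """
    if is_cgroup_v2(root):
        # cgroup v2 exposes usage in a flat file under the cgroup directory.
        usage_file = root / "memory.current"
    else:
        # cgroup v1 keeps a separate directory per controller.
        usage_file = root / "memory" / "memory.usage_in_bytes"
    try:
        return int(usage_file.read_text().strip())
    except (OSError, ValueError):
        # Missing file, permission error, or unexpected content.
        return None
```

Running `read_memory_usage_bytes()` inside a container on both a cgroup v1 host and a cgroup v2 host is a quick way to confirm which interface files are present before checking the Monitor's own output.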
multi-node test
Machine Type | #Node * #GPU * GPU Type | PyTorch Version | Accelerated Computing Toolkit | Status |
---|---|---|---|---|
NDv5 SXM | 2 * 8 * H100 | PyTorch 1.x | CUDA 11.8 | |
Hopper GPU and FP8 related benchmarks
- microbenchmark
- Add distributed inference benchmark (#493)