V0.10.0 Test Plan
yukirora opened this issue
Test Cases
single-node test
Machine Type | #Node * #GPU * GPU Type | Accelerated Computing Toolkit | Status |
---|---|---|---|
NDv5 SXM | 1 * 8 * H100 | CUDA 12.2 | done |
AMD MI200 | 1 * 16 * AMD MI200 | ROCm 5.7 | done |
AMD MI300x | 1 * 8 * AMD MI300x | ROCm 6.0 | done |
A100 and H100 related
- microbenchmark
  - Bug fix for GPU Burn test (#567)
  - Support INT8 in cublaslt function (#574)
  - Support cpu-gpu and gpu-cpu in ib-validation (#581)
  - Support graph mode in NCCL/RCCL benchmarks for latency metrics (#583)
  - Benchmarks: Micro benchmark - Add one-to-all, all-to-one, all-to-all support to gpu_copy_bw_performance (#588)
  - dist-inference cpp (#586)
  - add msccl support (#584)
  - Support in-place for NCCL/RCCL benchmark (#591)
- Model Benchmark Improvement
- Superbench improvement
  - Update Docker image for H100 support (#577)
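
When checking the one-to-all, all-to-one, and all-to-all modes from #588, it can help to know how many source/destination GPU pairs each pattern should cover on a node. The sketch below is only an illustration of the three patterns; `copy_pairs` and its `root` parameter are hypothetical names and are not part of SuperBench or `gpu_copy_bw_performance`, and the real tool may schedule copies differently.

```python
from itertools import permutations


def copy_pairs(pattern: str, num_gpus: int, root: int = 0) -> list[tuple[int, int]]:
    """Return (src, dst) GPU index pairs for a copy pattern.

    Hypothetical helper for illustration only; not a SuperBench API.
    """
    if not 0 <= root < num_gpus:
        raise ValueError(f"root {root} is outside 0..{num_gpus - 1}")
    others = [g for g in range(num_gpus) if g != root]
    if pattern == "one-to-all":
        # The root GPU copies to every other GPU.
        return [(root, g) for g in others]
    if pattern == "all-to-one":
        # Every other GPU copies to the root GPU.
        return [(g, root) for g in others]
    if pattern == "all-to-all":
        # Every ordered pair of distinct GPUs.
        return list(permutations(range(num_gpus), 2))
    raise ValueError(f"unknown pattern: {pattern}")


if __name__ == "__main__":
    for name in ("one-to-all", "all-to-one", "all-to-all"):
        print(name, len(copy_pairs(name, 8)))
```

For the 8-GPU machines in the table above, this gives 7 pairs for one-to-all and all-to-one and 56 pairs for all-to-all, which is a quick sanity check on how many result rows to expect if each pair is reported separately.
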
MI200 and MI300x
- microbenchmark improvement
  - Add HPL random generator to gemm-flops with ROCm (#578)
  - Update MLC version into 3.10 for CUDA/ROCm dockerfile (#562)
  - Add hipBLASLt function benchmark (#576)
  - Support cpu-gpu and gpu-cpu in ib-validation (#581)
  - Support graph mode in NCCL/RCCL benchmarks for latency metrics (#583)
  - Benchmarks: Micro benchmark - Add one-to-all, all-to-one, all-to-all support to gpu_copy_bw_performance (#588)
  - dist-inference cpp (#586)
  - Support in-place for NCCL/RCCL benchmark (#591)
- Model Benchmark Improvement
- Superbench improvement
  - Support Monitoring for AMD GPUs (#580)
Result analysis
- Support baseline generation from multiple nodes (#575)