V0.7.0 Release Plan
Release Manager
Endgame
- Code freeze: Jan. 3rd, 2023
- Bug Bash date: Jan. 13th, 2023
- Release date: Jan. 20th, 2023
Main Features
SuperBench Improvement
- Update version to include revision hash and date (#427)
- Fix Transformers version to avoid TensorRT failure (#441)
- Add CUDA 11.8 Docker image for NVIDIA arch90 GPUs (#449)
- Support `sb deploy` without docker pulling (#466); see the CLI sketch after this list
- Support
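For context, a typical deploy-and-run flow with the SuperBench CLI looks roughly like the sketch below. This is a minimal, illustrative example: the host file and image tag are placeholders, and the exact option that lets `sb deploy` skip the Docker pull is defined in #466 and not shown here.

```bash
# Deploy SuperBench to the hosts listed in local.ini; the standard flow pulls the Docker image.
# Host file and image tag are placeholders for illustration.
sb deploy -f local.ini --docker-image superbench/superbench:v0.7.0-cuda11.8

# With #466, deployment can reuse an image that is already present on the nodes
# instead of pulling it; the exact option is defined in that PR.

# Run benchmarks against the deployed environment with a config file.
sb run -f local.ini -c config.yaml
```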
Micro-benchmark Improvement
- Support list of custom config strings in cudnn-functions and cublas-functions (#414); see the config sketch after this list
- Support GEMM-FLOPS for NVIDIA arch90 GPUs (#456)
- Add wait time option to resolve mem-bw instability issue (#438)
- Fix bug for incorrect data type judgment in cublas-function source code (#462)
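As a rough illustration of how a custom config list might be fed to these micro-benchmarks, the sketch below overrides benchmark parameters in a SuperBench config file. The parameter name `custom_configs` and the JSON-style config string are hypothetical placeholders; the authoritative parameter name and string format are defined in #414.

```bash
# Minimal, illustrative config override for cublas-functions.
# 'custom_configs' and the config-string format are placeholders (see #414 for the real interface).
cat > custom.yaml << 'EOF'
version: v0.7
superbench:
  enable:
    - cublas-functions
  benchmarks:
    cublas-functions:
      enable: true
      modes:
        - name: local
          proc_num: 8
          parallel: yes
      parameters:
        num_warmup: 8                 # common micro-benchmark parameter
        custom_configs:               # placeholder name for the list of config strings (#414)
          - '{"name": "cublasSgemm", "m": 4096, "n": 4096, "k": 4096}'
EOF

# Run only the overridden benchmark using the generated config.
sb run -f local.ini -c custom.yaml
```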
Model-benchmark Improvement
Distributed Benchmark Improvement
- Support pair-wise pattern in IB validation benchmark (#453)
- Support topo-aware, pair-wise, and K-batch patterns in nccl-bw benchmark (#454); see the config sketch after this list
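For the new run patterns, a config along these lines could select the pattern for nccl-bw. The `pattern` block is an assumption based on the PR descriptions; the exact schema (field names, extra inputs such as the batch size for K-batch or topology data for topo-aware) is defined in #453/#454.

```bash
# Illustrative multi-node config selecting a run pattern for nccl-bw.
# The 'pattern' schema is a sketch; see #453/#454 for the authoritative fields.
cat > nccl-pattern.yaml << 'EOF'
version: v0.7
superbench:
  enable:
    - nccl-bw
  benchmarks:
    nccl-bw:
      enable: true
      modes:
        - name: mpi
          proc_num: 8
          pattern:
            type: pair-wise          # alternatives: topo-aware, k-batch (extra fields per #454)
EOF

# remote.ini is a placeholder host file listing the participating nodes.
sb run -f remote.ini -c nccl-pattern.yaml
```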
Backlog
Inference Benchmark Improvement
- Support VGG, LSTM, and GPT-2 small in TensorRT Inference Backend
- Support VGG, LSTM, and GPT-2 small in ORT Inference Backend
- Support more TensorRT parameters (Related to #366)
Document
- Metric Reasoning Doc