deepseek-ai / DeepGEMM

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling


The reduction of GPU core frequency leads to significant fluctuations in GEMM performance

yaning223 opened this issue · comments

We encountered a problem during DeepSeek inference. When we overlap computation and communication, the power consumption exceeds the limit, resulting in GPU core frequency reduction (e.g. from 1980 MHz to 1575 MHz), which causes large fluctuations in GEMM performance. We notice that a similar situation does not appear in the DeepSeek profile, so we would like to ask whether there is a corresponding solution.
We have tried locking the frequency at a lower level (e.g. 1575 MHz), but that decreased overall end-to-end performance.
Looking forward to your answer! Thank you!
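For reference, a minimal sketch of the clock-locking experiment described above, assuming a standard nvidia-smi installation and administrative privileges; the 1575 MHz value is the one quoted in the issue and is purely illustrative:

```python
import subprocess


def lock_gpu_clocks(min_mhz: int, max_mhz: int, gpu_id: int = 0) -> None:
    """Lock the GPU core clock range via nvidia-smi (requires admin rights)."""
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_id), "--lock-gpu-clocks", f"{min_mhz},{max_mhz}"],
        check=True,
    )


def reset_gpu_clocks(gpu_id: int = 0) -> None:
    """Restore the default clock behavior."""
    subprocess.run(["nvidia-smi", "-i", str(gpu_id), "--reset-gpu-clocks"], check=True)


def query_sm_clock(gpu_id: int = 0) -> int:
    """Return the current SM clock in MHz, useful for spotting frequency dips during GEMMs."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_id),
         "--query-gpu=clocks.sm", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return int(out.stdout.strip())


if __name__ == "__main__":
    lock_gpu_clocks(1575, 1575)  # the lower frequency mentioned in the issue
    print("SM clock:", query_sm_clock(), "MHz")
    reset_gpu_clocks()
```

Polling the SM clock while the GEMM benchmark runs makes it easy to correlate the performance fluctuations with power-induced down-clocking.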

This is expected and depends on the GPU's silicon: some GPUs have better manufacturing quality and can sustain a higher frequency (better performance), while others cannot.

Some software optimizations for power-related problems have already been merged (e.g. #74: larger L2 reuse, lower L2 power consumption, and better power and frequency for the SM cores), but we have no further plans to improve this (there may be no more solutions on the software side).

There could also be some changes on the system/driver side, e.g. assigning more power to the SMs and less to the L2 cache: https://developer.nvidia.com/blog/nvidia-sets-new-generative-ai-performance-and-scale-records-in-mlperf-training-v4-0/.

Diving further into the last optimization, a notable characteristic of LLM training is its high compute intensity. Especially for smaller-scale LLM runs, math operations can make up a much greater part of the time required to perform each training step compared to operations related to GPU-to-GPU communication. This leads to high Tensor Core utilization and can result in scenarios where Tensor Core throughput is constrained by the power available to the GPU.

In the submission with 512 H100 GPUs, we improved end-to-end performance by redirecting power from the L2 cache memory on each H100 GPU to the streaming multiprocessor (SM), which houses, among other units, NVIDIA Hopper fourth-generation Tensor Cores. This was done by setting a ratio using a boost slider managed by the NVIDIA Management Library (NVML).

This resulted in a higher GPU operating frequency within the same power budget and improved end-to-end performance by 4%. The boost slider can be set with the command nvidia-smi boost-slider --vboost. For more information about this command, including how to list all possible values, run nvidia-smi boost-slider --help.
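The vboost setting from the blog post can also be applied programmatically. A minimal sketch, assuming nvidia-smi is on PATH and administrative privileges are available; the vboost value of 1 is illustrative only, since the supported values depend on the driver and should be checked with nvidia-smi boost-slider --help:

```python
import subprocess


def set_vboost(value: int) -> None:
    """Shift power budget from the L2 cache toward the SMs via the vboost slider."""
    subprocess.run(
        ["nvidia-smi", "boost-slider", "--vboost", str(value)],
        check=True,
    )


def list_vboost_options() -> str:
    """Return the boost-slider help text, which lists the supported values."""
    out = subprocess.run(
        ["nvidia-smi", "boost-slider", "--help"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout


if __name__ == "__main__":
    print(list_vboost_options())
    set_vboost(1)  # illustrative value; pick one from the --help output
```

As with clock locking, it is worth re-running the end-to-end benchmark after changing the slider, since the benefit depends on how power-bound the workload is on a given GPU.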