[Issue]: Creating two HIP streams causes 100% GPU utilization

Question

pxl-th opened this issue 6 months ago

Problem Description

Creating two HIP streams causes 100% GPU utilization.
This is observed on ROCm 5.7 through 6.0, and on at least the RX 7600, RX 7800 XT and RX 7900 XTX.

Here's the utilization graph, captured using resources, during the execution of the C++ MWE below (the same is observed with rocm-smi):

[Image: GPU utilization graph]

Operating System

Ubuntu 22.04.3 LTS (Jammy Jellyfish)

CPU

AMD Ryzen 7 5800X 8-Core Processor

GPU

AMD Radeon RX 7900 XT

ROCm Version

ROCm 6.0.0

Steps to Reproduce

C++ MWE:

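#include <hip/hip_runtime.h>
#include <chrono>
#include <iostream>
#include <thread>

// Report a failed HIP call; the MWE keeps running so the idle period below still happens.
void check(hipError_t res) {
    if (res != hipSuccess) {
        std::cerr << "Fail: " << hipGetErrorString(res) << std::endl;
    }
}

int main() {
    // Create two streams with flags = 0 and priority = 0.
    hipStream_t s1;
    check(hipStreamCreateWithPriority(&s1, 0, 0));

    hipStream_t s2;
    check(hipStreamCreateWithPriority(&s2, 0, 0));

    // Stay idle for 5 seconds; no work is submitted to either stream.
    std::this_thread::sleep_for(std::chrono::seconds(5));
    return 0;
}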

Compile with hipcc main.cpp, run ./a.out, and observe GPU utilization while the program is running.

Jatin Chaudhary · Answer 1 · Mon Jan 15 2024 18:00:47 GMT+0800 (China Standard Time)

I could not reproduce it on Navi 21 (RX 6900 XT).

rocm-smi reads the data from the driver to populate percent usage. I will forward this to the relevant teams to get more information.
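
To cross-check the reading independently of rocm-smi, the driver-reported value can be read directly. The following is a minimal sketch, assuming a Linux system where the amdgpu driver exposes a busy percentage in the sysfs file gpu_busy_percent. The default path (card0) is an assumption, since the card index differs between systems, and the helper names parse_busy_percent and read_busy_percent are hypothetical.

```cpp
#include <fstream>
#include <istream>
#include <optional>
#include <sstream>
#include <string>

// Parse a busy percentage (0-100) from a stream.
// Returns std::nullopt if the content is not an integer in that range.
std::optional<int> parse_busy_percent(std::istream& in) {
    int value = 0;
    if (!(in >> value) || value < 0 || value > 100) {
        return std::nullopt;
    }
    return value;
}

// Read the busy percentage from a sysfs-style file.
// ASSUMPTION: the default path targets card0; adjust the index for your system.
std::optional<int> read_busy_percent(
    const std::string& path = "/sys/class/drm/card0/device/gpu_busy_percent") {
    std::ifstream file(path);
    if (!file) {
        return std::nullopt;  // File missing or unreadable.
    }
    return parse_busy_percent(file);
}
```

Calling read_busy_percent() in a loop while the MWE is sleeping would show whether the raw driver value matches what rocm-smi reports. This requires C++17 for std::optional.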

Anton Smirnov · Answer 2 · Mon Jan 15 2024 18:03:30 GMT+0800 (China Standard Time)

This looks to be a Navi 3 issue. I was also unable to reproduce it on an RX 6700 XT.

Anton Smirnov · Answer 3 · Tue Jan 16 2024 17:51:19 GMT+0800 (China Standard Time)

If this is not a monitoring bug, it might partially explain why we are seeing random hangs in our AMDGPU.jl CI only on Navi 3, since the tests run on multiple workers using multiple streams.

Anton Smirnov · Answer 4 · Tue Jan 23 2024 21:54:14 GMT+0800 (China Standard Time)

Hi! Just curious if there's any update on the issue?

Jatin Chaudhary · Answer 5 · Wed Jan 24 2024 20:53:36 GMT+0800 (China Standard Time)

Nothing as of now. I will update here once we have a solution.

Anton Smirnov · Answer 6 · Wed Feb 21 2024 01:43:36 GMT+0800 (China Standard Time)

This issue seems to be fixed with ROCm 6.0.2 & Linux 6.5.0-18.
Not sure where the fix came from, though.