ROCm / HIP

HIP: C++ Heterogeneous-Compute Interface for Portability

Home Page: https://rocmdocs.amd.com/projects/HIP/

[Issue]: Creating two HIP streams causes 100% GPU utilization

pxl-th opened this issue

Problem Description

Creating two HIP streams causes 100% GPU utilization, even though no work is submitted to them.
This is observed on ROCm 5.7 through 6.0 on at least the RX 7600, RX 7800 XT and RX 7900 XTX.

Here's the utilization graph using resources during the execution of the C++ MWE below (the same is observed with rocm-smi):

[Image: GPU utilization graph]

Operating System

Ubuntu 22.04.3 LTS (Jammy Jellyfish)

CPU

AMD Ryzen 7 5800X 8-Core Processor

GPU

AMD Radeon RX 7900 XT

ROCm Version

ROCm 6.0.0

Steps to Reproduce

C++ MWE:

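#include <hip/hip_runtime.h>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <thread>

// Abort with the HIP error string if a runtime call fails.
void check(hipError_t res) {
    if (res != hipSuccess) {
        std::cerr << "Fail: " << hipGetErrorString(res) << std::endl;
        std::exit(EXIT_FAILURE);
    }
}

int main() {
    // Create two streams with default flags and priority 0.
    hipStream_t s1;
    check(hipStreamCreateWithPriority(&s1, hipStreamDefault, 0));

    hipStream_t s2;
    check(hipStreamCreateWithPriority(&s2, hipStreamDefault, 0));

    // No work is submitted; utilization is observed while the process sleeps.
    std::this_thread::sleep_for(std::chrono::seconds(5));

    check(hipStreamDestroy(s1));
    check(hipStreamDestroy(s2));
    return 0;
}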

Compile with hipcc main.cpp, run ./a.out, and observe GPU utilization while the program is running.

I could not reproduce it on Navi 21 (6900 XT).

rocm-smi reads its data from the driver to populate the percent usage. I will forward this to the relevant teams to get more information.
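
To cross-check the reported number independently of rocm-smi, the utilization value that the amdgpu kernel driver exposes in sysfs can be read directly. The following is a minimal sketch, not a confirmed description of what rocm-smi does internally: it assumes the file is gpu_busy_percent under /sys/class/drm/card0/device/ (the card index varies between systems), and read_gpu_busy_percent is a hypothetical helper name.

```cpp
#include <fstream>
#include <optional>
#include <string>

// Reads one utilization sample (an integer percentage) from a sysfs-style file.
// Returns std::nullopt if the file cannot be opened or does not start with an integer.
// Assumed path on amdgpu systems: /sys/class/drm/card0/device/gpu_busy_percent
// (the card index is an assumption and differs between machines).
std::optional<int> read_gpu_busy_percent(const std::string& path) {
    std::ifstream file(path);
    int percent = 0;
    if (!(file >> percent)) {
        return std::nullopt;
    }
    return percent;
}
```

Calling this helper in a loop (for example once per second) while the MWE sleeps would show whether the driver itself reports the GPU as busy, or whether only the monitoring tool does.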

This looks to be a Navi 3 issue. I was also not able to reproduce it on an RX 6700 XT.

If this is not a monitoring bug, it might partially explain why we are seeing random hangs in our AMDGPU.jl CI only on Navi 3, since the tests run on multiple workers using multiple streams.

Hi! Just curious if there's any update on the issue?

Nothing as of now. I will update here once we have a solution.

This issue seems to be fixed with ROCm 6.0.2 and Linux 6.5.0-18.
I'm not sure where the fix came from, though.