[Issue]: hipMalloc fails even though there is enough memory (I just freed it)

jakub-homola opened this issue 7 months ago · comments

Problem Description

In the program written later, the hipMalloc function fails to allocate memory, even though there should be enough memory on the GPU.

In a loop, the program does the following: it allocates 40 GiB of memory with hipMalloc (the GPU has ~64 GiB available), zeroes it with hipMemsetAsync on a stream created with hipStreamCreate, adds a callback with hipStreamAddCallback that just prints a message, calls hipDeviceSynchronize, frees the 40 GiB with hipFree, and calls hipDeviceSynchronize again to be sure. The next iteration then starts over with a new allocation.

The problem is that in the second iteration, hipMalloc fails with hipErrorOutOfMemory (full output below).

This is unexpected, since I freed the memory from the previous iteration at the end of that iteration. I tried to lower the amount of memory allocated in the second iteration, and it is able to allocate 23 GiB, but not 24. This suggests that the memory from the previous iteration somehow was not really freed, but is still there allocated, blocking me from doing more of these large allocations.

What is going on here?

If I use a default null stream instead of the one created with hipStreamCreate, everything works fine. If I comment out the hipMemsetAsync, everything works fine. If I comment out the hipStreamAddCallback, it usually works, but I have seen it fail too.

I am on a compute node of the LUMI supercomputer (MI250X GPU). I am using rocm-5.2.3, as this is the only version officially supported there. I also tested with an installation of rocm-5.4.3, but the same issue persisted. I am not able to test with a newer ROCm version (could you try it, in case it is already fixed?). (In the ROCm version dropdown in this issue I selected 5.5.0, because older ones are not in the list.)

Also, when the program works (e.g. when using the null stream), the hipMalloc takes a long time to execute (more than a second) starting with the second iteration. The first hipMalloc is ok and takes just about a millisecond. This is also unexpected for me, I would expect some initialization overhead in the first call, and all other calls to be quick, not the other way around.

Furthermore (assuming no crashes), let N be the allocation size and M the GPU memory capacity. The first approximately M/N iterations of the loop allocate the memory quickly, in under a millisecond. After this (once all of the GPU memory has been allocated at least once), the allocations take approximately 35 milliseconds per gigabyte. There is a strange pattern here. Is this expected?

The full output of the program:

Iteration 1
  Allocating...
  Allocated in 1.701832 ms
  Hello from callback
Iteration 2
  Allocating...
HIP Error 2 hipErrorOutOfMemory: hipErrorOutOfMemory. In file 'source.hip.cpp' on line 31

Operating System

SUSE Linux Enterprise Server 15 SP4

CPU

AMD EPYC 7A53 64-Core Processor

GPU

AMD Instinct MI250X

ROCm Version

ROCm 5.5.0

ROCm Component

No response

Steps to Reproduce

On LUMI compute node (MI250X GPU), rocm-5.2.3 (also tested with 5.4.3 and the same issue occurs) (newer versions not supported on LUMI).

Load modules: module load partition/G LUMI/23.09 rocm/5.2.3
Compile: hipcc -g -O2 -fopenmp --offload-arch=gfx90a:sramecc+:xnack- source.hip.cpp -o program.x
Run: ./program.x
where source.hip.cpp is the following:

#include <cstdio>
#include <cstdlib>   // exit
#include <hip/hip_runtime.h>
#include <omp.h>     // omp_get_wtime

#ifndef CHECK
#define CHECK(status) do { check((status), __FILE__, __LINE__); } while(false)
inline static void check(hipError_t error_code, const char *file, int line)
{
    if (error_code != hipSuccess)
    {
        fprintf(stderr, "HIP Error %d %s: %s. In file '%s' on line %d\n", error_code, hipGetErrorName(error_code), hipGetErrorString(error_code), file, line);
        fflush(stderr);
        exit(error_code);
    }
}
#endif

int main()
{
    const size_t forty_gigabytes = (size_t)40 << 30;

    hipStream_t stream;
    CHECK(hipStreamCreate(&stream));
    
    for(int i = 1; i <= 3; i++)
    {
        printf("Iteration %d\n", i);
        void * mem;
        printf("  Allocating...\n");
        double start = omp_get_wtime();
        CHECK(hipMalloc(&mem, forty_gigabytes));
        double stop = omp_get_wtime();
        printf("  Allocated in %f ms\n", (stop - start) * 1000.0);
        CHECK(hipMemsetAsync(mem, 0, forty_gigabytes, stream));
        CHECK(hipStreamAddCallback(stream, [](hipStream_t /*s*/, hipError_t /*e*/, void * /*arg*/){
            printf("  Hello from callback\n");
        }, nullptr, 0));
        CHECK(hipDeviceSynchronize());
        CHECK(hipFree(mem));
        CHECK(hipDeviceSynchronize());
    }

    CHECK(hipStreamDestroy(stream));

    return 0;
}

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD EPYC 7A53 64-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD EPYC 7A53 64-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2000
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            32
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    131320956(0x7d3cc7c) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    131320956(0x7d3cc7c) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    131320956(0x7d3cc7c) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    AMD EPYC 7A53 64-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD EPYC 7A53 64-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2000
  BDFID:                   0
  Internal Node ID:        1
  Compute Unit:            32
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    132112468(0x7dfe054) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    132112468(0x7dfe054) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    132112468(0x7dfe054) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 3
*******
  Name:                    AMD EPYC 7A53 64-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD EPYC 7A53 64-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    2
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2000
  BDFID:                   0
  Internal Node ID:        2
  Compute Unit:            32
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    132112468(0x7dfe054) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    132112468(0x7dfe054) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    132112468(0x7dfe054) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 4
*******
  Name:                    AMD EPYC 7A53 64-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD EPYC 7A53 64-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    3
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2000
  BDFID:                   0
  Internal Node ID:        3
  Compute Unit:            32
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    132090576(0x7df8ad0) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    132090576(0x7df8ad0) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    132090576(0x7df8ad0) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 5
*******
  Name:                    gfx90a
  Uuid:                    GPU-ab9aee17a8d0754d
  Marketing Name:
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    4
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      8192(0x2000) KB
  Chip ID:                 29704(0x7408)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   1700
  BDFID:                   50688
  Internal Node ID:        4
  Compute Unit:            110
  SIMDs per CU:            4
  Shader Engines:          8
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

Additional Information

$ rocm-smi --showdriverversion


======================= ROCm System Management Interface =======================
========================= Version of System Component ==========================
Driver version: 5.16.9.22.20
================================================================================
============================= End of ROCm SMI Log ==============================

Karthik J · Answer 1 · Thu Jan 11 2024 00:52:28 GMT+0800 (China Standard Time)

I do not see the hipErrorOutOfMemory issue with the sample provided, but the second and third hipMalloc calls do take considerably longer than the first.

The reason is that hipFree uses a delayed worker-thread mechanism, in which the clearing of the bits happens after hipFree() has returned control to the caller.

Jakub Homola · Answer 2 · Thu Jan 11 2024 02:01:04 GMT+0800 (China Standard Time)

Thanks @kjayapra-amd for the response.

What rocm version did you try? Was it the 5.2.3? Or a newer one? (which would mean it is already fixed) And did you test on LUMI or some other machine?

... hipFree has delayed worker thread mechanism ...

Well, I expected something like that, good to know.

Karthik J · Answer 3 · Thu Jan 11 2024 02:11:12 GMT+0800 (China Standard Time)

Tested it on MI250X, but not LUMI (I don't think that would make a big difference).
ROCm version: 6.x

Jakub Homola · Answer 4 · Thu Jan 11 2024 02:25:07 GMT+0800 (China Standard Time)

Alright, it might already be fixed in newer versions. I will close this once I have access to a newer version and can test it.

Anyway, if you have access to any MI250X machine with an older ROCm version, could you try it there? It would help to know whether you can replicate it at all, and possibly to pinpoint the version in which it was fixed.

Karthik J · Answer 5 · Wed Jan 24 2024 05:49:20 GMT+0800 (China Standard Time)

The recommendation would be to move to a newer release. Thanks!