ROCm / HIP

HIP: C++ Heterogeneous-Compute Interface for Portability

Home Page: https://rocmdocs.amd.com/projects/HIP/

[Issue]: `__syncthreads` not syncing global memory as per its definition.

JackAKirk opened this issue

Problem Description

Hi, I've been investigating the AMD memory model to understand how well defined it is with respect to the latest C++ memory model, in order to judge how appropriate it is for safety-critical applications. The memory model documentation is extensive; however, I have found the issue below, which I think is either a documentation error or a (potentially very serious) compiler bug (see the very bottom for the concise question). Here is the background:

According to the available documentation, e.g. https://rocm.docs.amd.com/projects/HIP/en/docs-5.7.1/reference/kernel_language.html#synchronization-functions
__syncthreads in HIP maps exactly to the definition of __syncthreads in CUDA. An important part of this function's definition is:

  • Def 1: "and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block."

Now, looking at e.g. the MI200 (gfx90a) ISA, section "4.4. Data Dependency Resolution", it is quite clear:

  • "Shader hardware resolves most data dependencies, but a few cases must be explicitly handled
    by the shader program. In these cases, the program must insert S_WAITCNT instructions to
    ensure that previous operations have completed before continuing."

Then in "Table 17. IB_STS"

  • "VM_CNT ...
    Number of VMEM instructions issued but not yet returned"

  • "LGKM_CNT 11:8 LDS, GDS, Constant-memory and Message instructions issued-but-not-completed count."

Then, under "10.4 Global":

  • "Since these instructions do not access LDS, only VM_CNT is used, not LGKM_CNT. If a global
    instruction does attempt to access LDS, the instruction returns MEM_VIOL."

and:

  • "S_WAITCNT Wait for the counts of outstanding lds, vector-memory and
    export/vmem-write-data to be at or below the specified levels."

OK, so the above information can be summarized as:

  • For __syncthreads() to synchronize global and shared (LDS) memory accesses (on gfx90a), it needs to issue
    both s_waitcnt lgkmcnt(0) and s_waitcnt vmcnt(0).

The above is consistent with the definition of fences in the gfx90a memory model documentation (https://llvm.org/docs/AMDGPUUsage.html#memory-model-gfx90a). Note that s_waitcnt vmcnt(0) is required in the fences, and that the definition of __syncthreads() (see "Def 1" above) does not permit omitting any of the optional instructions (see the gfx90a memory model) in these fence operations.

However, if I compile the following on gfx90a with -save-temps

#include <hip/hip_runtime.h>

__global__ void test_kernel() {
  __syncthreads();
}

int main() {
  test_kernel<<<128, 64>>>();
}

the corresponding asm generated for __syncthreads() is the following (note that this does not change with the optimization level or when I add loads/stores etc.):

        s_waitcnt lgkmcnt(0)
        s_barrier
        s_waitcnt lgkmcnt(0)
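
As an aside, a check like this can be automated rather than done by eye. The following is a minimal sketch in plain C++ that scans an assembly listing (such as one produced with -save-temps) and records which counters are waited to zero before the first s_barrier. The names BarrierWaits and scan_listing are hypothetical helpers invented for illustration and are not part of ROCm or HIP. It is a plain text match only: it does not model intervening memory instructions or comments, so it can show that a wait is absent but not that the generated code is correct.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper type (not part of ROCm): which counters are waited to
// zero by s_waitcnt instructions that appear before the first s_barrier.
struct BarrierWaits {
    bool found_barrier = false;
    bool lgkmcnt_zero_before = false;  // LDS/GDS/constant/message counter
    bool vmcnt_zero_before = false;    // vector-memory (global) counter
};

// Scan the listing line by line and stop at the first s_barrier.
// Assumption: operands are printed as "lgkmcnt(0)" / "vmcnt(0)", as in the
// listing quoted in this issue. If no s_barrier is present, the flags simply
// reflect every s_waitcnt in the listing.
BarrierWaits scan_listing(const std::string& listing) {
    BarrierWaits result;
    std::istringstream in(listing);
    std::string line;
    while (std::getline(in, line)) {
        if (line.find("s_barrier") != std::string::npos) {
            result.found_barrier = true;
            break;
        }
        if (line.find("s_waitcnt") == std::string::npos) {
            continue;  // not a wait instruction, nothing to record
        }
        if (line.find("lgkmcnt(0)") != std::string::npos) {
            result.lgkmcnt_zero_before = true;
        }
        if (line.find("vmcnt(0)") != std::string::npos) {
            result.vmcnt_zero_before = true;
        }
    }
    return result;
}
```

Passing the three-line listing above to scan_listing yields found_barrier and lgkmcnt_zero_before set and vmcnt_zero_before unset, which is the observation behind the question below.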

So my question is very straightforward:

  • Is the compiler implementation of __syncthreads() in HIP wrong, because it isn't synchronizing global VMEM operations?
    or
  • Is the documentation wrong, and e.g. VMEM instructions are always synchronous on gfx90a?

Thanks

Note that this seems to happen for all ROCm versions up to at least 5.7.1.

Operating System

Crusher/Frontier

CPU

Crusher/Frontier cpu

GPU

AMD Instinct MI250X

ROCm Version

ROCm 5.7.1

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@JackAKirk your test kernel isn't actually doing any memory operations. The compiler could optimize away the entire body of the kernel with no observable effect (except possibly in the runtime). You can assume that __syncthreads() operates as expected in all cases. If you find any case where it does not, please open a new issue with a reproducer.

In fact, with optimization enabled and a later compiler, the ISA for your kernel is reduced to:
s_barrier
s_endpgm

Thanks for the reply.

This is one example I tried that both writes to global memory prior to a __syncthreads and reads from global memory after a __syncthreads, with the same behavior that I described in my message: https://github.com/ROCm/HIP-Examples/blob/ff8123937c8851d86b1edfbad9f032462c48aa05/HIP-Examples-Applications/RecursiveGaussian/RecursiveGaussian.cpp

@JackAKirk please clarify. Are you reporting that one or more of the __syncthreads() calls in the example is not working as expected? Which ones? Which variables written before the call do not have the correct contents after the call? Can you further reduce the code to a single call to __syncthreads()?

Sure. To clarify, I don't have example code that I know is failing due to incorrect __syncthreads() behavior leading to invalid contents. As described in my first message, I have just observed that __syncthreads() does not lead to any s_waitcnt vmcnt(0) being emitted, which, from what I described of the documentation, is expected and required in order to sync global memory.

I just want to know whether this is due to me misunderstanding the documentation, the documentation being wrong, or a bug in the form of a missing s_waitcnt vmcnt(0). If this is expected behavior, then perhaps you could update the documentation to make this clearer.

Thanks

@JackAKirk thanks for clarifying. I'm not clear on what kind of documentation update you have in mind. The example you provided has no memory operations and it is common for compilers to suppress unnecessary operations. If you choose an example that includes memory operations, I expect you will see instructions that ensure proper behavior.

I described the relevant memory operations the sample has here (reads/writes to global memory): #3413 (comment)

I guess I've just misunderstood something, but the documentation I've read implies (to me) that these are the relevant memory operations.