[Issue]: `__syncthreads` not syncing global memory as per its definition.
JackAKirk opened this issue · comments
Problem Description
Hi, I've been investigating the AMD memory model to understand how well defined it is with respect to the latest C++ memory model, to judge how appropriate it is for safety-critical applications. The memory model documentation is extensive; however, there is the issue below that I think is either a documentation error or a (potentially very serious) compiler bug (see right at the bottom for the concise question!). Here is the background:
From available documentation, e.g. https://rocm.docs.amd.com/projects/HIP/en/docs-5.7.1/reference/kernel_language.html#synchronization-functions , `__syncthreads` in HIP maps exactly to the definition of `__syncthreads` in CUDA. An important part of this function definition is:
- Def 1: "and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block."
Now, looking at e.g. the mi200 (gfx90a) isa section "4.4. Data Dependency Resolution", it is quite clear:
- "Shader hardware resolves most data dependencies, but a few cases must be explicitly handled by the shader program. In these cases, the program must insert S_WAITCNT instructions to ensure that previous operations have completed before continuing."
Then in "Table 17. IB_STS":
- "VM_CNT ... Number of VMEM instructions issued but not yet returned"
- "LGKM_CNT 11:8 LDS, GDS, Constant-memory and Message instructions issued-but-not-completed count."
Then under "10.4 Global":
- "Since these instructions do not access LDS, only VM_CNT is used, not LGKM_CNT. If a global instruction does attempt to access LDS, the instruction returns MEM_VIOL."
and:
- "S_WAITCNT Wait for the counts of outstanding lds, vector-memory and export/vmem-write-data to be at or below the specified levels."
OK so we can clearly summarize the above information as:
- For `__syncthreads()` to synchronize global and shared (LDS) memory accesses (on gfx90a), it needs to call `s_waitcnt lgkmcnt(0)` and `s_waitcnt vmcnt(0)`.
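The reasoning in this summary can be sketched as a toy host-side C++ model of the two counters quoted above. This is only an illustration of the argument, not a description of how the hardware or the compiler actually behaves; `WaveCounters`, `kNoWait`, and `waitcnt_satisfied` are hypothetical names introduced for this sketch.

```cpp
#include <cassert>

// Toy model of the two per-wave counters quoted from the ISA document.
// All names here are hypothetical and exist only for illustration.
struct WaveCounters {
    int vm_cnt = 0;    // VMEM (global) instructions issued but not yet returned
    int lgkm_cnt = 0;  // LDS/GDS/constant/message instructions not yet completed

    void issue_global() { ++vm_cnt; }
    void issue_lds() { ++lgkm_cnt; }
    void complete_global() { if (vm_cnt > 0) --vm_cnt; }
    void complete_lds() { if (lgkm_cnt > 0) --lgkm_cnt; }
};

// Sentinel meaning "this counter is not named in the s_waitcnt".
constexpr int kNoWait = -1;

// Returns true if a wait with the given thresholds would let the wave proceed,
// i.e. every named counter is at or below its threshold.
bool waitcnt_satisfied(const WaveCounters& c, int vmcnt, int lgkmcnt) {
    const bool vm_ok = (vmcnt == kNoWait) || (c.vm_cnt <= vmcnt);
    const bool lgkm_ok = (lgkmcnt == kNoWait) || (c.lgkm_cnt <= lgkmcnt);
    return vm_ok && lgkm_ok;
}

// Scenario: one global store and one LDS store are issued, and only the
// LDS store has completed so far.
WaveCounters pending_global_store() {
    WaveCounters c;
    c.issue_global();
    c.issue_lds();
    c.complete_lds();
    return c;
}
```

In this model, `waitcnt_satisfied(pending_global_store(), kNoWait, 0)` (the analogue of `s_waitcnt lgkmcnt(0)`) is true even though a global store is still outstanding, while `waitcnt_satisfied(pending_global_store(), 0, kNoWait)` (the analogue of `s_waitcnt vmcnt(0)`) is false until `complete_global()` is called. That is the gap the summary is pointing at, under the assumption that the quoted counter descriptions are the whole story.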
The above is consistent with the definition of fences in the gfx90a memory model documentation (https://llvm.org/docs/AMDGPUUsage.html#memory-model-gfx90a): note that `s_waitcnt vmcnt(0)` is required in the fences, and note that the definition of `__syncthreads()` (see "Def 1" above) does not allow any omission of any of the optional instructions (see gfx90a memory model) in these fence operations.
However, if I compile the following on gfx90a with `-save-temps`:

    #include <hip/hip_runtime.h>

    __global__ void test_kernel() {
        __syncthreads();
    }

    int main() {
        test_kernel<<<128, 64>>>();
    }

the corresponding asm code generated by `__syncthreads();` is (note this doesn't change depending on opt level or whether I add loads/stores etc.):

    s_waitcnt lgkmcnt(0)
    s_barrier
    s_waitcnt lgkmcnt(0)

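To make this kind of inspection repeatable across kernels, a small host-side helper could scan a `-save-temps` listing for the waits that precede the first `s_barrier`. The following is a minimal sketch that assumes one instruction per line and uses plain substring matching; `WaitsBeforeBarrier` and `scan_listing` are hypothetical names, and the scan does not handle comments, non-zero counts, or other ways the compiler might order the instructions.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Which zero-count waits appear before the first s_barrier in a listing.
// Hypothetical helper type, introduced only for this sketch.
struct WaitsBeforeBarrier {
    bool found_barrier = false;
    bool vmcnt0 = false;
    bool lgkmcnt0 = false;
};

// Naive line-by-line scan. `listing` is assumed to hold one instruction
// per line, as in the assembly emitted with -save-temps.
WaitsBeforeBarrier scan_listing(const std::string& listing) {
    WaitsBeforeBarrier result;
    std::istringstream in(listing);
    std::string line;
    while (std::getline(in, line)) {
        if (line.find("s_barrier") != std::string::npos) {
            result.found_barrier = true;
            break;  // only waits issued before the barrier are of interest
        }
        if (line.find("s_waitcnt") == std::string::npos) {
            continue;  // not a wait instruction
        }
        // A single s_waitcnt may name several counters on one line.
        if (line.find("lgkmcnt(0)") != std::string::npos) {
            result.lgkmcnt0 = true;
        }
        if (line.find("vmcnt(0)") != std::string::npos) {
            result.vmcnt0 = true;
        }
    }
    return result;
}
```

Applied to the three-line listing above, `scan_listing` reports `found_barrier` and `lgkmcnt0` as true and `vmcnt0` as false, which matches the observation being reported. A false `vmcnt0` only says that no such wait appears textually before the barrier; on its own it does not show that one was required.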
So then my question is very straightforward:
- Is the compiler implementation of `__syncthreads()` in HIP wrong because it isn't synchronizing global VMEM operations?
- Or is the documentation wrong and e.g. VMEM instructions are always synchronous for gfx90a?
Thanks
Note that this seems to be happening for all rocm versions up to at least 5.7.1.
Operating System
Crusher/Frontier
CPU
Crusher/Frontier cpu
GPU
AMD Instinct MI250X
ROCm Version
ROCm 5.7.1
@JackAKirk your test kernel isn't actually doing any memory operations. The compiler could optimize away the entire body of the kernel with no observable effect (except possibly in the runtime). You can assume that __syncthreads() operates as expected in all cases. If you find any case where it does not, please open a new issue with a reproducer.
In fact, with optimization enabled and a later compiler, the ISA for your kernel is reduced to:
s_barrier
s_endpgm
Thanks for the reply.
This is one example I tried that both writes to global memory prior to a `__syncthreads` and reads from global memory after a `__syncthreads`, with the same behavior that I described in my message: https://github.com/ROCm/HIP-Examples/blob/ff8123937c8851d86b1edfbad9f032462c48aa05/HIP-Examples-Applications/RecursiveGaussian/RecursiveGaussian.cpp
@JackAKirk please clarify. Are you reporting that one or more of the __syncthreads() calls in the example is not working as expected? Which ones? Which variables written before the call do not have the correct contents after the call? Can you further reduce the code to a single call to __syncthreads()?
Sure. To clarify, I don't have example code that I know is failing due to incorrect `__syncthreads()` behavior leading to invalid contents. As described in my first message, I have just observed that `__syncthreads()` is not leading to any call of `s_waitcnt vmcnt(0)`, which, from what I described of the documentation, is expected and required in order to sync global memory.
I just want to know whether this is due to me misunderstanding the documentation, the documentation being wrong, or a bug in which an `s_waitcnt vmcnt(0)` is missing. If this is expected behavior, then perhaps you could just update the documentation to make this clearer.
Thanks
@JackAKirk thanks for clarifying. I'm not clear on what kind of documentation update you have in mind. The example you provided has no memory operations and it is common for compilers to suppress unnecessary operations. If you chose an example that includes memory operations, I expect you will see instructions that ensure proper behavior.
I described the relevant memory operations the sample has here (read/writes to global): #3413 (comment)
I guess I've just misunderstood something, but the documentation I've read implies (to me) these are the relevant memory operations.