m4rs-mt / ILGPU

ILGPU JIT Compiler for high-performance .NET GPU programs

Home Page: http://www.ilgpu.net


Question about hardware memory alignments

TriceHelix opened this issue

Dear ILGPU Devs & Community,

From my understanding, memory allocation routines return pointers aligned to a specific number of bytes for technical and performance reasons. According to the CUDA Programming Guide (quoted below), that number is at least 256 for CUDA GPUs. I imagine other types of hardware adhere to different alignments.

Does ILGPU provide a hardware-independent way to query this information?

I am asking because I'm trying to reduce the number of memory allocations I make (to avoid accumulating API overhead). Instead of multiple buffers, I allocate a single one and split it into multiple sections for arrays and variables (a sketch of what I mean follows the quote below). According to the guide I referenced, these sections should be aligned the same way the original allocation was, otherwise things may break (at least on CUDA devices).
Quote:

Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes.

Reading non-naturally aligned 8-byte or 16-byte words produces incorrect results (off by a few words), so special care must be taken to maintain alignment of the starting address of any value or array of values of these types. A typical case where this might be easily overlooked is when using some custom global memory allocation scheme, whereby the allocations of multiple arrays (with multiple calls to cudaMalloc() or cuMemAlloc()) is replaced by the allocation of a single large block of memory partitioned into multiple arrays, in which case the starting address of each array is offset from the block’s starting address.
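
To make this concrete, here is roughly what my current splitting strategy looks like. This is a minimal sketch assuming ILGPU's Allocate1D/SubView/Cast APIs; the method name and sizes are only illustrative:

```csharp
using ILGPU;
using ILGPU.Runtime;

// The scheme in question: one large byte buffer, split by raw byte offsets.
static void NaiveSplit(Accelerator accelerator)
{
    const long FloatCount = 1000;              // 4000 bytes of floats
    const long IntCount = 500;                 // 2000 bytes of ints
    long bytesA = FloatCount * sizeof(float);
    long bytesB = IntCount * sizeof(int);

    using var buffer = accelerator.Allocate1D<byte>(bytesA + bytesB);
    ArrayView<byte> raw = buffer.View;

    // Section A starts at offset 0 and inherits the allocation's alignment.
    ArrayView<float> floats = raw.SubView(0, bytesA).Cast<float>();

    // Section B starts at byte 4000, which is NOT a multiple of 256,
    // so it loses the alignment guarantee of the original allocation.
    ArrayView<int> ints = raw.SubView(bytesA, bytesB).Cast<int>();
}
```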

Hi @TriceHelix, thanks for reaching out to us regarding this topic. Currently, ILGPU is internally aware of the global pointer alignment (e.g., here) and uses this information to automatically vectorize IO ops, for instance. Based on the backend and the accelerator (...and your kernel, of course), ILGPU will do "everything for you" when using the O2 optimization pipeline. However, you can also control the alignment of views inside kernels via view.AlignToXYZ() and view.AlignTo(xyz) (where xyz evaluates to a constant). Moreover, you can "tell" ILGPU that a particular view has a certain alignment via view.AsAlignedXYZ() and view.AsAligned(xyz) (where xyz evaluates to a constant at kernel compile time).
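
For illustration, a minimal kernel-side sketch of the latter (the alignment argument must evaluate to a compile-time constant, and passing a view that is not actually aligned would of course be invalid):

```csharp
using ILGPU;

// Assert to the compiler that the incoming view is 256-byte aligned,
// allowing the O2 pipeline to emit vectorized loads/stores.
static void CopyKernel(Index1D i, ArrayView<float> source, ArrayView<float> target)
{
    var alignedSource = source.AsAligned(256);
    target[i] = alignedSource[i];
}
```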

To answer your actual question: there is no API to query the automatic alignment rules at the moment. However, I think it makes absolute sense to add one. @MoFtZ what do you think?
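
Something along these lines, perhaps (a purely hypothetical sketch; neither the MemoryAlignment property nor any equivalent exists in ILGPU today):

```csharp
// Hypothetical query API, shown only to make the feature request concrete.
long alignment = accelerator.Device.MemoryAlignment;

// Sub-allocations could then be padded using the queried value.
long alignedOffset = (offset + alignment - 1) / alignment * alignment;
```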

Hi @m4rs-mt, thanks for your fast response. I was aware of the alignment methods; however, I believe I am still running into issues with them because of my allocation & dissection strategy.
The main buffer I allocate is of type byte, since the data I actually want to store consists of multiple types. I originally thought I could simply compute the byte offsets of all sections, create sub-views, and then cast the individual views to the corresponding data types (all done on the CPU, prior to launching any kernels). Much like the CUDA guide says, doing this without any alignment was a mistake and led to many errors and inconsistent data. The solution was to allocate some extra bytes as alignment padding, align each sub-section to the hardware's preferred alignment (for me that's 256 bytes), and only then cast the ArrayView to a different type. Hence the need for an API to query this alignment information in order to easily target different hardware.

EDIT: I have performed more tests on my code and can confirm that aligning subsections of the main buffer to 256 bytes (or a multiple thereof) before using them independently is, in fact, the only way to avoid errors and out-of-bounds memory accesses. I'm looking forward to the new API!
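
For anyone finding this thread later, here is the corrected version of my earlier sketch. AlignUp is my own helper, and the hard-coded 256 is exactly the value the requested query API would replace:

```csharp
using ILGPU;
using ILGPU.Runtime;

// Round an offset up to the next multiple of the hardware alignment.
static long AlignUp(long offset, long alignment) =>
    (offset + alignment - 1) / alignment * alignment;

static void AlignedSplit(Accelerator accelerator)
{
    const long Alignment = 256;                // CUDA's documented minimum
    const long FloatCount = 1000;              // 4000 bytes of floats
    const long IntCount = 500;                 // 2000 bytes of ints
    long bytesA = FloatCount * sizeof(float);
    long bytesB = IntCount * sizeof(int);

    // Reserve extra bytes so section B can start on an aligned boundary.
    using var buffer = accelerator.Allocate1D<byte>(AlignUp(bytesA, Alignment) + bytesB);
    ArrayView<byte> raw = buffer.View;

    // Section A starts at offset 0 and inherits the allocation's alignment.
    ArrayView<float> floats = raw.SubView(0, bytesA).Cast<float>();

    // Section B now begins at byte 4096 (= AlignUp(4000, 256)).
    ArrayView<int> ints = raw.SubView(AlignUp(bytesA, Alignment), bytesB).Cast<int>();
}
```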

It turns out most of the issues I was facing were due to the Cast method messing with aligned memory, not the alignments themselves. I also realized that ArrayViews are not equivalent to pointers, as they all refer to a buffer at a certain offset under the hood. As you've said, all of that alignment for the buffer is done automatically. However, I'd still say an API making the alignment amounts transparent would be neat. For now, consider the other issue the actual problem I've encountered.