NVIDIA / gdrcopy

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Increasing utilization - gdrcopy_copybw

GuyZilberman opened this issue · comments

Hi,

I am running the gdrcopy_copybw benchmark on NVIDIA A100 80GB PCIe with Gen 4 PCIe.
It appears that the utilization doesn't reach its maximal possible value, getting about 20 GB/s out of the possible 32 GB/s, for buffers of sizes 32kB-8MB.
Upon looking into the code, it appears that in your implementation of memcpy_uncached_store_avx you are using 256 bit functions _mm256_load_pd and _mm256_stream_pd.
What could be the reason for that? Is there a reason for not using 512 bit functions _mm512_load_pd and _mm512_stream_pd instead? Could using the 512 bit functions increase the utilization?

Thanks!

@pakmarkthub @drossetti can one of you please take a look at this question?
We are thinking of implementing 512b copy ourselves (and potentially contribute to the repository)
but we want to make sure this is not a waste of time
have you considered supporting _mm512_load_pd in the past and rejected it for some reason?
appreciate your help
Eshcar

It appears that the utilization doesn't reach its maximal possible value, getting about 20 GB/s out of the possible 32 GB/s, for buffers of sizes 32kB-8MB.

This question has been asked multiple times over the years, see for example Pak's comments in #286.
The short story is that CPU cores are are not designed as perfect DMA controllers, and can only generate small PCIe write packets, typically up to 64B instead of 256B which is the maximum supported by the GPU.

Regarding mm512, in the past I tried to use AVX2 without any improvement. That is architecturally understandable as the cache line bandwidth between a CPU core and its L1 is already saturated using finer grain loads.

That being said, there might be architectural innovations in the new CPUs, so a quick check would not hurt.

Thank you for your response!

We will try to do it and will let you know if we get any improvement in our attempt.

Are any of the previously attempted implementations accessible anywhere for reference?