NVIDIA / cuCollections

[BUG]: Performance regression with shared memory query operations

PointKernel opened this issue

Is this a duplicate?

Type of Bug

Performance

Describe the bug

When the hash table is small enough, one can manually load it into shared memory and then query it. The performance in this case should be comparable to that of a global memory hash table implicitly cached in L1. However, the shared memory `contains` delivers much worse performance than the global memory one in this case:

shared_memory (block_size = 1024):

| Key | Distribution | NumOutputs | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s |
|-----|--------------|------------|---------|------------|--------|------------|--------|----------|
| I32 | UNIQUE | 800 | 39696x | 15.710 us | 28.65% | 12.599 us | 13.00% | 63.495M |
| I32 | UNIQUE | 2000 | 32816x | 18.259 us | 22.74% | 15.241 us | 10.69% | 131.221M |
| I32 | UNIQUE | 8000 | 31920x | 18.759 us | 21.58% | 15.670 us | 8.26% | 510.539M |
| I32 | UNIQUE | 80000 | 11728x | 45.662 us | 7.73% | 42.656 us | 2.94% | 1.875G |
| I32 | UNIQUE | 800000 | 1604x | 314.924 us | 1.05% | 311.898 us | 0.38% | 2.565G |
| I32 | UNIQUE | 8000000 | 167x | 3.006 ms | 0.19% | 3.003 ms | 0.15% | 2.664G |
| I32 | UNIQUE | 80000000 | 17x | 29.929 ms | 0.01% | 29.927 ms | 0.00% | 2.673G |
| I64 | UNIQUE | 800 | 35024x | 17.337 us | 23.02% | 14.278 us | 7.82% | 56.031M |
| I64 | UNIQUE | 2000 | 30192x | 19.622 us | 19.87% | 16.563 us | 6.59% | 120.751M |
| I64 | UNIQUE | 8000 | 29520x | 20.023 us | 20.06% | 16.942 us | 7.51% | 472.210M |
| I64 | UNIQUE | 80000 | 10080x | 52.711 us | 6.82% | 49.652 us | 2.65% | 1.611G |
| I64 | UNIQUE | 800000 | 1403x | 380.149 us | 0.96% | 377.119 us | 0.50% | 2.121G |
| I64 | UNIQUE | 8000000 | 138x | 3.632 ms | 0.11% | 3.629 ms | 0.03% | 2.205G |
| I64 | UNIQUE | 80000000 | 14x | 35.944 ms | 0.01% | 35.941 ms | 0.00% | 2.226G |

global memory:

| Key | Distribution | NumOutputs | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s |
|-----|--------------|------------|---------|------------|--------|------------|--------|----------|
| I32 | UNIQUE | 800 | 66784x | 10.678 us | 47.51% | 7.487 us | 18.63% | 106.854M |
| I32 | UNIQUE | 2000 | 64320x | 10.932 us | 44.70% | 7.774 us | 16.48% | 257.269M |
| I32 | UNIQUE | 8000 | 62640x | 11.161 us | 44.79% | 7.982 us | 18.55% | 1.002G |
| I32 | UNIQUE | 80000 | 35520x | 17.181 us | 26.87% | 14.080 us | 14.23% | 5.682G |
| I32 | UNIQUE | 800000 | 7552x | 69.285 us | 5.07% | 66.239 us | 2.00% | 12.077G |
| I32 | UNIQUE | 8000000 | 2016x | 622.167 us | 1.21% | 619.066 us | 1.09% | 12.923G |
| I32 | UNIQUE | 80000000 | 848x | 6.216 ms | 0.82% | 6.213 ms | 0.82% | 12.876G |
| I64 | UNIQUE | 800 | 64640x | 10.884 us | 46.03% | 7.736 us | 19.71% | 103.415M |
| I64 | UNIQUE | 2000 | 63936x | 11.054 us | 45.48% | 7.822 us | 16.70% | 255.690M |
| I64 | UNIQUE | 8000 | 59792x | 11.565 us | 42.66% | 8.363 us | 16.62% | 956.566M |
| I64 | UNIQUE | 80000 | 33040x | 18.327 us | 26.25% | 15.140 us | 14.08% | 5.284G |
| I64 | UNIQUE | 800000 | 7152x | 74.976 us | 5.02% | 71.897 us | 2.42% | 11.127G |
| I64 | UNIQUE | 8000000 | 2016x | 671.884 us | 1.34% | 668.698 us | 1.24% | 11.964G |
| I64 | UNIQUE | 80000000 | 1104x | 6.662 ms | 0.86% | 6.659 ms | 0.86% | 12.014G |

How to Reproduce

#458

Additional context

Note: shared memory `find` has the same performance issue, so this is likely a common problem for query operations.

The quick takeaway of this experiment is that we should not use a shared memory hash table for query operations. Further investigation is needed to understand where the performance regression comes from.

contains_profiling.zip

NCU profiling results