[BUG]: Performance regression with shared memory query operations
PointKernel opened this issue
Is this a duplicate?
- I confirmed there appear to be no duplicate issues for this bug (https://github.com/NVIDIA/cuCollections/issues)
Type of Bug
Performance
Describe the bug
When the hash table is small enough, one can manually load it into shared memory and then query it. The performance in this case should be comparable to that of a global memory hash table implicitly cached in L1. However, shared memory `contains` delivers much worse performance than the global memory one in this case:
shared_memory (block_size = 1024):
Key | Distribution | NumOutputs | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s |
---|---|---|---|---|---|---|---|---|
I32 | UNIQUE | 800 | 39696x | 15.710 us | 28.65% | 12.599 us | 13.00% | 63.495M |
I32 | UNIQUE | 2000 | 32816x | 18.259 us | 22.74% | 15.241 us | 10.69% | 131.221M |
I32 | UNIQUE | 8000 | 31920x | 18.759 us | 21.58% | 15.670 us | 8.26% | 510.539M |
I32 | UNIQUE | 80000 | 11728x | 45.662 us | 7.73% | 42.656 us | 2.94% | 1.875G |
I32 | UNIQUE | 800000 | 1604x | 314.924 us | 1.05% | 311.898 us | 0.38% | 2.565G |
I32 | UNIQUE | 8000000 | 167x | 3.006 ms | 0.19% | 3.003 ms | 0.15% | 2.664G |
I32 | UNIQUE | 80000000 | 17x | 29.929 ms | 0.01% | 29.927 ms | 0.00% | 2.673G |
I64 | UNIQUE | 800 | 35024x | 17.337 us | 23.02% | 14.278 us | 7.82% | 56.031M |
I64 | UNIQUE | 2000 | 30192x | 19.622 us | 19.87% | 16.563 us | 6.59% | 120.751M |
I64 | UNIQUE | 8000 | 29520x | 20.023 us | 20.06% | 16.942 us | 7.51% | 472.210M |
I64 | UNIQUE | 80000 | 10080x | 52.711 us | 6.82% | 49.652 us | 2.65% | 1.611G |
I64 | UNIQUE | 800000 | 1403x | 380.149 us | 0.96% | 377.119 us | 0.50% | 2.121G |
I64 | UNIQUE | 8000000 | 138x | 3.632 ms | 0.11% | 3.629 ms | 0.03% | 2.205G |
I64 | UNIQUE | 80000000 | 14x | 35.944 ms | 0.01% | 35.941 ms | 0.00% | 2.226G |
global memory:
Key | Distribution | NumOutputs | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s |
---|---|---|---|---|---|---|---|---|
I32 | UNIQUE | 800 | 66784x | 10.678 us | 47.51% | 7.487 us | 18.63% | 106.854M |
I32 | UNIQUE | 2000 | 64320x | 10.932 us | 44.70% | 7.774 us | 16.48% | 257.269M |
I32 | UNIQUE | 8000 | 62640x | 11.161 us | 44.79% | 7.982 us | 18.55% | 1.002G |
I32 | UNIQUE | 80000 | 35520x | 17.181 us | 26.87% | 14.080 us | 14.23% | 5.682G |
I32 | UNIQUE | 800000 | 7552x | 69.285 us | 5.07% | 66.239 us | 2.00% | 12.077G |
I32 | UNIQUE | 8000000 | 2016x | 622.167 us | 1.21% | 619.066 us | 1.09% | 12.923G |
I32 | UNIQUE | 80000000 | 848x | 6.216 ms | 0.82% | 6.213 ms | 0.82% | 12.876G |
I64 | UNIQUE | 800 | 64640x | 10.884 us | 46.03% | 7.736 us | 19.71% | 103.415M |
I64 | UNIQUE | 2000 | 63936x | 11.054 us | 45.48% | 7.822 us | 16.70% | 255.690M |
I64 | UNIQUE | 8000 | 59792x | 11.565 us | 42.66% | 8.363 us | 16.62% | 956.566M |
I64 | UNIQUE | 80000 | 33040x | 18.327 us | 26.25% | 15.140 us | 14.08% | 5.284G |
I64 | UNIQUE | 800000 | 7152x | 74.976 us | 5.02% | 71.897 us | 2.42% | 11.127G |
I64 | UNIQUE | 8000000 | 2016x | 671.884 us | 1.34% | 668.698 us | 1.24% | 11.964G |
I64 | UNIQUE | 80000000 | 1104x | 6.662 ms | 0.86% | 6.659 ms | 0.86% | 12.014G |
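To put the gap in numbers, here is a minimal sketch that computes how much slower the shared memory variant is. It uses the GPU Time values copied from the smallest and largest rows of the two tables above; the helper name `slowdown` and the dictionary layout are hypothetical and only serve this illustration.

```python
# GPU times in microseconds, copied from the tables above.
# Key: (key type, NumOutputs) -> (shared memory time, global memory time)
gpu_time_us = {
    ("I32", 800): (12.599, 7.487),
    ("I32", 80_000_000): (29_927.0, 6_213.0),   # 29.927 ms vs 6.213 ms
    ("I64", 800): (14.278, 7.736),
    ("I64", 80_000_000): (35_941.0, 6_659.0),   # 35.941 ms vs 6.659 ms
}


def slowdown(key_type, num_outputs):
    """Return shared memory GPU time divided by global memory GPU time."""
    shared_us, global_us = gpu_time_us[(key_type, num_outputs)]
    return shared_us / global_us


for (key_type, num_outputs) in gpu_time_us:
    ratio = slowdown(key_type, num_outputs)
    print(f"{key_type} @ {num_outputs}: shared memory is {ratio:.2f}x slower")
```

On these numbers the shared memory variant is roughly 1.7x to 1.8x slower at 800 outputs and roughly 4.8x to 5.4x slower at 80000000 outputs, so the gap widens as the number of queries grows.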
How to Reproduce
Additional context
Note: shared memory `find` has the same performance issue, so this is likely a common problem for query operations.
The quick takeaway from this experiment is that we should not use a shared memory hash table for query operations. Further investigation is needed to understand where the performance regression comes from.
NCU profiling results