[BUG]: Performance regression with shared memory query operations

Question

PointKernel opened this issue 3 months ago · comments

Yunsong Wang commented 3 months ago

Is this a duplicate?

I confirmed there appear to be no duplicate issues for this bug (https://github.com/NVIDIA/cuCollections/issues)

Type of Bug

Performance

Describe the bug

When the hash table is small enough, one can manually load it into shared memory and then query it. The performance in this case should be comparable to that of a global memory hash table implicitly cached in L1. However, the shared memory contains query delivers much worse performance than the global memory one in this case:

shared_memory (block_size = 1024):

Key	Distribution	NumOutputs	Samples	CPU Time	Noise	GPU Time	Noise	Elem/s
I32	UNIQUE	800	39696x	15.710 us	28.65%	12.599 us	13.00%	63.495M
I32	UNIQUE	2000	32816x	18.259 us	22.74%	15.241 us	10.69%	131.221M
I32	UNIQUE	8000	31920x	18.759 us	21.58%	15.670 us	8.26%	510.539M
I32	UNIQUE	80000	11728x	45.662 us	7.73%	42.656 us	2.94%	1.875G
I32	UNIQUE	800000	1604x	314.924 us	1.05%	311.898 us	0.38%	2.565G
I32	UNIQUE	8000000	167x	3.006 ms	0.19%	3.003 ms	0.15%	2.664G
I32	UNIQUE	80000000	17x	29.929 ms	0.01%	29.927 ms	0.00%	2.673G
I64	UNIQUE	800	35024x	17.337 us	23.02%	14.278 us	7.82%	56.031M
I64	UNIQUE	2000	30192x	19.622 us	19.87%	16.563 us	6.59%	120.751M
I64	UNIQUE	8000	29520x	20.023 us	20.06%	16.942 us	7.51%	472.210M
I64	UNIQUE	80000	10080x	52.711 us	6.82%	49.652 us	2.65%	1.611G
I64	UNIQUE	800000	1403x	380.149 us	0.96%	377.119 us	0.50%	2.121G
I64	UNIQUE	8000000	138x	3.632 ms	0.11%	3.629 ms	0.03%	2.205G
I64	UNIQUE	80000000	14x	35.944 ms	0.01%	35.941 ms	0.00%	2.226G

global memory:

Key	Distribution	NumOutputs	Samples	CPU Time	Noise	GPU Time	Noise	Elem/s
I32	UNIQUE	800	66784x	10.678 us	47.51%	7.487 us	18.63%	106.854M
I32	UNIQUE	2000	64320x	10.932 us	44.70%	7.774 us	16.48%	257.269M
I32	UNIQUE	8000	62640x	11.161 us	44.79%	7.982 us	18.55%	1.002G
I32	UNIQUE	80000	35520x	17.181 us	26.87%	14.080 us	14.23%	5.682G
I32	UNIQUE	800000	7552x	69.285 us	5.07%	66.239 us	2.00%	12.077G
I32	UNIQUE	8000000	2016x	622.167 us	1.21%	619.066 us	1.09%	12.923G
I32	UNIQUE	80000000	848x	6.216 ms	0.82%	6.213 ms	0.82%	12.876G
I64	UNIQUE	800	64640x	10.884 us	46.03%	7.736 us	19.71%	103.415M
I64	UNIQUE	2000	63936x	11.054 us	45.48%	7.822 us	16.70%	255.690M
I64	UNIQUE	8000	59792x	11.565 us	42.66%	8.363 us	16.62%	956.566M
I64	UNIQUE	80000	33040x	18.327 us	26.25%	15.140 us	14.08%	5.284G
I64	UNIQUE	800000	7152x	74.976 us	5.02%	71.897 us	2.42%	11.127G
I64	UNIQUE	8000000	2016x	671.884 us	1.34%	668.698 us	1.24%	11.964G
I64	UNIQUE	80000000	1104x	6.662 ms	0.86%	6.659 ms	0.86%	12.014G

How to Reproduce

#458
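
For readers without #458 at hand, the pattern being compared can be sketched roughly as follows. This is a minimal illustration only, not the benchmark code and not the cuCollections API: the linear probing table, the multiplicative hash, and every name in it (slot_of, contains, contains_shared, contains_global, kCapacity, kEmpty) are assumptions made for the example. It only checks that both variants return the same answers; it does not time them.

```cuda
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// All names below are hypothetical; this is not the cuCollections API.
constexpr int32_t kEmpty    = -1;    // sentinel marking an empty slot
constexpr int     kCapacity = 2048;  // 2048 int32 slots = 8 KiB, fits in shared memory
constexpr int     kBlock    = 1024;  // same block size as the shared memory runs above

// Simple multiplicative hash mapped onto the slot range (an assumption).
__host__ __device__ inline uint32_t slot_of(int32_t key) {
  return (static_cast<uint32_t>(key) * 2654435761u) % kCapacity;
}

// Linear probing lookup; `slots` may point to global or shared memory.
__device__ bool contains(const int32_t* slots, int32_t key) {
  uint32_t i = slot_of(key);
  for (int probes = 0; probes < kCapacity; ++probes) {
    const int32_t s = slots[i];
    if (s == key) return true;
    if (s == kEmpty) return false;
    i = (i + 1) % kCapacity;
  }
  return false;
}

// Variant 1: each block first copies the whole table into shared memory.
__global__ void contains_shared(const int32_t* table, const int32_t* queries, int* out, int n) {
  __shared__ int32_t smem[kCapacity];
  for (int i = threadIdx.x; i < kCapacity; i += blockDim.x) smem[i] = table[i];
  __syncthreads();  // the table must be fully loaded before any thread probes it
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
    out[i] = contains(smem, queries[i]);
}

// Variant 2: probe the global memory table directly and rely on caching.
__global__ void contains_global(const int32_t* table, const int32_t* queries, int* out, int n) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
    out[i] = contains(table, queries[i]);
}

int main() {
  // Build the table on the host: keys 0..799 with linear probing.
  const int num_keys = 800;
  std::vector<int32_t> table(kCapacity, kEmpty);
  for (int32_t k = 0; k < num_keys; ++k) {
    uint32_t i = slot_of(k);
    while (table[i] != kEmpty) i = (i + 1) % kCapacity;
    table[i] = k;
  }
  // Query keys 0..1599: the first half is present, the second half is not.
  const int n = 2 * num_keys;
  std::vector<int32_t> queries(n);
  for (int i = 0; i < n; ++i) queries[i] = i;

  int32_t *d_table, *d_queries;
  int *d_out_s, *d_out_g;
  cudaMalloc(&d_table, kCapacity * sizeof(int32_t));
  cudaMalloc(&d_queries, n * sizeof(int32_t));
  cudaMalloc(&d_out_s, n * sizeof(int));
  cudaMalloc(&d_out_g, n * sizeof(int));
  cudaMemcpy(d_table, table.data(), kCapacity * sizeof(int32_t), cudaMemcpyHostToDevice);
  cudaMemcpy(d_queries, queries.data(), n * sizeof(int32_t), cudaMemcpyHostToDevice);

  const int grid = (n + kBlock - 1) / kBlock;
  contains_shared<<<grid, kBlock>>>(d_table, d_queries, d_out_s, n);
  contains_global<<<grid, kBlock>>>(d_table, d_queries, d_out_g, n);

  // Blocking copies; they also wait for the kernels above to finish.
  std::vector<int> out_s(n), out_g(n);
  cudaMemcpy(out_s.data(), d_out_s, n * sizeof(int), cudaMemcpyDeviceToHost);
  cudaMemcpy(out_g.data(), d_out_g, n * sizeof(int), cudaMemcpyDeviceToHost);

  int mismatches = 0;
  for (int i = 0; i < n; ++i) {
    const int expected = i < num_keys ? 1 : 0;
    if (out_s[i] != expected || out_g[i] != expected) ++mismatches;
  }
  std::printf("mismatches: %d\n", mismatches);

  cudaFree(d_table); cudaFree(d_queries); cudaFree(d_out_s); cudaFree(d_out_g);
  return mismatches != 0;
}
```

In this sketch the two kernels differ only in the pointer passed to contains, which makes it a convenient starting point for timing one against the other. The numbers reported above come from the benchmark in #458, not from this code.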

Additional context

Note: shared memory find has the same performance issue, so this is likely a common problem for query operations.

The quick takeaway of this experiment is that we should not use a shared memory hash table for query operations. Further investigation is needed to understand where the performance regression comes from.

Yunsong Wang · Answer 1 · Sat Apr 20 2024 05:24:55 GMT+0800 (China Standard Time)

contains_profiling.zip

NCU profiling results