Batched `cupy.sum` on short reduction axes is slow
asi1024 opened this issue
Description
Using `cub::SegmentedReduce::Sum` to compute a batched reduce-sum over short reduction axes seems to degrade performance (cc: @leofang).
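For readers unfamiliar with segmented reduction, here is a minimal CPU-only sketch of its semantics. It is not CUB's implementation. A batched sum over axis 1 of a C-contiguous `(rows, cols)` array can be viewed as a segmented sum over the flattened data, with one segment per row. The names `row_offsets` and `segmented_sum` are hypothetical and exist only for this illustration.

```python
def row_offsets(rows, cols):
    """Segment boundaries for a C-contiguous (rows, cols) array: rows + 1 offsets."""
    return [r * cols for r in range(rows + 1)]


def segmented_sum(flat, offsets):
    """Sum flat[offsets[k]:offsets[k + 1]] for each segment k."""
    return [sum(flat[offsets[k]:offsets[k + 1]]) for k in range(len(offsets) - 1)]


# Same element count, two shapes in the spirit of the benchmark (scaled down).
flat = list(range(16))
long_rows = segmented_sum(flat, row_offsets(2, 8))   # 2 segments of length 8
short_rows = segmented_sum(flat, row_offsets(8, 2))  # 8 segments of length 2
print(long_rows)
print(short_rows)
```

For a fixed number of elements, the segment count grows as the reduction axis shrinks, so any fixed per-segment cost is paid more often. Whether that is the actual cause of the slowdown reported here is an assumption, not something this sketch establishes.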
To Reproduce
Script:
import cupy
from cupy import testing
from cupyx.profiler import benchmark
from cupy._core import _accelerator

n = 24
for i in range(n + 1):
    # Keep the total element count fixed at 2 ** n while varying the row length.
    shape = (2 ** i, 2 ** (n - i))
    x = testing.shaped_random(shape, xp=cupy, dtype=cupy.float32)
    f = lambda: x.sum(axis=1)
    # Compare the CUB-accelerated path against the default reduction kernel.
    for acc in (["cub"], []):
        name = f'{shape=}, {acc=}'.ljust(32)
        _accelerator.set_routine_accelerators(acc)
        _accelerator.set_reduction_accelerators(acc)
        perf = benchmark(f, (), n_warmup=1, n_repeat=20, name=name)
        print(perf)
Result:
shape=(1, 16777216), acc=['cub']: CPU: 26.774 us +/- 15.314 (min: 20.986 / max: 92.476) us GPU-0: 2429.798 us +/- 14.812 (min: 2419.712 / max: 2492.416) us
shape=(1, 16777216), acc=[] : CPU: 29.575 us +/- 12.893 (min: 25.245 / max: 85.183) us GPU-0: 15859.610 us +/- 25.041 (min: 15814.656 / max: 15921.152) us
shape=(2, 8388608), acc=['cub'] : CPU: 22.423 us +/- 2.995 (min: 20.244 / max: 31.022) us GPU-0: 1226.445 us +/- 3.578 (min: 1221.632 / max: 1235.968) us
shape=(2, 8388608), acc=[] : CPU: 27.899 us +/- 9.635 (min: 24.382 / max: 69.227) us GPU-0: 8306.074 us +/- 9.421 (min: 8291.328 / max: 8338.432) us
shape=(4, 4194304), acc=['cub'] : CPU: 21.922 us +/- 2.638 (min: 20.004 / max: 29.159) us GPU-0: 626.893 us +/- 2.834 (min: 623.616 / max: 634.880) us
shape=(4, 4194304), acc=[] : CPU: 25.847 us +/- 2.671 (min: 24.650 / max: 36.779) us GPU-0: 4175.360 us +/- 4.562 (min: 4166.656 / max: 4188.160) us
shape=(8, 2097152), acc=['cub'] : CPU: 38.684 us +/- 76.913 (min: 19.650 / max: 373.826) us GPU-0: 348.774 us +/- 77.878 (min: 327.680 / max: 688.128) us
shape=(8, 2097152), acc=[] : CPU: 25.308 us +/- 2.838 (min: 23.652 / max: 36.125) us GPU-0: 2139.699 us +/- 5.365 (min: 2128.896 / max: 2151.424) us
shape=(16, 1048576), acc=['cub']: CPU: 20.951 us +/- 1.941 (min: 19.680 / max: 28.707) us GPU-0: 182.016 us +/- 1.993 (min: 180.224 / max: 189.440) us
shape=(16, 1048576), acc=[] : CPU: 25.276 us +/- 2.068 (min: 23.885 / max: 33.285) us GPU-0: 1096.909 us +/- 4.116 (min: 1090.560 / max: 1108.992) us
shape=(32, 524288), acc=['cub'] : CPU: 20.958 us +/- 1.712 (min: 19.784 / max: 27.497) us GPU-0: 109.619 us +/- 2.476 (min: 106.496 / max: 117.760) us
shape=(32, 524288), acc=[] : CPU: 25.534 us +/- 2.463 (min: 23.556 / max: 34.311) us GPU-0: 563.558 us +/- 3.876 (min: 558.080 / max: 577.536) us
shape=(64, 262144), acc=['cub'] : CPU: 22.116 us +/- 3.914 (min: 19.856 / max: 36.777) us GPU-0: 78.694 us +/- 3.822 (min: 74.752 / max: 92.160) us
shape=(64, 262144), acc=[] : CPU: 25.496 us +/- 2.592 (min: 23.864 / max: 35.152) us GPU-0: 295.373 us +/- 3.097 (min: 291.840 / max: 306.176) us
shape=(128, 131072), acc=['cub']: CPU: 20.917 us +/- 1.745 (min: 19.735 / max: 27.680) us GPU-0: 69.786 us +/- 2.029 (min: 68.608 / max: 77.824) us
shape=(128, 131072), acc=[] : CPU: 25.511 us +/- 2.386 (min: 23.795 / max: 34.476) us GPU-0: 174.541 us +/- 2.518 (min: 172.032 / max: 184.320) us
shape=(256, 65536), acc=['cub'] : CPU: 21.537 us +/- 1.787 (min: 20.550 / max: 28.584) us GPU-0: 68.045 us +/- 1.874 (min: 66.560 / max: 75.776) us
shape=(256, 65536), acc=[] : CPU: 26.720 us +/- 2.939 (min: 24.254 / max: 36.635) us GPU-0: 111.821 us +/- 2.978 (min: 108.544 / max: 121.856) us
shape=(512, 32768), acc=['cub'] : CPU: 21.592 us +/- 1.662 (min: 20.574 / max: 28.206) us GPU-0: 68.250 us +/- 1.840 (min: 66.560 / max: 75.776) us
shape=(512, 32768), acc=[] : CPU: 27.014 us +/- 2.458 (min: 24.568 / max: 35.970) us GPU-0: 111.155 us +/- 1.956 (min: 108.544 / max: 117.760) us
shape=(1024, 16384), acc=['cub']: CPU: 21.647 us +/- 1.748 (min: 20.584 / max: 28.635) us GPU-0: 69.069 us +/- 2.135 (min: 67.584 / max: 77.824) us
shape=(1024, 16384), acc=[] : CPU: 27.259 us +/- 3.538 (min: 24.602 / max: 37.223) us GPU-0: 96.205 us +/- 3.598 (min: 93.184 / max: 106.496) us
shape=(2048, 8192), acc=['cub'] : CPU: 21.751 us +/- 1.749 (min: 20.611 / max: 28.420) us GPU-0: 68.403 us +/- 1.820 (min: 66.560 / max: 75.776) us
shape=(2048, 8192), acc=[] : CPU: 26.338 us +/- 2.323 (min: 24.458 / max: 35.013) us GPU-0: 91.494 us +/- 2.319 (min: 89.088 / max: 100.352) us
shape=(4096, 4096), acc=['cub'] : CPU: 21.637 us +/- 1.823 (min: 20.432 / max: 28.816) us GPU-0: 68.403 us +/- 1.606 (min: 66.560 / max: 74.752) us
shape=(4096, 4096), acc=[] : CPU: 26.932 us +/- 2.779 (min: 24.665 / max: 36.798) us GPU-0: 97.434 us +/- 2.676 (min: 95.232 / max: 107.520) us
shape=(8192, 2048), acc=['cub'] : CPU: 21.714 us +/- 2.099 (min: 20.487 / max: 29.103) us GPU-0: 69.222 us +/- 2.158 (min: 67.584 / max: 77.824) us
shape=(8192, 2048), acc=[] : CPU: 26.533 us +/- 2.480 (min: 24.702 / max: 35.654) us GPU-0: 108.134 us +/- 2.517 (min: 106.496 / max: 117.760) us
shape=(16384, 1024), acc=['cub']: CPU: 21.689 us +/- 1.934 (min: 20.536 / max: 29.293) us GPU-0: 77.158 us +/- 1.950 (min: 75.776 / max: 84.992) us
shape=(16384, 1024), acc=[] : CPU: 27.076 us +/- 3.201 (min: 24.885 / max: 36.887) us GPU-0: 139.622 us +/- 3.177 (min: 137.216 / max: 149.504) us
shape=(32768, 512), acc=['cub'] : CPU: 21.522 us +/- 1.818 (min: 20.387 / max: 28.492) us GPU-0: 76.698 us +/- 2.170 (min: 74.752 / max: 84.992) us
shape=(32768, 512), acc=[] : CPU: 26.370 us +/- 2.306 (min: 24.647 / max: 34.772) us GPU-0: 213.043 us +/- 2.368 (min: 210.944 / max: 222.208) us
shape=(65536, 256), acc=['cub'] : CPU: 39.122 us +/- 76.225 (min: 20.477 / max: 371.281) us GPU-0: 128.051 us +/- 77.065 (min: 108.544 / max: 463.872) us
shape=(65536, 256), acc=[] : CPU: 26.882 us +/- 2.808 (min: 24.816 / max: 36.989) us GPU-0: 202.240 us +/- 2.660 (min: 199.680 / max: 211.968) us
shape=(131072, 128), acc=['cub']: CPU: 21.769 us +/- 2.015 (min: 20.487 / max: 29.721) us GPU-0: 195.789 us +/- 2.234 (min: 194.560 / max: 204.800) us
shape=(131072, 128), acc=[] : CPU: 27.187 us +/- 3.214 (min: 24.456 / max: 37.588) us GPU-0: 194.970 us +/- 3.096 (min: 192.512 / max: 206.848) us
shape=(262144, 64), acc=['cub'] : CPU: 21.995 us +/- 1.813 (min: 20.775 / max: 28.899) us GPU-0: 361.882 us +/- 2.007 (min: 360.448 / max: 369.664) us
shape=(262144, 64), acc=[] : CPU: 28.027 us +/- 5.230 (min: 24.513 / max: 47.760) us GPU-0: 194.611 us +/- 3.373 (min: 191.488 / max: 203.776) us
shape=(524288, 32), acc=['cub'] : CPU: 22.112 us +/- 1.843 (min: 20.612 / max: 28.637) us GPU-0: 688.538 us +/- 2.084 (min: 687.104 / max: 696.320) us
shape=(524288, 32), acc=[] : CPU: 28.239 us +/- 6.005 (min: 24.802 / max: 52.128) us GPU-0: 208.589 us +/- 3.860 (min: 205.824 / max: 221.184) us
shape=(1048576, 16), acc=['cub']: CPU: 22.190 us +/- 1.817 (min: 20.727 / max: 28.973) us GPU-0: 1351.373 us +/- 2.150 (min: 1349.632 / max: 1359.872) us
shape=(1048576, 16), acc=[] : CPU: 26.513 us +/- 2.789 (min: 24.741 / max: 36.449) us GPU-0: 197.530 us +/- 2.746 (min: 195.584 / max: 206.848) us
shape=(2097152, 8), acc=['cub'] : CPU: 22.394 us +/- 1.879 (min: 20.600 / max: 29.345) us GPU-0: 2661.018 us +/- 6.646 (min: 2652.160 / max: 2675.712) us
shape=(2097152, 8), acc=[] : CPU: 26.793 us +/- 2.864 (min: 24.771 / max: 36.733) us GPU-0: 161.946 us +/- 2.883 (min: 159.744 / max: 172.032) us
shape=(4194304, 4), acc=['cub'] : CPU: 23.579 us +/- 2.721 (min: 21.729 / max: 32.016) us GPU-0: 5252.250 us +/- 8.810 (min: 5236.736 / max: 5270.528) us
shape=(4194304, 4), acc=[] : CPU: 26.353 us +/- 2.672 (min: 24.312 / max: 35.980) us GPU-0: 144.640 us +/- 2.687 (min: 142.336 / max: 154.624) us
shape=(8388608, 2), acc=['cub'] : CPU: 23.547 us +/- 2.921 (min: 21.738 / max: 35.030) us GPU-0: 10224.179 us +/- 18.385 (min: 10207.232 / max: 10266.624) us
shape=(8388608, 2), acc=[] : CPU: 27.276 us +/- 3.147 (min: 24.392 / max: 38.883) us GPU-0: 138.445 us +/- 3.232 (min: 135.168 / max: 150.528) us
shape=(16777216, 1), acc=['cub']: CPU: 23.375 us +/- 3.250 (min: 21.518 / max: 33.582) us GPU-0: 20320.819 us +/- 34.711 (min: 20283.392 / max: 20376.575) us
shape=(16777216, 1), acc=[] : CPU: 28.918 us +/- 5.404 (min: 25.087 / max: 46.885) us GPU-0: 137.882 us +/- 5.254 (min: 134.144 / max: 155.648) us
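To make the trend easier to read, here is a small sketch that parses lines in the format above and reports the ratio of CUB GPU time to default GPU time per shape. The helper names `parse` and `slowdown` and the regular expression are illustrative assumptions. The sample contains three pairs of lines copied verbatim from the results.

```python
import re

# Three pairs of lines copied from the results above.
SAMPLE = """\
shape=(1, 16777216), acc=['cub']: CPU: 26.774 us +/- 15.314 (min: 20.986 / max: 92.476) us GPU-0: 2429.798 us +/- 14.812 (min: 2419.712 / max: 2492.416) us
shape=(1, 16777216), acc=[] : CPU: 29.575 us +/- 12.893 (min: 25.245 / max: 85.183) us GPU-0: 15859.610 us +/- 25.041 (min: 15814.656 / max: 15921.152) us
shape=(131072, 128), acc=['cub']: CPU: 21.769 us +/- 2.015 (min: 20.487 / max: 29.721) us GPU-0: 195.789 us +/- 2.234 (min: 194.560 / max: 204.800) us
shape=(131072, 128), acc=[] : CPU: 27.187 us +/- 3.214 (min: 24.456 / max: 37.588) us GPU-0: 194.970 us +/- 3.096 (min: 192.512 / max: 206.848) us
shape=(16777216, 1), acc=['cub']: CPU: 23.375 us +/- 3.250 (min: 21.518 / max: 33.582) us GPU-0: 20320.819 us +/- 34.711 (min: 20283.392 / max: 20376.575) us
shape=(16777216, 1), acc=[] : CPU: 28.918 us +/- 5.404 (min: 25.087 / max: 46.885) us GPU-0: 137.882 us +/- 5.254 (min: 134.144 / max: 155.648) us
"""

# Captures rows, cols, the accelerator list, and the mean GPU time in microseconds.
LINE_RE = re.compile(
    r"shape=\((\d+), (\d+)\), acc=(\[.*?\])\s*:\s*CPU:.*?GPU-0:\s*([\d.]+) us"
)


def parse(text):
    """Map (rows, cols) -> {'cub': gpu_us, 'default': gpu_us}."""
    table = {}
    for m in LINE_RE.finditer(text):
        shape = (int(m.group(1)), int(m.group(2)))
        key = "cub" if "cub" in m.group(3) else "default"
        table.setdefault(shape, {})[key] = float(m.group(4))
    return table


def slowdown(table):
    """CUB GPU time divided by default GPU time per shape (> 1 means CUB is slower)."""
    return {s: t["cub"] / t["default"] for s, t in table.items() if len(t) == 2}


for shape, ratio in slowdown(parse(SAMPLE)).items():
    print(shape, f"{ratio:.2f}x")
```

On these numbers, CUB is about 6.5 times faster for a single row of 16777216 elements, roughly even at a reduction length of 128, and about 147 times slower at a reduction length of 1. The crossover near 128 is specific to this A100 run and may differ on other hardware.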
Installation
None
Environment
A100
Additional Information
No response