cupy / cupy

NumPy & SciPy for GPU

Home Page: https://cupy.dev

Batched `cupy.sum` on short reduction axes is slow

asi1024 opened this issue

Description

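Using cub::DeviceSegmentedReduce::Sum to compute a batched reduce-sum over a short reduction axis seems to degrade performance compared with the default reduction path (cc: @leofang). In the results below, the CUB path is faster for reduction lengths of 256 and above, on par at 128, and slower from 64 down, by roughly 147x at shape (16777216, 1).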
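
For readers unfamiliar with the term, a segmented reduction sums each contiguous segment of a flat buffer independently, with the segments described by begin and end offsets. The pure-Python sketch below only illustrates those semantics for a C-contiguous 2-D array reduced along axis 1. `row_offsets` and `segmented_sum` are hypothetical helper names, and nothing here reflects how CUB or CuPy actually schedules the work on the GPU.

```python
def row_offsets(n_rows, n_cols):
    """Begin/end offsets of each row in a flattened C-contiguous (n_rows, n_cols) array."""
    begins = [r * n_cols for r in range(n_rows)]
    ends = [b + n_cols for b in begins]
    return begins, ends


def segmented_sum(flat, begins, ends):
    """Sum each segment flat[begin:end] independently (one output per segment)."""
    return [sum(flat[b:e]) for b, e in zip(begins, ends)]


# Shape (4, 2): many short segments, the regime this issue is about.
flat = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
begins, ends = row_offsets(4, 2)
print(segmented_sum(flat, begins, ends))  # [3.0, 7.0, 11.0, 15.0]
```

For shape (16777216, 1) this amounts to 16777216 segments of one element each, so any fixed per-segment cost would dominate. Whether that is the actual cause of the slowdown is an assumption, not something the measurements establish.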

To Reproduce

Script:

import cupy
from cupy import testing
from cupy._core import _accelerator
from cupyx.profiler import benchmark

# The total number of elements is fixed at 2 ** n; only the split between
# the batch axis (axis 0) and the reduction axis (axis 1) varies.
n = 24

for i in range(n + 1):
    shape = (2 ** i, 2 ** (n - i))
    x = testing.shaped_random(shape, xp=cupy, dtype=cupy.float32)

    def f():
        return x.sum(axis=1)

    # Compare the CUB-accelerated path with CuPy's default reduction path.
    for acc in (["cub"], []):
        name = f'{shape=}, {acc=}'.ljust(32)
        _accelerator.set_routine_accelerators(acc)
        _accelerator.set_reduction_accelerators(acc)
        perf = benchmark(f, (), n_warmup=1, n_repeat=20, name=name)
        print(perf)

Result:

shape=(1, 16777216), acc=['cub']:    CPU:    26.774 us   +/- 15.314 (min:    20.986 / max:    92.476) us     GPU-0:  2429.798 us   +/- 14.812 (min:  2419.712 / max:  2492.416) us
shape=(1, 16777216), acc=[]     :    CPU:    29.575 us   +/- 12.893 (min:    25.245 / max:    85.183) us     GPU-0: 15859.610 us   +/- 25.041 (min: 15814.656 / max: 15921.152) us
shape=(2, 8388608), acc=['cub'] :    CPU:    22.423 us   +/-  2.995 (min:    20.244 / max:    31.022) us     GPU-0:  1226.445 us   +/-  3.578 (min:  1221.632 / max:  1235.968) us
shape=(2, 8388608), acc=[]      :    CPU:    27.899 us   +/-  9.635 (min:    24.382 / max:    69.227) us     GPU-0:  8306.074 us   +/-  9.421 (min:  8291.328 / max:  8338.432) us
shape=(4, 4194304), acc=['cub'] :    CPU:    21.922 us   +/-  2.638 (min:    20.004 / max:    29.159) us     GPU-0:   626.893 us   +/-  2.834 (min:   623.616 / max:   634.880) us
shape=(4, 4194304), acc=[]      :    CPU:    25.847 us   +/-  2.671 (min:    24.650 / max:    36.779) us     GPU-0:  4175.360 us   +/-  4.562 (min:  4166.656 / max:  4188.160) us
shape=(8, 2097152), acc=['cub'] :    CPU:    38.684 us   +/- 76.913 (min:    19.650 / max:   373.826) us     GPU-0:   348.774 us   +/- 77.878 (min:   327.680 / max:   688.128) us
shape=(8, 2097152), acc=[]      :    CPU:    25.308 us   +/-  2.838 (min:    23.652 / max:    36.125) us     GPU-0:  2139.699 us   +/-  5.365 (min:  2128.896 / max:  2151.424) us
shape=(16, 1048576), acc=['cub']:    CPU:    20.951 us   +/-  1.941 (min:    19.680 / max:    28.707) us     GPU-0:   182.016 us   +/-  1.993 (min:   180.224 / max:   189.440) us
shape=(16, 1048576), acc=[]     :    CPU:    25.276 us   +/-  2.068 (min:    23.885 / max:    33.285) us     GPU-0:  1096.909 us   +/-  4.116 (min:  1090.560 / max:  1108.992) us
shape=(32, 524288), acc=['cub'] :    CPU:    20.958 us   +/-  1.712 (min:    19.784 / max:    27.497) us     GPU-0:   109.619 us   +/-  2.476 (min:   106.496 / max:   117.760) us
shape=(32, 524288), acc=[]      :    CPU:    25.534 us   +/-  2.463 (min:    23.556 / max:    34.311) us     GPU-0:   563.558 us   +/-  3.876 (min:   558.080 / max:   577.536) us
shape=(64, 262144), acc=['cub'] :    CPU:    22.116 us   +/-  3.914 (min:    19.856 / max:    36.777) us     GPU-0:    78.694 us   +/-  3.822 (min:    74.752 / max:    92.160) us
shape=(64, 262144), acc=[]      :    CPU:    25.496 us   +/-  2.592 (min:    23.864 / max:    35.152) us     GPU-0:   295.373 us   +/-  3.097 (min:   291.840 / max:   306.176) us
shape=(128, 131072), acc=['cub']:    CPU:    20.917 us   +/-  1.745 (min:    19.735 / max:    27.680) us     GPU-0:    69.786 us   +/-  2.029 (min:    68.608 / max:    77.824) us
shape=(128, 131072), acc=[]     :    CPU:    25.511 us   +/-  2.386 (min:    23.795 / max:    34.476) us     GPU-0:   174.541 us   +/-  2.518 (min:   172.032 / max:   184.320) us
shape=(256, 65536), acc=['cub'] :    CPU:    21.537 us   +/-  1.787 (min:    20.550 / max:    28.584) us     GPU-0:    68.045 us   +/-  1.874 (min:    66.560 / max:    75.776) us
shape=(256, 65536), acc=[]      :    CPU:    26.720 us   +/-  2.939 (min:    24.254 / max:    36.635) us     GPU-0:   111.821 us   +/-  2.978 (min:   108.544 / max:   121.856) us
shape=(512, 32768), acc=['cub'] :    CPU:    21.592 us   +/-  1.662 (min:    20.574 / max:    28.206) us     GPU-0:    68.250 us   +/-  1.840 (min:    66.560 / max:    75.776) us
shape=(512, 32768), acc=[]      :    CPU:    27.014 us   +/-  2.458 (min:    24.568 / max:    35.970) us     GPU-0:   111.155 us   +/-  1.956 (min:   108.544 / max:   117.760) us
shape=(1024, 16384), acc=['cub']:    CPU:    21.647 us   +/-  1.748 (min:    20.584 / max:    28.635) us     GPU-0:    69.069 us   +/-  2.135 (min:    67.584 / max:    77.824) us
shape=(1024, 16384), acc=[]     :    CPU:    27.259 us   +/-  3.538 (min:    24.602 / max:    37.223) us     GPU-0:    96.205 us   +/-  3.598 (min:    93.184 / max:   106.496) us
shape=(2048, 8192), acc=['cub'] :    CPU:    21.751 us   +/-  1.749 (min:    20.611 / max:    28.420) us     GPU-0:    68.403 us   +/-  1.820 (min:    66.560 / max:    75.776) us
shape=(2048, 8192), acc=[]      :    CPU:    26.338 us   +/-  2.323 (min:    24.458 / max:    35.013) us     GPU-0:    91.494 us   +/-  2.319 (min:    89.088 / max:   100.352) us
shape=(4096, 4096), acc=['cub'] :    CPU:    21.637 us   +/-  1.823 (min:    20.432 / max:    28.816) us     GPU-0:    68.403 us   +/-  1.606 (min:    66.560 / max:    74.752) us
shape=(4096, 4096), acc=[]      :    CPU:    26.932 us   +/-  2.779 (min:    24.665 / max:    36.798) us     GPU-0:    97.434 us   +/-  2.676 (min:    95.232 / max:   107.520) us
shape=(8192, 2048), acc=['cub'] :    CPU:    21.714 us   +/-  2.099 (min:    20.487 / max:    29.103) us     GPU-0:    69.222 us   +/-  2.158 (min:    67.584 / max:    77.824) us
shape=(8192, 2048), acc=[]      :    CPU:    26.533 us   +/-  2.480 (min:    24.702 / max:    35.654) us     GPU-0:   108.134 us   +/-  2.517 (min:   106.496 / max:   117.760) us
shape=(16384, 1024), acc=['cub']:    CPU:    21.689 us   +/-  1.934 (min:    20.536 / max:    29.293) us     GPU-0:    77.158 us   +/-  1.950 (min:    75.776 / max:    84.992) us
shape=(16384, 1024), acc=[]     :    CPU:    27.076 us   +/-  3.201 (min:    24.885 / max:    36.887) us     GPU-0:   139.622 us   +/-  3.177 (min:   137.216 / max:   149.504) us
shape=(32768, 512), acc=['cub'] :    CPU:    21.522 us   +/-  1.818 (min:    20.387 / max:    28.492) us     GPU-0:    76.698 us   +/-  2.170 (min:    74.752 / max:    84.992) us
shape=(32768, 512), acc=[]      :    CPU:    26.370 us   +/-  2.306 (min:    24.647 / max:    34.772) us     GPU-0:   213.043 us   +/-  2.368 (min:   210.944 / max:   222.208) us
shape=(65536, 256), acc=['cub'] :    CPU:    39.122 us   +/- 76.225 (min:    20.477 / max:   371.281) us     GPU-0:   128.051 us   +/- 77.065 (min:   108.544 / max:   463.872) us
shape=(65536, 256), acc=[]      :    CPU:    26.882 us   +/-  2.808 (min:    24.816 / max:    36.989) us     GPU-0:   202.240 us   +/-  2.660 (min:   199.680 / max:   211.968) us
shape=(131072, 128), acc=['cub']:    CPU:    21.769 us   +/-  2.015 (min:    20.487 / max:    29.721) us     GPU-0:   195.789 us   +/-  2.234 (min:   194.560 / max:   204.800) us
shape=(131072, 128), acc=[]     :    CPU:    27.187 us   +/-  3.214 (min:    24.456 / max:    37.588) us     GPU-0:   194.970 us   +/-  3.096 (min:   192.512 / max:   206.848) us
shape=(262144, 64), acc=['cub'] :    CPU:    21.995 us   +/-  1.813 (min:    20.775 / max:    28.899) us     GPU-0:   361.882 us   +/-  2.007 (min:   360.448 / max:   369.664) us
shape=(262144, 64), acc=[]      :    CPU:    28.027 us   +/-  5.230 (min:    24.513 / max:    47.760) us     GPU-0:   194.611 us   +/-  3.373 (min:   191.488 / max:   203.776) us
shape=(524288, 32), acc=['cub'] :    CPU:    22.112 us   +/-  1.843 (min:    20.612 / max:    28.637) us     GPU-0:   688.538 us   +/-  2.084 (min:   687.104 / max:   696.320) us
shape=(524288, 32), acc=[]      :    CPU:    28.239 us   +/-  6.005 (min:    24.802 / max:    52.128) us     GPU-0:   208.589 us   +/-  3.860 (min:   205.824 / max:   221.184) us
shape=(1048576, 16), acc=['cub']:    CPU:    22.190 us   +/-  1.817 (min:    20.727 / max:    28.973) us     GPU-0:  1351.373 us   +/-  2.150 (min:  1349.632 / max:  1359.872) us
shape=(1048576, 16), acc=[]     :    CPU:    26.513 us   +/-  2.789 (min:    24.741 / max:    36.449) us     GPU-0:   197.530 us   +/-  2.746 (min:   195.584 / max:   206.848) us
shape=(2097152, 8), acc=['cub'] :    CPU:    22.394 us   +/-  1.879 (min:    20.600 / max:    29.345) us     GPU-0:  2661.018 us   +/-  6.646 (min:  2652.160 / max:  2675.712) us
shape=(2097152, 8), acc=[]      :    CPU:    26.793 us   +/-  2.864 (min:    24.771 / max:    36.733) us     GPU-0:   161.946 us   +/-  2.883 (min:   159.744 / max:   172.032) us
shape=(4194304, 4), acc=['cub'] :    CPU:    23.579 us   +/-  2.721 (min:    21.729 / max:    32.016) us     GPU-0:  5252.250 us   +/-  8.810 (min:  5236.736 / max:  5270.528) us
shape=(4194304, 4), acc=[]      :    CPU:    26.353 us   +/-  2.672 (min:    24.312 / max:    35.980) us     GPU-0:   144.640 us   +/-  2.687 (min:   142.336 / max:   154.624) us
shape=(8388608, 2), acc=['cub'] :    CPU:    23.547 us   +/-  2.921 (min:    21.738 / max:    35.030) us     GPU-0: 10224.179 us   +/- 18.385 (min: 10207.232 / max: 10266.624) us
shape=(8388608, 2), acc=[]      :    CPU:    27.276 us   +/-  3.147 (min:    24.392 / max:    38.883) us     GPU-0:   138.445 us   +/-  3.232 (min:   135.168 / max:   150.528) us
shape=(16777216, 1), acc=['cub']:    CPU:    23.375 us   +/-  3.250 (min:    21.518 / max:    33.582) us     GPU-0: 20320.819 us   +/- 34.711 (min: 20283.392 / max: 20376.575) us
shape=(16777216, 1), acc=[]     :    CPU:    28.918 us   +/-  5.404 (min:    25.087 / max:    46.885) us     GPU-0:   137.882 us   +/-  5.254 (min:   134.144 / max:   155.648) us
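
To make the crossover easier to see, the sketch below computes the CUB-to-default ratio of the mean GPU-0 times for the short-axis rows of the log above. The timings are copied from that log; the 5% tie tolerance and the names `GPU_MEAN_US` and `slowdown` are assumptions introduced only for this illustration.

```python
# Mean GPU-0 times in microseconds copied from the result log above,
# keyed by reduction-axis length: (cub, default).
GPU_MEAN_US = {
    256: (128.051, 202.240),
    128: (195.789, 194.970),
    64: (361.882, 194.611),
    32: (688.538, 208.589),
    16: (1351.373, 197.530),
    8: (2661.018, 161.946),
    4: (5252.250, 144.640),
    2: (10224.179, 138.445),
    1: (20320.819, 137.882),
}


def slowdown(cub_us, default_us):
    """Ratio > 1 means the CUB path took longer than the default path."""
    return cub_us / default_us


ratios = {n: slowdown(*times) for n, times in GPU_MEAN_US.items()}

# Treat differences under 5% as a tie (arbitrary tolerance, see lead-in).
slower = [n for n, r in ratios.items() if r > 1.05]

for n, r in ratios.items():
    print(f"reduction length {n:>3}: cub/default = {r:7.2f}")
print("CUB slower for reduction lengths:", slower)
```

On these A100 numbers the ratio roughly doubles each time the reduction axis is halved below 128. The threshold may differ on other GPUs or dtypes, which this benchmark does not cover.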

Installation

None

Environment

A100

Additional Information

No response