Performance Discrepancy Between 2D and 3D (input) Matrix Multiplications
rajagond opened this issue
Description
There is a significant performance discrepancy between 2D and 3D matrix multiplications in CuPy. Multiplying a 2D input by the weight matrix completes far faster than multiplying a 3D input with the same total number of elements: in the run below, the 3D case is about 12.5 times slower (2285.9 ms vs. 182.2 ms), whereas PyTorch shows a difference of only about 1.5% for the same shapes. This suggests that cupy.matmul is missing an optimization for 3D inputs. Could you explain if I am missing something?
CuPy Output
python3 cupy_3d_vs_2d.py
input_tensor_2d: (65536, 24576), device id: <CUDA Device 0>
input_tensor_3d: (32, 2048, 24576), device id: <CUDA Device 0>
weights: (24576, 12288), device id: <CUDA Device 0>
output_2d: (65536, 12288), device id: <CUDA Device 0>
output_3d: (32, 2048, 12288), device id: <CUDA Device 0>
CuPy MatMul Time (2d): 182.21650 ms
CuPy MatMul Time (3d): 2285.94271 ms
CuPy MatMul Time (% diff): 1154.52%
PyTorch Output
input_tensor_2d: torch.Size([65536, 24576]), device id: cuda:0
input_tensor_3d: torch.Size([32, 2048, 24576]), device id: cuda:0
weights: torch.Size([24576, 12288]), device id: cuda:0
output_2d: torch.Size([65536, 12288]), device id: cuda:0
output_3d: torch.Size([32, 2048, 12288]), device id: cuda:0
PyTorch 2D Matmul: 182.4959375 ms
PyTorch 3D Matmul: 185.20947916666665 ms
PyTorch Matmul % difference: 1.4869052450368423%
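To put the timings above in perspective, they can be converted into effective throughput. The following is a minimal sketch, assuming the usual count of 2*m*k*n floating-point operations for an (m, k) @ (k, n) product; the helper name gemm_tflops is hypothetical and is not part of the reproduction script.

```python
def gemm_tflops(m, k, n, elapsed_ms):
    """Effective throughput of an (m, k) @ (k, n) product.

    Assumes the conventional 2*m*k*n FLOP count (one multiply and one add
    per inner-product term). Hypothetical helper, for illustration only.
    """
    flops = 2 * m * k * n
    return flops / (elapsed_ms * 1e-3) / 1e12


# Shapes and average timings taken from the CuPy output in this report.
m, k, n = 32 * 2048, 24576, 12288
tflops_2d = gemm_tflops(m, k, n, 182.21650)
tflops_3d = gemm_tflops(m, k, n, 2285.94271)
print(f"2D: {tflops_2d:.1f} TFLOP/s, 3D: {tflops_3d:.1f} TFLOP/s")
```

Both cases perform the same arithmetic, so the gap in throughput is entirely a matter of how the 3D product is dispatched.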
To Reproduce
# cupy_3d_vs_2d.py
import cupy as cp
import argparse

debug = False


def main(batch_size, seq_len, hidden_size, num_gpus, num_warmup, active_iters):
    # Set device to GPU:0
    cp.cuda.Device(0).use()
    cp.cuda.device.get_cublas_handle()

    # Create tensors
    input_tensor_2d = cp.random.randn(batch_size * seq_len, (4 * hidden_size) // num_gpus, dtype=cp.float32)
    input_tensor_3d = cp.random.randn(batch_size, seq_len, (4 * hidden_size) // num_gpus, dtype=cp.float32)
    weights = cp.random.randn((4 * hidden_size) // num_gpus, hidden_size, dtype=cp.float32)
    output_2d = cp.zeros((batch_size * seq_len, hidden_size), dtype=cp.float16)
    output_3d = cp.zeros((batch_size, seq_len, hidden_size), dtype=cp.float16)

    # cast to float16
    input_tensor_2d = input_tensor_2d.astype(cp.float16)
    input_tensor_3d = input_tensor_3d.astype(cp.float16)
    weights = weights.astype(cp.float16)

    # create custom stream
    stream = cp.cuda.Stream(non_blocking=True)

    # create events
    start_event = cp.cuda.Event(disable_timing=False)
    end_event = cp.cuda.Event(disable_timing=False)

    print(f"input_tensor_2d: {input_tensor_2d.shape}, device id: {input_tensor_2d.device}", flush=True)
    print(f"input_tensor_3d: {input_tensor_3d.shape}, device id: {input_tensor_3d.device}", flush=True)
    print(f"weights: {weights.shape}, device id: {weights.device}", flush=True)
    print(f"output_2d: {output_2d.shape}, device id: {output_2d.device}", flush=True)
    print(f"output_3d: {output_3d.shape}, device id: {output_3d.device}", flush=True)

    # 2d matmul
    cp.cuda.runtime.deviceSynchronize()
    with stream:
        # Warmup iterations
        for _ in range(num_warmup):
            cp.matmul(input_tensor_2d, weights, out=output_2d)
        # Active iterations to measure the time
        start_event.record(stream=stream)
        for _ in range(active_iters):
            cp.matmul(input_tensor_2d, weights, out=output_2d)
        end_event.record(stream=stream)
    end_event.synchronize()
    elapsed_time_2d_avg = cp.cuda.get_elapsed_time(start_event, end_event) / active_iters

    # 3d matmul
    cp.cuda.runtime.deviceSynchronize()
    with stream:
        # Warmup iterations
        for _ in range(num_warmup):
            cp.matmul(input_tensor_3d, weights, out=output_3d)
        # Active iterations to measure the time
        start_event.record(stream=stream)
        for _ in range(active_iters):
            cp.matmul(input_tensor_3d, weights, out=output_3d)
        end_event.record(stream=stream)
    end_event.synchronize()
    elapsed_time_3d_avg = cp.cuda.get_elapsed_time(start_event, end_event) / active_iters
    cp.cuda.runtime.deviceSynchronize()

    # Print the results
    print(f"CuPy MatMul Time (2d): {elapsed_time_2d_avg:.5f} ms")
    print(f"CuPy MatMul Time (3d): {elapsed_time_3d_avg:.5f} ms")
    print(f"CuPy MatMul Time (% diff): {100 * (elapsed_time_3d_avg - elapsed_time_2d_avg) / elapsed_time_2d_avg:.2f}%")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-B', '--batch_size', type=int, required=False, default=32)
    parser.add_argument('-l', '--seq_len', type=int, required=False, default=2048)
    parser.add_argument('-hs', '--hidden_size', type=int, required=False, default=12288)
    parser.add_argument('-g', '--num_gpus', type=int, required=False, default=2)
    parser.add_argument('-w', '--num_warmup', type=int, required=False, default=50)
    parser.add_argument('-i', '--active_iters', type=int, required=False, default=150)
    args = parser.parse_args()
    main(args.batch_size, args.seq_len, args.hidden_size, args.num_gpus, args.num_warmup, args.active_iters)
Installation
Wheel (pip install cupy-***)
Environment
OS : Linux-6.2.0-1014-azure-x86_64-with-glibc2.29
Python Version : 3.8.10
CuPy Version : 12.3.0
CuPy Platform : NVIDIA CUDA
NumPy Version : 1.24.4
SciPy Version : None
Cython Build Version : 0.29.36
Cython Runtime Version : None
CUDA Root : /usr/local/cuda
nvcc PATH : /usr/local/cuda/bin/nvcc
CUDA Build Version : 12020
CUDA Driver Version : 12020
CUDA Runtime Version : 12010
cuBLAS Version : (available)
cuFFT Version : 11002
cuRAND Version : 10302
cuSOLVER Version : (11, 4, 5)
cuSPARSE Version : (available)
NVRTC Version : (12, 1)
Thrust Version : 200101
CUB Build Version : 200101
Jitify Build Version : <unknown>
cuDNN Build Version : (not loaded; try `import cupy.cuda.cudnn` first)
cuDNN Version : (not loaded; try `import cupy.cuda.cudnn` first)
NCCL Build Version : 21602
NCCL Runtime Version : 21701
cuTENSOR Version : None
cuSPARSELt Build Version : None
Device 0 Name : NVIDIA A100 80GB PCIe
Device 0 Compute Capability : 80
Device 0 PCI Bus ID : 0001:00:00.0
Device 1 Name : NVIDIA A100 80GB PCIe
Device 1 Compute Capability : 80
Device 1 PCI Bus ID : 0002:00:00.0
Device 2 Name : NVIDIA A100 80GB PCIe
Device 2 Compute Capability : 80
Device 2 PCI Bus ID : 0003:00:00.0
Device 3 Name : NVIDIA A100 80GB PCIe
Device 3 Compute Capability : 80
Device 3 PCI Bus ID : 0004:00:00.0
Additional Information
No response
Hi @rajagond, thanks for the detailed report. I've confirmed the problem with -B 8:
input_tensor_2d: (16384, 24576), device id: <CUDA Device 0>
input_tensor_3d: (8, 2048, 24576), device id: <CUDA Device 0>
weights: (24576, 12288), device id: <CUDA Device 0>
output_2d: (16384, 12288), device id: <CUDA Device 0>
output_3d: (8, 2048, 12288), device id: <CUDA Device 0>
CuPy MatMul Time (2d): 51.69049 ms
CuPy MatMul Time (3d): 790.34568 ms
CuPy MatMul Time (% diff): 1429.00%
matmul has a fast path for the 2D case (which just uses cupy.dot), and it looks like there is room for improvement for the 3D case.
cupy/cupy/_core/_routines_linalg.pyx
Line 809 in 34f8edb
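Until the 3D path is improved, one possible user-side workaround is to route the product through the 2D path by flattening the leading dimensions. The sketch below is an assumption-laden illustration, not something proposed in this thread: the helper name matmul_via_2d and its xp parameter are hypothetical, it only covers an (..., K) input multiplied by a 2D (K, N) weight, and it is demonstrated with NumPy so that it runs without a GPU. Passing xp=cupy with CuPy arrays should exercise the same logic, though the speedup on a given setup has not been measured here.

```python
import numpy as np


def matmul_via_2d(a, w, xp=np):
    """Multiply an (..., K) array by a (K, N) weight using one 2D product.

    Hypothetical helper: `xp` is the array module (numpy here, cupy on GPU).
    """
    lead_shape = a.shape[:-1]
    # Collapse all leading dimensions; no copy is made if `a` is C-contiguous.
    a2d = a.reshape(-1, a.shape[-1])
    out2d = xp.matmul(a2d, w)
    # Restore the original leading dimensions on the result.
    return out2d.reshape(lead_shape + (w.shape[-1],))


# Small CPU demonstration with shapes analogous to (batch, seq_len, hidden).
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8, 16)).astype(np.float32)
w = rng.standard_normal((16, 5)).astype(np.float32)
result = matmul_via_2d(a, w)
reference = np.matmul(a, w)
print(result.shape, np.allclose(result, reference, atol=1e-4))
```

This works because a 3D input multiplied by a single shared 2D weight is mathematically the same as one large 2D product over the flattened rows, which is exactly the shape the 2D benchmark in this report uses.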