wpybtw / test

bash run.sh

using m=10000000 n=64
 CUDA kernel takes 10 ms
 verified 
Input shape torch.Size([10000000, 64])
TORCH max takes 24.503946 ms

Tested on sm70 and sm86 (RTX 4060 Ti)
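
For context on the `TORCH max takes ...` line in the output above, here is a minimal sketch of how such a PyTorch baseline could be timed. It is not the repository's script. The helper name `time_torch_max`, the choice of reducing along `dim=1` (one max per row of the `[m, n]` input), the warm-up call, and the iteration count are all assumptions made for illustration. The explicit `torch.cuda.synchronize()` calls are there because CUDA kernels launch asynchronously, so timing without them would measure only the launch.

```python
import time

import torch


def time_torch_max(m: int, n: int, dim: int = 1, iters: int = 10) -> float:
    """Return the mean wall-clock time in ms of torch.max over `dim`.

    Hypothetical helper: the reduction dimension is an assumption,
    the README does not say which axis the kernel reduces.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.rand(m, n, device=device)

    # Warm-up call so one-time initialization is not counted.
    torch.max(x, dim=dim)
    if device == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        values, indices = torch.max(x, dim=dim)
    if device == "cuda":
        # Wait for all queued kernels before stopping the clock.
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0 / iters


if __name__ == "__main__":
    # Smaller than the README's m=10000000 so it also runs on modest hardware.
    m, n = 100_000, 64
    print(f"Input shape {(m, n)}")
    print(f"TORCH max takes {time_torch_max(m, n):.6f} ms")
```

Timings from this sketch depend on the device and on the chosen shape, so they are not directly comparable to the figures reported above unless `m`, `n`, and the GPU match.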

Languages

Cuda 63.2%, C++ 20.6%, Python 9.6%, Shell 6.6%