wjc404's repositories
GEMM_AVX512F
SGEMM and DGEMM subroutines using AVX512F instructions.
Simple_CUDA_GEMM
SGEMM kernel for NVIDIA Pascal GPUs, able to achieve 60% of theoretical peak performance.
GEMM_AVX2_FMA3
sgemm and dgemm subroutines for large matrices, slightly outperforming Intel MKL.
bitonic_fp32_avx_top16
Top-K selection with K = 16 or 32, based on the bitonic sort algorithm, using Intel AVX instructions.
COMPLEX_GEMM_AVX2_FMA3
cgemm and zgemm subroutines for large matrices, using AVX2 and FMA3 instructions, with performance comparable to MKL 2018.
Language: C, License: GPL-3.0
cpu_gemm_opt
How to design a CPU GEMM on x86 with 256-bit AVX that can beat OpenBLAS.
Language: C++, License: MIT
GEMM3M_AVX2_FMA3
cgemm3m and zgemm3m subroutines for large matrices, using AVX2 and FMA3 instructions.
Language: C, License: GPL-3.0
OpenBLAS
OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
Language: Fortran, License: BSD-3-Clause