Jianfeng Yan's repositories
amgcl
C++ library for solving large sparse linear systems with the algebraic multigrid method
aviation2017_talk
aviation2017 talk
blislab
BLISlab: A Sandbox for Optimizing GEMM
cmake-examples
Useful CMake Examples
COSMA
Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm
cpu_gemm_opt
How to design a CPU GEMM on x86 with AVX256 that can beat OpenBLAS.
DeepLearningSystem
An introduction to the core principles of deep learning systems.
HIP-Performance-Optmization-on-VEGA64
14 basic topics on VEGA64 performance optimization
How_to_optimize_in_GPU
A series of GPU optimization topics that explains in detail how to optimize programs on the GPU. The reduce optimization is complete. For GEMM, the CUDA C code is complete; it is currently being tuned at the assembly level, and that code will be released later.
howto
Build recipes and other how-tos
jacobi-svd
Numerical experiments on Jacobi SVD algorithm
llm-course
Course to get into Large Language Models (LLMs) with roadmaps and Colab notebooks.
llm.c
LLM training in simple, raw C/CUDA
Modern-CPP-Programming
Modern C++ Programming Course (C++11/14/17/20)
Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F
Stepwise optimizations of DGEMM on the CPU, eventually outperforming Intel MKL.
Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernel functions on the RTX 2080 Super to close-to-cuBLAS performance.
Optimizing-SGEMV-on-NVIDIA-GPUs
An implementation of SGEMV with performance comparable to cuBLAS.
siam_cse2017_poster
poster_SAT_for_2nd_PDE
ulmBLAS
ulmBLAS
wmma_extension
An extension library of WMMA API (Tensor Core API)