Zhiwei35's repositories
PLCT-Open-Reports
Slides and reports from the PLCT Lab on RISC-V and MLIR
Awesome-GPU
Awesome resources for GPUs
code-samples
Source code examples from the Parallel Forall Blog
Cpp_houjie
Slides (PPT) and code from Hou Jie's C++ courses
CPP_Optimizations_Diary
Tips and tricks to optimize your C++ code
cutlass
CUDA Templates for Linear Algebra Subroutines
DeepLearningSystem
Deep Learning System core principles introduction.
flash_attention_inference
Compressed version of flash attention through flash decoding
HPCInfo
Information about many aspects of high-performance computing. Wiki content moved to ~/docs.
IOS
[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
Megatron-LM
Ongoing research training transformer models at scale
modern-cpp-tutorial
📚 Modern C++ Tutorial: C++11/14/17/20 On the Fly | https://changkun.de/modern-cpp/
MyTinySTL
STL class impl in C++11
llama.cpp
Pure C/C++ LLaMA
nnfusion
A flexible and efficient deep neural network (DNN) compiler that generates high-performance executable from a DNN model description.
Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F
Stepwise optimizations of DGEMM on CPU, eventually outperforming Intel MKL, even under multithreading.
OptimizingSeriesTranslation
Chinese version for Agner Fog's optimizing series
PaddleCustomDevice
PaddlePaddle (『飞桨』) custom device implementation: integration of custom hardware.
train-LeNet5-by-cuda
Train a LeNet5 with CUDA