Zhiwei35

Zhiwei35's repositories

LLMdraft

Language: Cuda · 100

PLCT-Open-Reports

Slides and reports from PLCT Lab on RISC-V and MLIR

CC-BY-SA-4.0 · 100

Awesome-GPU

Awesome resources for GPUs

BSD-3-Clause · 000

code-samples

Source code examples from the Parallel Forall Blog

Language: HTML · BSD-3-Clause · 000

Cpp_houjie

Slides and code from Hou Jie's C++ courses

Language: C++ · 000

CPP_Optimizations_Diary

Tips and tricks to optimize your C++ code

Language: C++ · 000

cutlass

CUDA Templates for Linear Algebra Subroutines

Language: C++ · NOASSERTION · 000

DeepLearningSystem

An introduction to the core principles of deep learning systems.

Language: Jupyter Notebook · Apache-2.0 · 000

flash_attention_inference

Condensed version of flash attention through flash decoding

Language: C++ · MIT · 000

GPU_Microbenchmark

Language: Cuda · 000

HPCInfo

Information about many aspects of high-performance computing. Wiki content moved to ~/docs.

Language: C++ · MIT · 000

IOS

[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration

Language: C++ · MIT · 000

Megatron-LM

Ongoing research training transformer models at scale

Language: Python · NOASSERTION · 000

modern-cpp-tutorial

📚 Modern C++ Tutorial: C++11/14/17/20 On the Fly | https://changkun.de/modern-cpp/

Language: C++ · MIT · 000

MyTinySTL

STL class implementations in C++11

Language: C++ · NOASSERTION · 000

llama.cpp

Pure C/C++ LLaMA

MIT · 000

LLM_final

Language: Cuda · 000

megablocks

Apache-2.0 · 000

nnfusion

A flexible and efficient deep neural network (DNN) compiler that generates high-performance executables from a DNN model description.

MIT · 000

Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F

Stepwise optimizations of DGEMM on CPU, eventually outperforming Intel MKL, even under multithreading.

GPL-3.0 · 000

OptimizingSeriesTranslation

Chinese translation of Agner Fog's optimizing series

000

PaddleCustomDevice

PaddlePaddle custom device implementation. (Custom hardware integration for PaddlePaddle)

Apache-2.0 · 000

train-LeNet5-by-cuda

Train a LeNet5 with CUDA

000