kf-zhang

Shanghai

kf-zhang's starred repositories

CS-Notes

:books: Essential knowledge for technical interviews, Leetcode, operating systems, computer networks, system design

176318 5293 570

spdlog

Fast C++ logging library.

Language: C++ | NOASSERTION | 24164 440 2169

llm.c

LLM training in simple, raw C/CUDA

Language: Cuda | MIT | 24142 241 139

modern-cpp-tutorial

📚 Modern C++ Tutorial: C++11/14/17/20 On the Fly | https://changkun.de/modern-cpp/

Language: C++ | MIT | 24005 619 131

mlx

MLX: An array framework for Apple silicon

Language: C++ | MIT | 16839 146 536

horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Language: Python | NOASSERTION | 14226 335 2241

ml-visuals

🎨 ML Visuals contains figures and templates which you can reuse and customize to improve your scientific writing.

MIT | 13375 115 49

triton

Development repository for the Triton language and compiler

Language: C++ | MIT | 13122 194 1453

micrograd

A tiny scalar-valued autograd engine and a neural net library on top of it with PyTorch-like API

Language: Jupyter Notebook | MIT | 10241 149 30

AITemplate

AITemplate is a Python framework which renders neural network into high performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.

Language: Python | Apache-2.0 | 4546 82 244

tiny-cuda-nn

Lightning fast C++/CUDA neural network framework

Language: C++ | NOASSERTION | 3723 49 389

tvm_mlir_learn

A collection of compiler learning resources.

Language: Python | 2104 36 4

TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.

Language: Python | Apache-2.0 | 1885 33 336

corundum

Open source FPGA-based NIC and platform for in-network compute

Language: Verilog | NOASSERTION | 1680 87 165

core-to-core-latency

Measures the latency between CPU cores

Language: Jupyter Notebook | MIT | 1095 13 101

Triton-Puzzles

Puzzles for learning Triton

Language: Jupyter Notebook | Apache-2.0 | 1029 10 11

gdrcopy

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology

Language: C++ | MIT | 872 55 185

ringattention

Transformers with Arbitrarily Large Context

Language: Python | Apache-2.0 | 571 5 15

ring-flash-attention

Ring attention implementation with flash attention

Language: Python | MIT | 565 10 34

multi-gpu-programming-models

Examples demonstrating available options to program multiple GPUs in a single node or a cluster

Language: Cuda | BSD-3-Clause | 547 29 10

hack-SysML

The road to hack SysML and become a systems expert

Language: Emacs Lisp | 429 32 2

how-to-learn-deep-learning-framework

how to learn PyTorch and OneFlow

Apache-2.0 | 342 7 1

kiuikit

A toolkit for 3D computer vision tasks.

Language: Python | Apache-2.0 | 202 6 13

tiny-flash-attention

flash attention tutorial written in python, triton, cuda, cutlass

Language: Cuda | 182 3 8

cutlass-kernels

Language: Cuda | MIT | 153 10 5

ring-attention

ring-attention experiments

Language: Python | Apache-2.0 | 89 9 10

gpu-arch-microbenchmark

Dissecting NVIDIA GPU Architecture

Language: Cuda | 80 2 2

cute-gemm

Language: C++ | 74 2 4

compile-time-printer

Prints values and types during compilation!

Language: Python | BSL-1.0 | 55 5 11

ring-attention-pytorch

A tiny ring attention implementation for learning purposes

Language: Python | 5 3 1