kf-zhang

Location: Shanghai

kf-zhang's starred repositories

CS-Notes

📚 Essential fundamentals for technical interviews: Leetcode, computer operating systems, computer networks, system design

spdlog

Fast C++ logging library.

Language: C++ | License: NOASSERTION | Stargazers: 24164 | Issues: 440 | Issues: 2169

llm.c

LLM training in simple, raw C/CUDA

Language: Cuda | License: MIT | Stargazers: 24142 | Issues: 241 | Issues: 139

modern-cpp-tutorial

📚 Modern C++ Tutorial: C++11/14/17/20 On the Fly | https://changkun.de/modern-cpp/

mlx

MLX: An array framework for Apple silicon

horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Language: Python | License: NOASSERTION | Stargazers: 14226 | Issues: 335 | Issues: 2241

ml-visuals

🎨 ML Visuals contains figures and templates which you can reuse and customize to improve your scientific writing.

triton

Development repository for the Triton language and compiler

micrograd

A tiny scalar-valued autograd engine and a neural net library on top of it with PyTorch-like API

Language: Jupyter Notebook | License: MIT | Stargazers: 10241 | Issues: 149 | Issues: 30

AITemplate

AITemplate is a Python framework which renders neural network into high performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.

Language: Python | License: Apache-2.0 | Stargazers: 4546 | Issues: 82 | Issues: 244

tiny-cuda-nn

Lightning fast C++/CUDA neural network framework

Language: C++ | License: NOASSERTION | Stargazers: 3723 | Issues: 49 | Issues: 389

tvm_mlir_learn

A collection of compiler learning resources.

TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.

Language: Python | License: Apache-2.0 | Stargazers: 1885 | Issues: 33 | Issues: 336

corundum

Open source FPGA-based NIC and platform for in-network compute

Language: Verilog | License: NOASSERTION | Stargazers: 1680 | Issues: 87 | Issues: 165

core-to-core-latency

Measures the latency between CPU cores

Language: Jupyter Notebook | License: MIT | Stargazers: 1095 | Issues: 13 | Issues: 101

Triton-Puzzles

Puzzles for learning Triton

Language: Jupyter Notebook | License: Apache-2.0 | Stargazers: 1029 | Issues: 10 | Issues: 11

gdrcopy

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology

Language: C++ | License: MIT | Stargazers: 872 | Issues: 55 | Issues: 185

ringattention

Transformers with Arbitrarily Large Context

Language: Python | License: Apache-2.0 | Stargazers: 571 | Issues: 5 | Issues: 15

ring-flash-attention

Ring attention implementation with flash attention

Language: Python | License: MIT | Stargazers: 565 | Issues: 10 | Issues: 34

multi-gpu-programming-models

Examples demonstrating available options to program multiple GPUs in a single node or a cluster

Language: Cuda | License: BSD-3-Clause | Stargazers: 547 | Issues: 29 | Issues: 10

hack-SysML

The road to hack SysML and become a system expert

how-to-learn-deep-learning-framework

How to learn PyTorch and OneFlow

kiuikit

A toolkit for 3D computer vision tasks.

Language: Python | License: Apache-2.0 | Stargazers: 202 | Issues: 6 | Issues: 13

tiny-flash-attention

Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS

ring-attention

ring-attention experiments

Language: Python | License: Apache-2.0 | Stargazers: 89 | Issues: 9 | Issues: 10

gpu-arch-microbenchmark

Dissecting NVIDIA GPU Architecture

compile-time-printer

Prints values and types during compilation!

Language: Python | License: BSL-1.0 | Stargazers: 55 | Issues: 5 | Issues: 11

ring-attention-pytorch

A tiny ring attention implementation for learning purposes