yy-space (imisszxq)

yy-space's starred repositories

mscclpp

MSCCL++: A GPU-driven communication stack for scalable AI applications

Language: C++ · License: MIT · Stargazers: 204 · Issues: 0

ktransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

Language: Python · License: Apache-2.0 · Stargazers: 289 · Issues: 0

QuaRot

Code for QuaRot, an end-to-end 4-bit inference scheme for large language models.

Language: Python · License: Apache-2.0 · Stargazers: 236 · Issues: 0

marlin

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

Language: Python · License: Apache-2.0 · Stargazers: 509 · Issues: 0

llama.cpp

LLM inference in C/C++

Language: C++ · License: MIT · Stargazers: 63623 · Issues: 0

sarathi-serve

A low-latency & high-throughput serving engine for LLMs

Language: Python · License: Apache-2.0 · Stargazers: 143 · Issues: 0

lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Language: Python · License: Apache-2.0 · Stargazers: 3891 · Issues: 0
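
For a sense of the API surface, a minimal sketch of LMDeploy's high-level pipeline; the model name is just an illustrative Hugging Face path:

```python
from lmdeploy import pipeline

# Illustrative model path; any chat model supported by LMDeploy should work
pipe = pipeline("internlm/internlm2-chat-1_8b")

# Batch inference: a list of prompts in, a list of Response objects out
responses = pipe(["Hi, please introduce yourself."])
print(responses[0].text)
```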

cuda-training-series

Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)

Language: Cuda · Stargazers: 500 · Issues: 0

NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

Language: Python · License: Apache-2.0 · Stargazers: 11253 · Issues: 0
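
A minimal ASR sketch using NeMo's pretrained-model loader; the checkpoint name and audio path are placeholder assumptions:

```python
import nemo.collections.asr as nemo_asr

# Load a pretrained CTC model (checkpoint name is illustrative)
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained("stt_en_conformer_ctc_small")

# Transcribe a local audio file (path is a placeholder)
transcripts = asr_model.transcribe(["sample.wav"])
print(transcripts[0])
```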

Mooncake

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

Stargazers: 980 · Issues: 0

vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Language: Python · License: Apache-2.0 · Stargazers: 24874 · Issues: 0
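
vLLM's offline batch-inference entry point, as a minimal sketch (the model name is illustrative):

```python
from vllm import LLM, SamplingParams

# Any Hugging Face model path works here; opt-125m is just a small example
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)
```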

DistServe

Disaggregated serving system for Large Language Models (LLMs).

Language: Jupyter Notebook · License: Apache-2.0 · Stargazers: 241 · Issues: 0

lectures

Material for cuda-mode lectures

Language: Jupyter Notebook · License: Apache-2.0 · Stargazers: 2157 · Issues: 0

TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.

Language: Python · License: Apache-2.0 · Stargazers: 1747 · Issues: 0
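
A minimal sketch of FP8 execution with Transformer Engine's PyTorch API, assuming a Hopper/Ada GPU; the layer sizes and recipe values are arbitrary:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Delayed-scaling FP8 recipe; margin/format values are illustrative
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)

layer = te.Linear(768, 768, bias=True).cuda()
x = torch.randn(32, 768, device="cuda")

# GEMMs inside this context run in FP8
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
```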

xcpc-algorithm-templates

XCPC/ICPC/CCPC algorithm templates

Language: C++ · License: MIT · Stargazers: 478 · Issues: 0

sglang

SGLang is yet another fast serving framework for large language models and vision language models.

Language: Python · License: Apache-2.0 · Stargazers: 4172 · Issues: 0
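
SGLang's frontend DSL in a minimal sketch, assuming a local SGLang server is already running; the port and question are placeholders:

```python
import sglang as sgl

@sgl.function
def qa(s, question):
    # Build the prompt, then let the model fill in the answer
    s += "Q: " + question + "\n"
    s += "A: " + sgl.gen("answer", max_tokens=64)

# Assumes a server launched beforehand, e.g.:
#   python -m sglang.launch_server --model-path <model> --port 30000
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = qa.run(question="What is the capital of France?")
print(state["answer"])
```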

CUDATutorial

A CUDA tutorial for learning CUDA programming from scratch

Language: Cuda · Stargazers: 164 · Issues: 0

maxas

Assembler for NVIDIA Maxwell architecture

Language: Sass · License: MIT · Stargazers: 936 · Issues: 0

How_to_optimize_in_GPU

A series of GPU optimization topics introducing in detail how to optimize CUDA kernels, covering several basic kernel optimizations (elementwise, reduce, sgemv, sgemm, etc.) whose performance is at or near the theoretical limit.

Language: Cuda · License: Apache-2.0 · Stargazers: 783 · Issues: 0

Skywork-MoE

Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models

Stargazers: 120 · Issues: 0

flash-llm

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Language: Cuda · License: Apache-2.0 · Stargazers: 161 · Issues: 0

fp6_llm

Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).

Language: Cuda · License: Apache-2.0 · Stargazers: 164 · Issues: 0

flashinfer

FlashInfer: Kernel Library for LLM Serving

Language: Cuda · License: Apache-2.0 · Stargazers: 983 · Issues: 0
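
A minimal sketch of FlashInfer's single-request decode attention; shapes follow its q [num_heads, head_dim] / kv [kv_len, num_heads, head_dim] convention, and all sizes are arbitrary:

```python
import torch
import flashinfer

num_heads, head_dim, kv_len = 32, 128, 1024
q = torch.randn(num_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(kv_len, num_heads, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(kv_len, num_heads, head_dim, device="cuda", dtype=torch.float16)

# Decode-phase attention for one query token against the full KV cache
o = flashinfer.single_decode_with_kv_cache(q, k, v)
print(o.shape)  # (num_heads, head_dim)
```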

qserve

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Language: Python · License: Apache-2.0 · Stargazers: 374 · Issues: 0

Atom

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Language: Cuda · Stargazers: 239 · Issues: 0

llm.c

LLM training in simple, raw C/CUDA

Language: Cuda · License: MIT · Stargazers: 22663 · Issues: 0

MatmulTutorial

An easy-to-understand TensorOp matmul tutorial

Language: C++ · License: Apache-2.0 · Stargazers: 240 · Issues: 0

glake

GLake: optimizing GPU memory management and IO transmission.

Language: Python · License: Apache-2.0 · Stargazers: 334 · Issues: 0

CUDATutorial

A self-study tutorial for CUDA high-performance programming.

Language: JavaScript · License: Apache-2.0 · Stargazers: 100 · Issues: 0