Uranus (UranusSeven)

Company: Xprobe

Uranus's starred repositories

OpenDevin

🐚 OpenDevin: Code Less, Make More

Language: Python | License: MIT | Stargazers: 30,189 | Issues: 281 | Issues: 1,252

llama3

The official Meta Llama 3 GitHub site

Language: Python | License: NOASSERTION | Stargazers: 26,541 | Issues: 219 | Issues: 250

mem0

The Memory layer for your AI apps

Language: Python | License: Apache-2.0 | Stargazers: 22,213 | Issues: 127 | Issues: 657

outlines

Structured Text Generation

Language: Python | License: Apache-2.0 | Stargazers: 8,187 | Issues: 47 | Issues: 553

Awesome-LLM-Inference

📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention etc.

lectures

Material for cuda-mode lectures

Language: Jupyter Notebook | License: Apache-2.0 | Stargazers: 2,487 | Issues: 35 | Issues: 7

onediff

OneDiff: An out-of-the-box acceleration library for diffusion models.

Language: Jupyter Notebook | License: Apache-2.0 | Stargazers: 1,618 | Issues: 40 | Issues: 433

ThunderKittens

Tile primitives for speedy kernels

Language: Cuda | License: MIT | Stargazers: 1,525 | Issues: 25 | Issues: 26

Mooncake

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

MInference

[NeurIPS'24 Spotlight] Speeds up long-context LLM inference by computing attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.

Language: Python | License: MIT | Stargazers: 726 | Issues: 6 | Issues: 53

ringattention

Transformers with Arbitrarily Large Context

Language: Python | License: Apache-2.0 | Stargazers: 622 | Issues: 6 | Issues: 16

ring-attention-pytorch

Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch

Language: Python | License: MIT | Stargazers: 461 | Issues: 11 | Issues: 14
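The core idea behind Ring Attention is that each device keeps its query block while key/value blocks rotate around a ring of peers, with partial attention results merged through a numerically stable running softmax. A minimal single-process NumPy sketch of that merge step, with the ring communication simulated by iterating over blocks (all names here are illustrative, not the repo's API):

```python
import numpy as np

def ring_attention_sim(q, kv_blocks):
    """Simulate ring attention: q stays local, (k, v) blocks arrive one per 'rotation'."""
    n, d = q.shape
    out = np.zeros((n, d))          # unnormalized weighted sum of values
    m = np.full((n, 1), -np.inf)    # running row-max, for stable softmax
    l = np.zeros((n, 1))            # running softmax normalizer
    for k, v in kv_blocks:
        s = q @ k.T / np.sqrt(d)                      # scores vs. this KV block
        m_new = np.maximum(m, s.max(axis=1, keepdims=True))
        p = np.exp(s - m_new)                         # block-local softmax numerators
        correction = np.exp(m - m_new)                # rescale old accumulators
        l = l * correction + p.sum(axis=1, keepdims=True)
        out = out * correction + p @ v
        m = m_new
    return out / l
```

Because each block only contributes through the rescaled accumulators, the result is exactly equal to full softmax attention over the concatenated KV cache, which is what lets the real implementation overlap communication with compute.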

AI-Software-Startups

A Survey of AI startups

ttt-lm-jax

Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States

msccl

Microsoft Collective Communication Library

Language: C++ | License: NOASSERTION | Stargazers: 309 | Issues: 13 | Issues: 27

LLM-Viewer

Analyze the inference of large language models (LLMs): computation, storage, transmission, and the hardware roofline model, all in a user-friendly interface.

Language: Python | License: MIT | Stargazers: 285 | Issues: 2 | Issues: 9
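The roofline model mentioned in the description bounds attainable throughput by the lesser of the compute roof and the memory roof (bandwidth × arithmetic intensity). A tiny sketch of that bound, using illustrative A100-like numbers rather than exact specs:

```python
def roofline_flops(peak_flops, mem_bw_bytes, arithmetic_intensity):
    """Attainable FLOP/s = min(compute roof, bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, mem_bw_bytes * arithmetic_intensity)

# Illustrative numbers only: ~312 TFLOP/s FP16 peak, ~2 TB/s HBM bandwidth.
PEAK, BW = 312e12, 2.0e12
# Decode-phase GEMV has low intensity -> memory bound:
print(roofline_flops(PEAK, BW, 0.5))     # 1e12, far below peak
# Large-batch GEMM has high intensity -> compute bound:
print(roofline_flops(PEAK, BW, 1000.0))  # 312e12, hits the compute roof
```

This is the basic reason pre-filling and decoding stress a GPU so differently, and why a roofline view is useful when analyzing LLM inference.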

MatmulTutorial

An easy-to-understand TensorOp Matmul tutorial

Language: C++ | License: Apache-2.0 | Stargazers: 278 | Issues: 8 | Issues: 11

mscclpp

MSCCL++: A GPU-driven communication stack for scalable AI applications

Language: C++ | License: MIT | Stargazers: 235 | Issues: 19 | Issues: 89

KIVI

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Language: Python | License: MIT | Stargazers: 220 | Issues: 5 | Issues: 24
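Asymmetric low-bit quantization, as in KIVI's title, maps each group's [min, max] range onto the four levels a 2-bit code can represent, storing a scale and zero-point per group. A NumPy sketch of the round-trip along one axis (helper names are hypothetical, and this omits KIVI's actual per-channel key / per-token value grouping):

```python
import numpy as np

def quantize_2bit_asymmetric(x, axis=-1):
    """Asymmetric 2-bit quantization: map [min, max] per group onto {0, 1, 2, 3}."""
    xmin = x.min(axis=axis, keepdims=True)
    xmax = x.max(axis=axis, keepdims=True)
    scale = (xmax - xmin) / 3.0                # 4 levels -> 3 steps across the range
    scale = np.where(scale == 0, 1.0, scale)   # guard constant groups
    q = np.clip(np.round((x - xmin) / scale), 0, 3).astype(np.uint8)
    return q, scale, xmin

def dequantize(q, scale, xmin):
    return q.astype(np.float32) * scale + xmin
```

Rounding to the nearest level bounds the per-element reconstruction error by half a quantization step (scale / 2), which is why keeping the scale per small group matters at 2-bit precision.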

sarathi-serve

A low-latency & high-throughput serving engine for LLMs

Language: Python | License: Apache-2.0 | Stargazers: 201 | Issues: 6 | Issues: 17

flux

A fast communication-overlapping library for tensor parallelism on GPUs.

Language: C++ | License: Apache-2.0 | Stargazers: 200 | Issues: 7 | Issues: 23

vattention

Dynamic Memory Management for Serving LLMs without PagedAttention

Language: C | License: MIT | Stargazers: 195 | Issues: 2 | Issues: 7

libflash_attn

Standalone Flash Attention v2 kernel without libtorch dependency

Language: C++ | License: BSD-3-Clause | Stargazers: 96 | Issues: 15 | Issues: 5

SwiftTransformer

High-performance Transformer implementation in C++.

Odysseus-Transformer

Odysseus: Playground of LLM Sequence Parallelism

compressed-tensors

A safetensors extension to efficiently store sparse quantized tensors on disk

Language: Python | License: Apache-2.0 | Stargazers: 34 | Issues: 11 | Issues: 4