Uranus (UranusSeven)

Company: Xprobe

Uranus's starred repositories

OpenDevin

🐚 OpenDevin: Code Less, Make More

Language: Python | License: MIT | Stargazers: 29216 | Issues: 281 | Issues: 1193

llama3

The official Meta Llama 3 GitHub site

Language: Python | License: NOASSERTION | Stargazers: 24949 | Issues: 206 | Issues: 211

presto

The official home of the Presto distributed SQL query engine for big data

Language: Java | License: Apache-2.0 | Stargazers: 15805 | Issues: 861 | Issues: 6488

trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Language: Java | License: Apache-2.0 | Stargazers: 9979 | Issues: 173 | Issues: 6472

outlines

Structured Text Generation

Language: Python | License: Apache-2.0 | Stargazers: 7409 | Issues: 45 | Issues: 515

Awesome-LLM-Inference

📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.

onediff

OneDiff: An out-of-the-box acceleration library for diffusion models.

Language: Python | License: Apache-2.0 | Stargazers: 1493 | Issues: 39 | Issues: 394

Mooncake

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

MInference

Speeds up long-context LLM inference by approximating attention with dynamic sparse computation, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.

Language: Python | License: MIT | Stargazers: 598 | Issues: 5 | Issues: 34
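
As a rough illustration of the idea behind the entry above — computing attention only over a sparse subset of positions — here is a toy NumPy sketch of one common static sparse pattern (attention "sinks" plus a causal local window). This is a simplification for intuition, not MInference's actual dynamically searched patterns; the `sink` and `window` parameters are hypothetical.

```python
import numpy as np

def sparse_attention(q, k, v, sink=2, window=4):
    """Toy sparse attention: each query attends only to the first `sink`
    tokens plus a causal local `window` of recent tokens."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.full((T, T), -np.inf)
    for i in range(T):
        mask[i, :min(sink, i + 1)] = 0.0              # attention "sink" tokens
        mask[i, max(0, i - window + 1):i + 1] = 0.0   # causal local window
    z = scores + mask
    p = np.exp(z - z.max(axis=-1, keepdims=True))     # numerically stable softmax
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((10, 8))
k = rng.standard_normal((10, 8))
v = rng.standard_normal((10, 8))
out = sparse_attention(q, k, v)   # (10, 8)
```

Each query row touches at most `sink + window` keys instead of all of them, which is where the pre-filling speedup comes from as context length grows.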

ringattention

Transformers with Arbitrarily Large Context

Language: Python | License: Apache-2.0 | Stargazers: 571 | Issues: 5 | Issues: 15

ring-flash-attention

Ring attention implementation with flash attention

ring-attention-pytorch

Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch

Language: Python | License: MIT | Stargazers: 423 | Issues: 10 | Issues: 12
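
The three ring-attention repos above share one primitive: exact attention computed block-by-block over the KV sequence, with partial results merged by log-sum-exp rescaling so KV blocks can circulate around a ring of devices. A minimal single-process NumPy sketch of that merge step (function names are mine, not taken from these repos):

```python
import numpy as np

def attend_block(q, k, v):
    """Unnormalized attention against one KV block, plus softmax stats."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    m = s.max(axis=-1, keepdims=True)    # per-query running max
    p = np.exp(s - m)
    return p @ v, p.sum(axis=-1, keepdims=True), m

def merge(o1, l1, m1, o2, l2, m2):
    """Merge two blocks' partial outputs via log-sum-exp rescaling."""
    m = np.maximum(m1, m2)
    a1, a2 = np.exp(m1 - m), np.exp(m2 - m)
    l = a1 * l1 + a2 * l2
    return (a1 * o1 + a2 * o2) / l, l, m

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((6, 8))
v = rng.standard_normal((6, 8))

# Reference: full softmax attention over all 6 KV positions.
s = q @ k.T / np.sqrt(8)
p = np.exp(s - s.max(axis=-1, keepdims=True))
ref = (p / p.sum(axis=-1, keepdims=True)) @ v

# Same result, with the KV cache processed in two blocks
# as consecutive steps of a ring pass would.
o1, l1, m1 = attend_block(q, k[:3], v[:3])
o2, l2, m2 = attend_block(q, k[3:], v[3:])
out, _, _ = merge(o1, l1, m1, o2, l2, m2)
```

Because the merge is exact, each device only ever needs one KV block in memory at a time, which is what makes arbitrarily long contexts feasible.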

AI-Software-Startups

A Survey of AI startups

Consistency_LLM

[ICML 2024] CLLMs: Consistency Large Language Models

Language: Python | License: Apache-2.0 | Stargazers: 324 | Issues: 9 | Issues: 10

msccl

Microsoft Collective Communication Library

Language: C++ | License: NOASSERTION | Stargazers: 283 | Issues: 13 | Issues: 27

long-context-attention

Sequence Parallel Attention for Long Context LLM Model Training and Inference

MatmulTutorial

An easy-to-understand TensorOp matmul tutorial

Language: C++ | License: Apache-2.0 | Stargazers: 228 | Issues: 8 | Issues: 9

mscclpp

MSCCL++: A GPU-driven communication stack for scalable AI applications

Language: C++ | License: MIT | Stargazers: 192 | Issues: 17 | Issues: 84

KIVI

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Language: Python | License: MIT | Stargazers: 176 | Issues: 3 | Issues: 21
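
As context for the entry above: asymmetric b-bit quantization stores each value as an integer offset from a per-group minimum together with a per-group scale (KIVI's observation is to group the key cache per-channel and the value cache per-token). A minimal NumPy sketch of the arithmetic, with hypothetical helper names:

```python
import numpy as np

def quantize_asym(x, bits=2, axis=0):
    """Asymmetric quantization: q = round((x - min) / scale), stored as uint8."""
    levels = 2 ** bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)  # guard constant groups
    q = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float64) * scale + lo

rng = np.random.default_rng(0)
k_cache = rng.standard_normal((8, 16))                    # (tokens, head_dim)
q2, scale, zero = quantize_asym(k_cache, bits=2, axis=0)  # per-channel (keys)
k_hat = dequantize(q2, scale, zero)
max_err = np.abs(k_cache - k_hat).max()                   # bounded by scale / 2
```

With 2 bits there are only four levels per group, so the round-trip error is bounded by half the group's scale; the KV cache shrinks roughly 8x versus fp16 at the cost of that bounded error.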

scattermoe

Triton-based implementation of Sparse Mixture of Experts.

Language: Python | License: Apache-2.0 | Stargazers: 152 | Issues: 5 | Issues: 12
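
For readers unfamiliar with the sparse MoE routing that scattermoe implements efficiently in Triton: each token is routed to its top-k experts by a gating projection, and the selected experts' outputs are mixed by renormalized gate weights. A dense, purely illustrative NumPy sketch (the real kernel scatters tokens to experts instead of looping per token):

```python
import numpy as np

def topk_moe(x, w_gate, experts, k=2):
    """Route each token to its k highest-scoring experts; mix their
    outputs by a softmax over the selected gate logits."""
    logits = x @ w_gate                        # (tokens, n_experts)
    idx = np.argsort(-logits, axis=-1)[:, :k]  # top-k expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                # dense loop; real kernels scatter
        g = logits[t, idx[t]]
        g = np.exp(g - g.max())
        g /= g.sum()                           # renormalized gate weights
        for j, e in enumerate(idx[t]):
            out[t] += g[j] * experts[e](x[t])
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 8))
w_gate = rng.standard_normal((8, 4))
experts = [lambda h, s=s: h * (s + 1) for s in range(4)]  # 4 toy experts
out = topk_moe(x, w_gate, experts, k=2)
```

With k much smaller than the expert count, each token pays for only k expert FLOPs, which is the whole point of sparse MoE layers.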

sarathi-serve

A low-latency & high-throughput serving engine for LLMs

Language: Python | License: Apache-2.0 | Stargazers: 112 | Issues: 4 | Issues: 6

flux

A fast communication-overlapping library for tensor parallelism on GPUs.

Language: C++ | License: Apache-2.0 | Stargazers: 104 | Issues: 7 | Issues: 7

libflash_attn

Standalone Flash Attention v2 kernel without libtorch dependency

Language: C++ | License: BSD-3-Clause | Stargazers: 79 | Issues: 15 | Issues: 5

SwiftTransformer

High performance Transformer implementation in C++.

Language: C++ | Stargazers: 49 | Issues: 1 | Issues: 0

Odysseus-Transformer

Odysseus: Playground of LLM Sequence Parallelism