EmbeddedLLM's repositories
SageAttention-rocm
ROCm Quantized Attention that achieves speedups of 2.1x and 2.7x over FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
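The kernel is meant as a drop-in replacement for scaled dot-product attention. A minimal sketch, assuming this ROCm port preserves the upstream SageAttention Python API (`sageattn`):

```python
import torch
from sageattention import sageattn

# q, k, v in half precision, laid out as (batch, heads, seq_len, head_dim).
# On ROCm builds of PyTorch, the "cuda" device string maps to HIP.
q = torch.randn(1, 8, 1024, 128, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Quantized attention, used where F.scaled_dot_product_attention would be.
out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
```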
vllmWorkshop
vLLM Workshop Content
flash-attention-docker
A repository with CI/CD pipelines that build Docker images with FlashAttention pre-compiled, to speed up development and deployment of other frameworks.
vllm-rocm
A high-throughput and memory-efficient inference and serving engine for LLMs
jamaibase-ts-docs
TypeScript documentation of the JamAI SDK
aiter
AI Tensor Engine for ROCm
aiter-api-watcher
A repository that monitors the fast-changing ROCm/aiter repository and alerts users when AITER functions of interest (e.g., those used in vLLM or SGLang) have been updated past a given commit.
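A hypothetical watch-list sketch illustrating the idea; the symbol names and commit hashes below are invented for illustration, not taken from the repository:

```python
# Invented example: map each aiter symbol a downstream project depends on to
# the last upstream commit at which it was verified, so a scheduled job can
# diff newer commits and raise an alert when a watched function changes.
WATCHLIST = {
    "aiter.ops.gemm_a8w8": "abc1234",   # e.g. used by a vLLM quantized linear path
    "aiter.flash_attn.fwd": "def5678",  # e.g. used by an SGLang attention backend
}
```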
axolotl-amd
Go ahead and axolotl questions
composable_kernel
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
etalon
LLM Serving Performance Evaluation Harness
infinity-executable
Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting a wide range of text-embedding models and frameworks.
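A minimal sketch of embedding through Infinity's Python engine, assuming the upstream `infinity_emb` API:

```python
import asyncio
from infinity_emb import AsyncEmbeddingEngine, EngineArgs

# Build an engine around a sentence-transformers-style model.
engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(model_name_or_path="BAAI/bge-small-en-v1.5", engine="torch")
)

async def main():
    async with engine:  # starts the dynamic batching loop
        embeddings, usage = await engine.embed(sentences=["hello world"])
    print(len(embeddings[0]), usage)

asyncio.run(main())
```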
Liger-Kernel
Efficient Triton Kernels for LLM Training
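Liger kernels are applied by patching Hugging Face model classes before instantiation. A minimal sketch, assuming the upstream Liger-Kernel patching API:

```python
from liger_kernel.transformers import apply_liger_kernel_to_llama
from transformers import AutoModelForCausalLM

# Swap Llama's RMSNorm, RoPE, SwiGLU, and cross-entropy for Liger's Triton
# kernels; must run before the model is created so the patched classes are used.
apply_liger_kernel_to_llama()
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
```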
litellm
Python SDK, Proxy Server (LLM Gateway) to call 100+ LLM APIs in OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
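The core idea is one OpenAI-shaped call for every backend. A minimal sketch using litellm's `completion` API:

```python
from litellm import completion

# The model string selects the provider; the call shape stays the same.
response = completion(
    model="gpt-4o",  # or e.g. "anthropic/claude-3-5-sonnet-20240620"
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```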
LLM_Sizing_Guide
A calculator to estimate LLM memory footprint, capacity, and latency on NVIDIA, AMD, and Intel hardware.
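The kind of arithmetic such a calculator performs can be sketched in a few lines; the formulas below are standard back-of-envelope estimates, not the repository's exact method:

```python
def weights_gb(n_params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Weight memory in GB: parameter count times bytes per parameter."""
    return n_params_billion * bytes_per_param

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache per sequence: 2 (K and V) * layers * kv_heads * head_dim * tokens."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 1e9

print(weights_gb(70))                 # ~140 GB of fp16 weights for a 70B model
print(kv_cache_gb(80, 8, 128, 8192))  # ~2.7 GB KV cache for one 8K-token sequence
```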
lmcache-vllm
The driver for LMCache core to run in vLLM
Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
roxl
NVIDIA Inference Xfer Library (NIXL)
skypilot
SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
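A minimal sketch of launching a GPU job with SkyPilot's Python API (the CLI equivalent is `sky launch`), assuming the documented `sky.Task`/`sky.launch` interface:

```python
import sky

# Describe the job: setup runs once per node, run is the entry point.
task = sky.Task(
    setup="pip install torch",
    run="python -c 'import torch; print(torch.cuda.is_available())'",
)
task.set_resources(sky.Resources(accelerators="A100:1"))

# SkyPilot provisions the cheapest available infra that satisfies the resources.
sky.launch(task, cluster_name="demo")
```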
Star-Attention
Efficient LLM Inference over Long Sequences
torchac_rocm
ROCm Implementation of torchac_cuda from LMCache