ZZK (MARD1NO)

Company: SiliconFlow

Location: Neverland

Home Page: https://mard1no.github.io/

ZZK's repositories

TiledCUDA

TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.

Language: C++ · License: MIT · Stargazers: 2 · Issues: 0

Nanoflow

A throughput-oriented high-performance serving framework for LLMs

Language: Cuda · License: Apache-2.0 · Stargazers: 1 · Issues: 0

nvbench

CUDA Kernel Benchmarking Library

Language: Cuda · License: Apache-2.0 · Stargazers: 1 · Issues: 0

fast-hadamard-transform

Fast Hadamard transform in CUDA, with a PyTorch interface

Language: C · License: BSD-3-Clause · Stargazers: 0 · Issues: 0
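
As a quick illustration of the PyTorch interface, here is a hedged usage sketch; the import path and the scale argument follow my reading of the project's README and should be treated as assumptions:

```python
import torch
# Assumption: the package exposes a single hadamard_transform(x, scale) entry point.
from fast_hadamard_transform import hadamard_transform

n = 1024
x = torch.randn(4, n, device="cuda", dtype=torch.float16)

# With scale = 1/sqrt(n) the Walsh-Hadamard transform is orthonormal,
# so applying it twice recovers the input.
y = hadamard_transform(x, scale=n ** -0.5)
x_back = hadamard_transform(y, scale=n ** -0.5)
assert torch.allclose(x_back, x, atol=1e-2)
```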

flux

A fast communication-overlapping library for tensor parallelism on GPUs.

Language: C++ · License: Apache-2.0 · Stargazers: 0 · Issues: 0

ktransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

Language: Python · License: Apache-2.0 · Stargazers: 0 · Issues: 0

kvikio

KvikIO - High Performance File IO

Language: C++ · License: Apache-2.0 · Stargazers: 0 · Issues: 0
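
A hedged sketch of what "high performance file IO" means here in practice: KvikIO's Python bindings expose a CuFile-style handle that moves data directly between disk and GPU memory (via GPUDirect Storage where available). The class name follows KvikIO's documented Python API as I recall it; the path and sizes are made up.

```python
import cupy
import kvikio

# Write a CuPy array straight from GPU memory to disk, then read it
# back into a fresh GPU buffer; KvikIO uses GPUDirect Storage when the
# system supports it and falls back to a bounce buffer otherwise.
a = cupy.arange(1 << 20, dtype=cupy.float32)
with kvikio.CuFile("/tmp/data.bin", "w") as f:
    f.write(a)

b = cupy.empty_like(a)
with kvikio.CuFile("/tmp/data.bin", "r") as f:
    f.read(b)
assert bool((a == b).all())
```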

LLM101n

LLM101n: Let's build a Storyteller

Stargazers: 0 · Issues: 0

marlin

FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups at batch sizes of up to 16-32 tokens.

Language: Python · License: Apache-2.0 · Stargazers: 0 · Issues: 0
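
To make the FP16xINT4 idea concrete, here is a plain-PyTorch reference of the computation such a weight-only kernel fuses: group-wise dequantization of 4-bit weights followed by a matmul. This is an illustrative sketch, not marlin's API; the real kernel never materializes the dequantized weight matrix, which is where the speedup comes from.

```python
import torch

def int4_matmul_reference(x, w_q, scales, group_size=128):
    """Reference semantics for an FP16xINT4 weight-only matmul:
    dequantize 4-bit weights per group, then multiply. A fused kernel
    such as marlin's does this inside the GEMM without ever writing
    the full-precision weights to memory."""
    # w_q: (in_features, out_features), integer values in [0, 15]
    # scales: (in_features // group_size, out_features)
    w = w_q.float() - 8.0                       # recenter to [-8, 7]
    w = w.view(-1, group_size, w_q.shape[1])
    w = (w * scales.unsqueeze(1)).reshape(w_q.shape)
    return x @ w

x = torch.randn(16, 4096)                       # batch of 16 tokens
w_q = torch.randint(0, 16, (4096, 11008))
scales = torch.rand(4096 // 128, 11008) * 0.01
y = int4_matmul_reference(x, w_q, scales)       # (16, 11008)
```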

MInference

To speed up long-context LLM inference, MInference computes attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.

Language: Python · License: MIT · Stargazers: 0 · Issues: 0
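
The description compresses a lot: the core observation is that for long prompts, each query block only needs a small, input-dependent subset of key blocks, so prefill attention can be computed sparsely. The toy below illustrates that idea with a crude top-k block selection; it is a sketch of the general technique, not MInference's actual pattern-search algorithm or kernels.

```python
import torch
import torch.nn.functional as F

def topk_block_sparse_attention(q, k, v, block=64, keep=4):
    """Toy dynamic sparse attention: rank key blocks with mean-pooled
    query/key summaries, keep only the top-k blocks per query block,
    and run dense attention inside the kept blocks."""
    T, d = q.shape
    nb = T // block
    q_sum = q.view(nb, block, d).mean(1)          # (nb, d) block summaries
    k_sum = k.view(nb, block, d).mean(1)
    kept = (q_sum @ k_sum.T).topk(keep, dim=-1).indices  # (nb, keep)
    out = torch.zeros_like(q)
    for i in range(nb):
        cols = torch.cat([torch.arange(b * block, (b + 1) * block)
                          for b in kept[i].tolist()])
        rows = slice(i * block, (i + 1) * block)
        scores = q[rows] @ k[cols].T / d ** 0.5
        out[rows] = F.softmax(scores, dim=-1) @ v[cols]
    return out

q, k, v = (torch.randn(1024, 64) for _ in range(3))
y = topk_block_sparse_attention(q, k, v)          # attends to 4 of 16 blocks
```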

mscclpp

MSCCL++: A GPU-driven communication stack for scalable AI applications

Language: C++ · License: MIT · Stargazers: 0 · Issues: 0

nvmath-python

NVIDIA Math Libraries for the Python Ecosystem

Language: Cython · License: Apache-2.0 · Stargazers: 0 · Issues: 0

one-api

OpenAI API management & redistribution system supporting Azure, Anthropic Claude, Google PaLM 2 & Gemini, Zhipu ChatGLM, Baidu ERNIE Bot, iFlytek Spark, Alibaba Tongyi Qianwen, 360 Zhinao, and Tencent Hunyuan. Useful for redistributing and managing keys; ships as a single executable with a prebuilt Docker image for one-click, out-of-the-box deployment. Uses a single API for all LLMs and features an English UI.

Language: JavaScript · License: MIT · Stargazers: 0 · Issues: 0

QuaRot

Code for QuaRot, an end-to-end 4-bit inference scheme for large language models.

Language: Python · License: Apache-2.0 · Stargazers: 0 · Issues: 0
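
The one-line identity behind rotation-based 4-bit schemes like QuaRot (and SpinQuant below) is worth spelling out: for an orthogonal Q, (xQ)(QᵀW) = xW, so a rotation can be folded into the weights at no cost while spreading activation outliers across channels, which is exactly what low-bit quantization needs. A minimal numeric check in plain PyTorch (matrix sizes are arbitrary):

```python
import torch

d = 256
x = torch.randn(8, d)
W = torch.randn(d, d)
Q, _ = torch.linalg.qr(torch.randn(d, d))    # random orthogonal matrix

# Rotation invariance: folding Q into the activations and Q^T into the
# weights leaves the layer output unchanged.
assert torch.allclose((x @ Q) @ (Q.T @ W), x @ W, atol=1e-3)

# Outlier flattening: rotate an activation with one huge channel and the
# max/std ratio collapses, which is what makes 4-bit quantization viable.
x[:, 0] *= 50.0
print("before:", (x.abs().max() / x.std()).item())
print("after: ", ((x @ Q).abs().max() / (x @ Q).std()).item())
```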

sarathi-serve

A low-latency & high-throughput serving engine for LLMs

Language: Python · License: Apache-2.0 · Stargazers: 0 · Issues: 0

SpeculativeDecodingPapers

📰 Must-read papers and blogs on Speculative Decoding ⚡️

License: Apache-2.0 · Stargazers: 0 · Issues: 0

SpinQuant

Code repo for the paper "SpinQuant: LLM Quantization with Learned Rotations".

Language: Python · License: NOASSERTION · Stargazers: 0 · Issues: 0

TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

Language: C++ · License: Apache-2.0 · Stargazers: 0 · Issues: 0
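
For a flavor of the "easy-to-use Python API" described above, here is a hedged sketch based on the high-level LLM interface in recent TensorRT-LLM releases; exact class names and arguments vary by version, and the checkpoint is just an example.

```python
# Hedged sketch of the high-level Python API; names follow recent
# TensorRT-LLM releases but may differ by version.
from tensorrt_llm import LLM, SamplingParams

# Builds (or reuses a cached) TensorRT engine for the checkpoint, then
# serves optimized inference on NVIDIA GPUs.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(max_tokens=64, temperature=0.8)

for out in llm.generate(["Explain tensor parallelism in one sentence."], params):
    print(out.outputs[0].text)
```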

TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.

Language: Python · License: NOASSERTION · Stargazers: 0 · Issues: 0

triton-linalg

Development repository for the Triton-Linalg conversion

Language: C++ · License: Apache-2.0 · Stargazers: 0 · Issues: 0

unsloth

Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory

Language: Python · License: Apache-2.0 · Stargazers: 0 · Issues: 0
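
A hedged sketch of the usual entry point, following unsloth's documented FastLanguageModel flow; the checkpoint name and LoRA hyperparameters are illustrative. The memory savings come largely from 4-bit base weights plus training only small LoRA adapters.

```python
from unsloth import FastLanguageModel

# Load a 4-bit quantized base model (illustrative checkpoint name).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these low-rank matrices are trained,
# which is where most of the memory saving comes from.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)
```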

vattention

Dynamic Memory Management for Serving LLMs without PagedAttention

Language: C · License: MIT · Stargazers: 0 · Issues: 0

vidur

A large-scale simulation framework for LLM inference

Language: Python · License: MIT · Stargazers: 0 · Issues: 0