ZZK (MARD1NO)

Company: SiliconFlow

Location: Neverland

Home Page: https://mard1no.github.io/

ZZK's repositories

nvbench

CUDA Kernel Benchmarking Library

Language: Cuda · License: Apache-2.0 · Stargazers: 1 · Issues: 0

BitBLAS

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.

Language: Python · License: MIT · Stargazers: 0 · Issues: 0
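
A minimal sketch of the mixed-precision GEMM path, following the configuration-style API shown in the BitBLAS README (FP16 activations times INT4 weights; the exact field names are taken from that README and may drift between releases):

```python
import torch
import bitblas

# describe the mixed-precision GEMM: FP16 activations x INT4 weights
matmul_config = bitblas.MatmulConfig(
    M=1, N=1024, K=1024,
    A_dtype="float16",
    W_dtype="int4",
    accum_dtype="float16",
    out_dtype="float16",
)
matmul = bitblas.Matmul(config=matmul_config)

activation = torch.rand((1, 1024), dtype=torch.float16).cuda()
weight = torch.randint(0, 7, (1024, 1024), dtype=torch.int8).cuda()

# pack the int8-stored weights into BitBLAS's int4 layout, then multiply
packed_weight = matmul.transform_weight(weight)
output = matmul(activation, packed_weight)
```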

EETQ

Easy and Efficient Quantization for Transformers

Language: C++ · Stargazers: 0 · Issues: 0
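
EETQ is also wired into Hugging Face Transformers; a hedged sketch of that integration path (the `EetqConfig` class exists in recent Transformers releases; the checkpoint name is only a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, EetqConfig

# int8 weight-only quantization applied at load time
quantization_config = EetqConfig("int8")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # placeholder causal-LM checkpoint
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Hello, EETQ!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```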

fast-hadamard-transform

Fast Hadamard transform in CUDA, with a PyTorch interface

Language: C · License: BSD-3-Clause · Stargazers: 0 · Issues: 0
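
A minimal usage sketch, assuming the `hadamard_transform(x, scale)` entry point from the project README (the trailing dimension must be a supported power of two):

```python
import torch
from fast_hadamard_transform import hadamard_transform

dim = 1024
x = torch.randn(8, dim, device="cuda", dtype=torch.float16)

# scale by 1/sqrt(dim) so the transform is orthonormal (norm-preserving)
scale = dim ** -0.5
y = hadamard_transform(x, scale=scale)

# an orthonormal involution: applying it twice recovers the input
x_back = hadamard_transform(y, scale=scale)
assert torch.allclose(x, x_back, atol=1e-2)   # up to fp16 rounding
```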

faster-nougat

An implementation of Nougat that focuses on processing PDFs locally.

Language: Python · Stargazers: 0 · Issues: 0

flux

A fast communication-overlapping library for tensor parallelism on GPUs.

Language: C++ · License: Apache-2.0 · Stargazers: 0 · Issues: 0

kvikio

KvikIO - High Performance File IO

Language: C++ · License: Apache-2.0 · Stargazers: 0 · Issues: 0
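
A short sketch mirroring the KvikIO Python docs: read and write device memory directly, using GPUDirect Storage when available and a POSIX fallback otherwise:

```python
import cupy
import kvikio

a = cupy.arange(100)

# write a CuPy array straight from device memory to disk
f = kvikio.CuFile("/tmp/kvikio-demo", "w")
f.write(a)
f.close()

# read it back into preallocated device memory
b = cupy.empty_like(a)
f = kvikio.CuFile("/tmp/kvikio-demo", "r")
f.read(b)
f.close()

assert (a == b).all()
```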

lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.

Language: Python · License: Apache-2.0 · Stargazers: 0 · Issues: 0
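
The entry point is `thunder.jit`, per the project README; a minimal sketch:

```python
import torch
import thunder

model = torch.nn.Sequential(
    torch.nn.Linear(2048, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 64),
)

# thunder.jit traces the module into Thunder's IR and hands the trace to
# registered executors (nvFuser, torch.compile, ...) for fused execution
thunder_model = thunder.jit(model)

x = torch.randn(64, 2048)
y = thunder_model(x)   # matches model(x) up to fusion rounding
```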

LLM101n

LLM101n: Let's build a Storyteller

Stargazers: 0 · Issues: 0

MInference

To speed up long-context LLM inference, MInference computes attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.

Language: Python · License: MIT · Stargazers: 0 · Issues: 0
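
A toy illustration of the pattern the description compresses (this is the generic block-sparse idea, not MInference's actual API): rank key blocks cheaply from pooled queries and keys, then run exact attention only inside the top-scoring blocks:

```python
import torch

def block_sparse_attention(q, k, v, block=64, keep_ratio=0.25):
    T, d = q.shape
    nb = T // block
    # cheap approximation: rank key blocks by pooled q.k similarity
    q_pool = q.reshape(nb, block, d).mean(1)       # (nb, d)
    k_pool = k.reshape(nb, block, d).mean(1)       # (nb, d)
    top = (q_pool @ k_pool.T).topk(max(1, int(nb * keep_ratio)), dim=-1).indices

    out = torch.zeros_like(q)
    for i in range(nb):                            # exact attention per query block,
        cols = torch.cat([torch.arange(j * block, (j + 1) * block)
                          for j in top[i].tolist()])
        qi = q[i * block:(i + 1) * block]          # restricted to selected key blocks
        attn = torch.softmax(qi @ k[cols].T / d ** 0.5, dim=-1)
        out[i * block:(i + 1) * block] = attn @ v[cols]
    return out

q = k = v = torch.randn(1024, 64)
print(block_sparse_attention(q, k, v).shape)       # torch.Size([1024, 64])
```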

nvmath-python

NVIDIA Math Libraries for the Python Ecosystem

Language: Cython · License: Apache-2.0 · Stargazers: 0 · Issues: 0
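
A hedged sketch of the stateless matmul entry point (`nvmath.linalg.advanced.matmul`, which dispatches to cuBLASLt; the module path follows the nvmath-python docs as I recall them and may differ between releases):

```python
import cupy as cp
import nvmath

a = cp.random.rand(1024, 1024, dtype=cp.float32)
b = cp.random.rand(1024, 1024, dtype=cp.float32)

# stateless API: one call, the library manages planning and execution
c = nvmath.linalg.advanced.matmul(a, b)

cp.testing.assert_allclose(c, a @ b, rtol=1e-4)
```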

one-api

OpenAI API management & distribution system supporting Azure, Anthropic Claude, Google PaLM 2 & Gemini, Zhipu ChatGLM, Baidu ERNIE Bot, iFlytek Spark, Alibaba Tongyi Qianwen, 360 Zhinao, and Tencent Hunyuan. It redistributes and manages keys behind a single API for all LLMs, ships as a single executable with a prebuilt Docker image for one-click deployment, works out of the box, and features an English UI.

Language: JavaScript · License: MIT · Stargazers: 0 · Issues: 0
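
Because one-api exposes an OpenAI-compatible endpoint (port 3000 by default), any OpenAI SDK can talk to it; a sketch with the official Python client (base URL, token, and model name are deployment-specific placeholders):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",   # one-api gateway, not api.openai.com
    api_key="sk-your-one-api-token",       # token issued from the one-api console
)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # routed to whichever upstream channel serves this model
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```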

QuaRot

Code for QuaRot, an end-to-end 4-bit inference scheme for large language models.

Language: Python · License: Apache-2.0 · Stargazers: 0 · Issues: 0
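
The core trick (shared with SpinQuant below) is that an orthogonal rotation can be folded into a linear layer without changing its output, while flattening activation outliers so 4-bit quantization loses less information. A small self-contained demonstration:

```python
import torch

torch.manual_seed(0)
d = 256
x = torch.randn(8, d)
x[:, 0] *= 50.0                      # plant an outlier channel, as in LLM activations
W = torch.randn(d, d)

# random orthogonal Q via QR (QuaRot itself uses Hadamard rotations)
Q, _ = torch.linalg.qr(torch.randn(d, d))

# rotating activations and pre-rotating weights is exact: (xQ)(Q^T W) = xW
torch.testing.assert_close(x @ W, (x @ Q) @ (Q.T @ W), atol=1e-3, rtol=1e-4)

# but the rotated activations have a much smaller dynamic range
print(x.abs().max().item(), (x @ Q).abs().max().item())
```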

sarathi-serve

A low-latency & high-throughput serving engine for LLMs

Language: Python · License: Apache-2.0 · Stargazers: 0 · Issues: 0

SpeculativeDecodingPapers

📰 Must-read papers and blogs on Speculative Decoding ⚡️

License: Apache-2.0 · Stargazers: 0 · Issues: 0
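
For orientation, a greedy-verification sketch of the technique these papers study (the full method verifies with rejection sampling; `draft` and `target` here are hypothetical callables returning per-position logits):

```python
import torch

def speculative_decode(draft, target, prefix, k=4, steps=4):
    seq = list(prefix)
    for _ in range(steps):
        # 1) the small draft model proposes k tokens autoregressively (cheap)
        proposal = list(seq)
        for _ in range(k):
            proposal.append(int(draft(proposal)[-1].argmax()))
        # 2) the large target model scores every position in ONE forward pass
        logits = target(proposal)
        accepted = 0
        for i in range(k):
            pos = len(seq) + i - 1               # logits[pos] predicts token pos+1
            if int(logits[pos].argmax()) != proposal[pos + 1]:
                break
            accepted += 1
        seq = proposal[:len(seq) + accepted]
        # 3) after a rejection (or full acceptance), take the target's own token
        #    (a real implementation reuses the logits from step 2)
        seq.append(int(target(seq)[-1].argmax()))
    return seq

# toy demo with identical draft/target, so every proposal is accepted
emb = torch.randn(32, 32)
def toy(seq):
    return emb[torch.tensor(seq)]                # (len(seq), vocab) fake logits
print(speculative_decode(toy, toy, prefix=[0]))
```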

SpinQuant

Code repo for the paper "SpinQuant: LLM quantization with learned rotations".

Language: Python · License: NOASSERTION · Stargazers: 0 · Issues: 0

TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.

Language: Python · License: NOASSERTION · Stargazers: 0 · Issues: 0
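
A hedged sketch of the post-training quantization flow, assuming the `modelopt.torch.quantization` API from the project docs (random tensors stand in for real calibration data):

```python
import torch
import modelopt.torch.quantization as mtq

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512)
).cuda()

def forward_loop(model):
    # calibration: run representative batches through the model so the
    # inserted quantizers can collect activation statistics
    for _ in range(8):
        model(torch.randn(4, 512, device="cuda"))

# swap supported layers for quantized versions and calibrate them (INT8 here)
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```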

ThunderKittens

Tile primitives for speedy kernels

Language: Cuda · License: MIT · Stargazers: 0 · Issues: 0

tiny-gpu

A minimal GPU design in Verilog to learn how GPUs work from the ground up

Language: SystemVerilog · Stargazers: 0 · Issues: 0

triton-linalg

Development repository for the Triton-Linalg conversion

Language: C++ · License: Apache-2.0 · Stargazers: 0 · Issues: 0

unsloth

Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory

Language: Python · License: Apache-2.0 · Stargazers: 0 · Issues: 0
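
A sketch of the usual entry point per the Unsloth README (checkpoint name and LoRA hyperparameters are illustrative):

```python
from unsloth import FastLanguageModel

# load a prequantized 4-bit base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# attach LoRA adapters; the patched model then works with e.g. trl's SFTTrainer
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```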

vidur

A large-scale simulation framework for LLM inference

Language: Python · License: MIT · Stargazers: 0 · Issues: 0