llx's starred repositories
splitwise-sim
LLM serving cluster simulator
vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
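For context, the PagedAttention-style baseline that vAttention replaces manages the KV cache as fixed-size physical blocks indexed through a per-sequence block table. A minimal Python sketch of that baseline (all names hypothetical, not vAttention's code):

```python
# Sketch of a PagedAttention-style block table: the baseline that
# vAttention replaces with dynamic virtual-memory management.
# All class and method names here are hypothetical illustrations.

class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block pool
        self.block_tables = {}                      # seq_id -> [block ids]
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:           # current block is full
            table.append(self.free_blocks.pop())    # grab a fresh block
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):
    block, offset = cache.append_token(seq_id=0)
print(cache.block_tables[0])  # two blocks allocated for 6 tokens
```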
FlashAttention-PyTorch
Implementation of FlashAttention in PyTorch
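The heart of FlashAttention is a blocked pass over K/V with an online softmax, so the full attention matrix is never materialized. A slow but numerically faithful PyTorch sketch of that recurrence (illustrative only, not this repo's code):

```python
import torch

def flash_attention_ref(q, k, v, block=64):
    """Blocked attention with an online softmax; numerically equal to
    softmax(q @ k.T / sqrt(d)) @ v without materializing the full matrix."""
    scale = q.shape[-1] ** -0.5
    n = q.shape[0]
    out = torch.zeros_like(q)
    m = torch.full((n, 1), float("-inf"))  # running row max
    l = torch.zeros(n, 1)                  # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T * scale               # scores for this KV block
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)           # block probabilities, rescaled
        alpha = torch.exp(m - m_new)       # correction for old partial sums
        l = alpha * l + p.sum(dim=-1, keepdim=True)
        out = alpha * out + p @ vb
        m = m_new
    return out / l

q, k, v = (torch.randn(128, 32) for _ in range(3))
ref = torch.softmax(q @ k.T * 32 ** -0.5, dim=-1) @ v
assert torch.allclose(flash_attention_ref(q, k, v), ref, atol=1e-5)
```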
KuiperLLama
A hands-on implementation of an LLM inference framework, built from scratch
ServerlessLLM
Cost-efficient and fast multi-LLM serving.
Triton-Puzzles
Puzzles for learning Triton
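Puzzles in this style usually start from 1-D block indexing. As a taste, here is the canonical Triton vector-add kernel (the standard tutorial pattern, not a puzzle solution; needs a CUDA GPU to run):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized slice of the vectors.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n                        # guard the ragged final block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
    return out

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```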
how-to-optim-algorithm-in-cuda
How to optimize common algorithms in CUDA.
CUDA-Learn-Notes
🎉 CUDA/C++ notes / hand-written CUDA kernels for LLMs / tech blog, updated sporadically: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
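When hand-writing kernels like these, a small host-side reference implementation is handy for correctness checks. A PyTorch sketch of two of the listed ops, softmax and rmsnorm, under their usual definitions:

```python
import torch

def softmax_ref(x: torch.Tensor) -> torch.Tensor:
    # Subtract the row max so exp() cannot overflow; this is the same
    # trick a warp/block-reduce kernel implements with two passes.
    x = x - x.max(dim=-1, keepdim=True).values
    e = torch.exp(x)
    return e / e.sum(dim=-1, keepdim=True)

def rmsnorm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # x / sqrt(mean(x^2) + eps) * weight, as used in LLaMA-style models.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps) * weight

x = torch.randn(4, 512)
assert torch.allclose(softmax_ref(x), torch.softmax(x, dim=-1), atol=1e-6)
print(rmsnorm_ref(x, torch.ones(512)).shape)  # torch.Size([4, 512])
```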
MInference
Speeds up long-context LLM inference with approximate, dynamic sparse attention, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
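The idea in miniature: for each query block, pick the few KV blocks that matter and attend only to those. A toy top-k block-sparse attention in PyTorch (a stand-in illustration, not MInference's actual pattern search):

```python
import torch

def topk_block_sparse_attention(q, k, v, block=16, keep=4):
    """Toy dynamic sparse attention: score KV blocks by mean-pooled
    similarity, keep the top `keep` blocks per query block, mask the rest."""
    n, d = q.shape
    nb = n // block
    # Block-level importance scores from mean-pooled queries and keys.
    qb = q.reshape(nb, block, d).mean(dim=1)           # (nb, d)
    kb = k.reshape(nb, block, d).mean(dim=1)           # (nb, d)
    keep_idx = (qb @ kb.T).topk(keep, dim=-1).indices  # (nb, keep)
    # Expand the block decision into a token-level attention mask.
    block_mask = torch.zeros(nb, nb, dtype=torch.bool)
    block_mask[torch.arange(nb).unsqueeze(1), keep_idx] = True
    mask = block_mask.repeat_interleave(block, 0).repeat_interleave(block, 1)
    scores = (q @ k.T) * d ** -0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q, k, v = (torch.randn(256, 64) for _ in range(3))
print(topk_block_sparse_attention(q, k, v).shape)  # torch.Size([256, 64])
```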
googletest
GoogleTest - Google Testing and Mocking Framework
ring-flash-attention
Ring attention implementation with flash attention
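The key ingredient ring attention borrows from flash attention is the per-row log-sum-exp, which lets partial outputs from different K/V shards be merged exactly. A single-process sketch of that merge rule (hypothetical helper names, simulating a 2-device ring):

```python
import torch

def partial_attn(q, k, v):
    """Attention over one K/V shard, returning the output and the
    per-row log-sum-exp needed to merge shards later."""
    s = (q @ k.T) * q.shape[-1] ** -0.5
    lse = torch.logsumexp(s, dim=-1, keepdim=True)
    return torch.exp(s - lse) @ v, lse

def merge(o1, lse1, o2, lse2):
    # Combine two partial results exactly, as if computed over the union.
    lse = torch.logaddexp(lse1, lse2)
    return torch.exp(lse1 - lse) * o1 + torch.exp(lse2 - lse) * o2, lse

q = torch.randn(128, 32)
k, v = torch.randn(256, 32), torch.randn(256, 32)
# Simulate a 2-device ring: each "device" holds half of K and V.
o_a, lse_a = partial_attn(q, k[:128], v[:128])
o_b, lse_b = partial_attn(q, k[128:], v[128:])
o, _ = merge(o_a, lse_a, o_b, lse_b)
ref = torch.softmax(q @ k.T * 32 ** -0.5, dim=-1) @ v
assert torch.allclose(o, ref, atol=1e-5)
```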
AI-Software-Startups
A Survey of AI startups
ParrotServe
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
Awesome-RoadMaps-and-Interviews
Awesome interviews for coders: programming languages, software engineering, web, backend, distributed infrastructure, data science & AI | interview essentials
Nsight-Compute-Docker-Image
Nsight Compute in Docker