OswaldHe

Oswald (Zifan) He's starred repositories

qserve

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Language: Python | Apache-2.0 | 35500

SET-ISCA2023

The framework for the paper "Inter-layer Scheduling Space Definition and Exploration for Tiled Accelerators" in ISCA 2023.

Language: C++ | 4000

TAPA-CS

Language: Ada | MIT | 700

HMT-pytorch

Official Implementation of "HMT: Hierarchical Memory Transformer for Long Context Language Processing"

Language: Python | Apache-2.0 | 5300

mlirPyoclExec

Enabling OpenCL in MLIR via Python

300

pykan

Kolmogorov Arnold Networks

Language: Jupyter Notebook | MIT | 1392300

brevitas

Brevitas: neural network quantization in PyTorch

Language: Python | NOASSERTION | 113500

recut

Large-scale medical image processing and reconstruction toolbox

Language: C++ | MIT | 1800

allo

Allo: A Programming Model for Composable Accelerator Design

Language: Python | Apache-2.0 | 11000

JetMoE

Reaching LLaMA2 Performance with 0.1M Dollars

Language: Python | Apache-2.0 | 94700

unlimiformer

Public repo for the NeurIPS 2023 paper "Unlimiformer: Long-Range Transformers with Unlimited Length Input"

Language: Python | MIT | 104500

LevelST

[FPGA 2024] Source code and bitstream for LevelST: Stream-based Accelerator for Sparse Triangular Solver

Language: Tcl | MIT | 800

SSR

SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration (Full Paper Accepted in FPGA'24)

Language: C | 2100

CHARM

CHARM: Composing Heterogeneous Accelerators on Versal ACAP Architecture

Language: C++ | MIT | 11500

LM-RMT

Recurrent Memory Transformer

Language: Python | Apache-2.0 | 14300

SqueezeLLM

[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization

Language: Python | MIT | 60800

sparsegpt

Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".

Language: Python | Apache-2.0 | 66700

llama2.cpp

Inference Llama 2 in one file of pure C++

Language: Python | MIT | 7200

FlexCNN

Language: C++ | BSD-3-Clause | 6300

llama

Inference code for Llama models

Language: Python | NOASSERTION | 5468200

llm-awq

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Language: Python | MIT | 217600

flash-attention

Fast and memory-efficient exact attention

Language: Python | BSD-3-Clause | 1263000

vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Language: Python | Apache-2.0 | 2376100

mosaic

Language: C++ | NOASSERTION | 1200

pasta

[FCCM 2023] PASTA: Programming and Automation Support for Scalable Task-Parallel HLS Programs on Modern Multi-Die FPGAs

Language: C | 900

LightningSim

A fast, accurate trace-based simulator for High-Level Synthesis.

AGPL-3.0 | 3100

YuenyeungSpTRSV

A Thread-Level and Warp-Level Fusion Synchronization-Free Sparse Triangular Solve on GPUs

Language: C | MIT | 600

Callipepla

Large-scale sparse Conjugate Gradient (CG) solvers on High Bandwidth Memory (HBM) FPGAs

Language: C++ | MIT | 700

Serpens

Serpens is an HBM FPGA accelerator for SpMV

Language: Tcl | MIT | 1100

tapa

TAPA is a dataflow HLS framework that features fast compilation, expressive programming model and generates high-frequency FPGA accelerators.

Language: C++ | MIT | 14400