butterluo's repositories

autogen

A programming framework for agentic AI. Discord: https://aka.ms/autogen-dc. Roadmap: https://aka.ms/autogen-roadmap

License: CC-BY-4.0 | 000

ByteTransformer

Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052

Language: C++ | License: Apache-2.0 | 010

CoFiPruning

ACL'22: Structured Pruning Learns Compact and Accurate Models https://arxiv.org/abs/2204.00408

Language: Python | License: MIT | 010

cuda-samples

Samples for CUDA developers that demonstrate features in the CUDA Toolkit

Language: C | License: NOASSERTION | 010

cutlass

CUDA Templates for Linear Algebra Subroutines

Language: C++ | License: NOASSERTION | 020

DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.

Language: C++ | License: Apache-2.0 | 010

DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.

Language: Python | License: Apache-2.0 | 010

FasterTransformer

Transformer-related optimization, including BERT and GPT

Language: C++ | License: Apache-2.0 | 010

flash-attention

Fast and memory-efficient exact attention

Language: Python | License: BSD-3-Clause | 010

gpgpu-sim_distribution

GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as TensorCores and CUDA Dynamic Parallelism as well as a performance visualization tool, AerialVision, and an integrated energy model, GPUWattch.

Language: C++ | License: NOASSERTION | 010

hfai-models

HFAI deep learning models

Language: Python | License: MIT | 010

How_to_optimize_in_GPU

A series of GPU optimization topics that explains in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including elementwise, reduce, sgemv, sgemm, and others. The performance of these kernels is at or near the theoretical limit.

Language: Cuda | License: Apache-2.0 | 010

HugeCTR

HugeCTR is a high-efficiency GPU framework designed for training Click-Through-Rate (CTR) estimation models

Language: C++ | License: Apache-2.0 | 010

kubernetes

Production-Grade Container Scheduling and Management

Language: Go | License: Apache-2.0 | 010

lightseq

LightSeq: A High Performance Library for Sequence Processing and Generation

Language: C++ | License: NOASSERTION | 010

Merlin

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.

Language: Python | License: Apache-2.0 | 020

MNN

MNN is a blazing fast, lightweight deep learning framework, battle-tested by business-critical use cases in Alibaba

Language: C++ | 010

nccl

Optimized primitives for collective multi-GPU communication

Language: C++ | License: NOASSERTION | 010

nccl-fastsocket

NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.

Language: C++ | License: NOASSERTION | 010

NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

Language: Python | License: Apache-2.0 | 010

oneflow

OneFlow is a performance-centered and open-source deep learning framework.

Language: C++ | License: Apache-2.0 | 010

Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.

Language: Python | License: Apache-2.0 | 010

pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Language: Python | License: NOASSERTION | 010

spark

Mirror of Apache Spark

Language: Scala | License: Apache-2.0 | 030

tensorflow

An Open Source Machine Learning Framework for Everyone

Language: C++ | License: Apache-2.0 | 010

TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

Language: C++ | License: Apache-2.0 | 000

torchrec

Transformers4Rec

tvm

YHs_Sample