butterluo's repositories

autogen

A programming framework for agentic AI. Discord: https://aka.ms/autogen-dc. Roadmap: https://aka.ms/autogen-roadmap

License: CC-BY-4.0 | Stargazers: 0 | Issues: 0 | Issues: 0

ByteTransformer

Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052

Language: C++ | License: Apache-2.0 | Stargazers: 0 | Issues: 1 | Issues: 0

CoFiPruning

ACL'22: Structured Pruning Learns Compact and Accurate Models https://arxiv.org/abs/2204.00408

Language: Python | License: MIT | Stargazers: 0 | Issues: 1 | Issues: 0

cuda-samples

Samples for CUDA developers that demonstrate features in the CUDA Toolkit

Language: C | License: NOASSERTION | Stargazers: 0 | Issues: 1 | Issues: 0

cutlass

CUDA Templates for Linear Algebra Subroutines

Language: C++ | License: NOASSERTION | Stargazers: 0 | Issues: 2 | Issues: 0

DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.

Language: C++ | License: Apache-2.0 | Stargazers: 0 | Issues: 1 | Issues: 0

DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.

Language: Python | License: Apache-2.0 | Stargazers: 0 | Issues: 1 | Issues: 0

FasterTransformer

Transformer-related optimization, including BERT and GPT

Language: C++ | License: Apache-2.0 | Stargazers: 0 | Issues: 1 | Issues: 0

flash-attention

Fast and memory-efficient exact attention

Language: Python | License: BSD-3-Clause | Stargazers: 0 | Issues: 1 | Issues: 0

gpgpu-sim_distribution

GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as TensorCores and CUDA Dynamic Parallelism, as well as a performance visualization tool, AerialVision, and an integrated energy model, GPUWattch.

Language: C++ | License: NOASSERTION | Stargazers: 0 | Issues: 1 | Issues: 0

hfai-models

HFAI deep learning models

Language: Python | License: MIT | Stargazers: 0 | Issues: 1 | Issues: 0

How_to_optimize_in_GPU

A series of GPU optimization topics that explains in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including elementwise, reduce, sgemv, and sgemm, among others. The performance of these kernels is at or near the theoretical limit.

Language: Cuda | License: Apache-2.0 | Stargazers: 0 | Issues: 1 | Issues: 0

HugeCTR

HugeCTR is a high-efficiency GPU framework designed for training Click-Through-Rate (CTR) estimation models

Language: C++ | License: Apache-2.0 | Stargazers: 0 | Issues: 1 | Issues: 0

kubernetes

Production-Grade Container Scheduling and Management

Language: Go | License: Apache-2.0 | Stargazers: 0 | Issues: 1 | Issues: 0

lightseq

LightSeq: A High Performance Library for Sequence Processing and Generation

Language: C++ | License: NOASSERTION | Stargazers: 0 | Issues: 1 | Issues: 0

Merlin

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.

Language: Python | License: Apache-2.0 | Stargazers: 0 | Issues: 2 | Issues: 0

MNN

MNN is a blazing fast, lightweight deep learning framework, battle-tested by business-critical use cases in Alibaba

Language: C++ | Stargazers: 0 | Issues: 1 | Issues: 0

nccl

Optimized primitives for collective multi-GPU communication

Language: C++ | License: NOASSERTION | Stargazers: 0 | Issues: 1 | Issues: 0

nccl-fastsocket

NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.

Language: C++ | License: NOASSERTION | Stargazers: 0 | Issues: 1 | Issues: 0

NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

Language: Python | License: Apache-2.0 | Stargazers: 0 | Issues: 1 | Issues: 0

oneflow

OneFlow is a performance-centered and open-source deep learning framework.

Language: C++ | License: Apache-2.0 | Stargazers: 0 | Issues: 1 | Issues: 0

Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.

Language: Python | License: Apache-2.0 | Stargazers: 0 | Issues: 1 | Issues: 0

pytorch

Tensors and dynamic neural networks in Python with strong GPU acceleration

Language: Python | License: NOASSERTION | Stargazers: 0 | Issues: 1 | Issues: 0

spark

Mirror of Apache Spark

Language: Scala | License: Apache-2.0 | Stargazers: 0 | Issues: 3 | Issues: 0

tensorflow

An Open Source Machine Learning Framework for Everyone

Language: C++ | License: Apache-2.0 | Stargazers: 0 | Issues: 1 | Issues: 0

TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

Language: C++ | License: Apache-2.0 | Stargazers: 0 | Issues: 0 | Issues: 0

torchrec

PyTorch domain library for recommendation systems

Language: Python | License: BSD-3-Clause | Stargazers: 0 | Issues: 1 | Issues: 0

Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation and works with PyTorch.

Language: Python | License: Apache-2.0 | Stargazers: 0 | Issues: 1 | Issues: 0

tvm

Open deep learning compiler stack for CPUs, GPUs, and specialized accelerators

Language: Python | License: Apache-2.0 | Stargazers: 0 | Issues: 1 | Issues: 0

YHs_Sample

Yinghan's Code Sample

Language: Cuda | License: GPL-3.0 | Stargazers: 0 | Issues: 1 | Issues: 0