zhouleidcc's repositories
CenterPoint_deploy
Export CenterPoint PointPillars ONNX Model For TensorRT
cuda-beginner-course-cpp-version
Companion code for the Bilibili video course "Introduction to CUDA 12.1 Parallel Programming (C++ Edition)"
cugraph
cuGraph - RAPIDS Graph Analytics Library
cutlass-b2bgemm
An extension to the CUTLASS half-precision B2B GEMM example
Cutlass_EX
A study of CUTLASS
cutlass_performance_profiling
Exploration of GEMM Performance Improvement with CUTLASS
google-research
Google Research
gpu-toolkit
🦚 🧰 Collection of basic GPU algorithms implemented in CUDA C++.
LKCompiler
A small compiler
llm.c
LLM training in simple, raw C/CUDA
mlir-hello
MLIR Sample dialect
mlir-tutorial_cn
Hands-On Practical MLIR Tutorial
muda
μ-Cuda, yet another painless CUDA programming paradigm, featuring an IntelliSense-friendly design, structured launch, and automatic CUDA graph generation and updating.
MV2D
Code for "Object as Query: Lifting any 2D Object Detector to 3D Detection"
onnx-modifier
A tool to modify ONNX models visually, based on Netron and Flask.
pymlir
Python interface for MLIR - the Multi-Level Intermediate Representation
resource-stream
CUDA-related news and material links
SHARK-Turbine
Unified compiler/runtime for interfacing with PyTorch Dynamo.
SST
Code for "Fully Sparse 3D Object Detection" & "Embracing Single Stride 3D Object Detector with Sparse Transformer"
TensorRT
NVIDIA® TensorRT™, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications.
TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
torch-xla-SPMD
PyTorch/XLA SPMD test code on Google TPU
torchsparse
[MLSys'22] TorchSparse: Efficient Point Cloud Inference Engine