umiswing's repositories
NiuTrans.NMT
A Fast Neural Machine Translation System. It is developed in C++ and relies on NiuTensor for fast tensor APIs.
DocumentSASS
Unofficial description of the CUDA assembly (SASS) instruction sets.
emacs-abyss-theme
A dark theme for Emacs
emacs-catppuccin
🍄 Soothing pastel theme for Emacs
Paddle
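PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle (飞桨) core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)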
flash-attention
Fast and memory-efficient exact attention
flux
A fast communication-overlapping library for tensor parallelism on GPUs.
How_to_optimize_in_GPU
A series of GPU optimization topics that explains in detail how to optimize programs on the GPU. It covers several basic kernel optimizations, including elementwise, reduce, sgemv, and sgemm. The performance of these kernels is at or near the theoretical limit.
maxas
Assembler for NVIDIA Maxwell architecture
NeMo
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
PaddleFlashattnTest
Additional tests of the flash attention API in Paddle
YHs_Sample
Yinghan's Code Sample