jundaf's starred repositories

ThunderKittens

Tile primitives for speedy kernels

Language: Cuda | License: MIT | Stargazers: 1396 | Issues: 0

MatmulTutorial

An Easy-to-understand TensorOp Matmul Tutorial

Language: C++ | License: Apache-2.0 | Stargazers: 228 | Issues: 0
Language: Cuda | License: MIT | Stargazers: 115 | Issues: 0

remote-dataloader

PyTorch DataLoader that runs on multiple remote machines for heavy data processing

Language: Python | License: MIT | Stargazers: 66 | Issues: 0

git-filter-repo

Quickly rewrite git repository history (filter-branch replacement)

Language: Python | License: NOASSERTION | Stargazers: 7830 | Issues: 0

opencv

Open Source Computer Vision Library

Language: C++ | License: Apache-2.0 | Stargazers: 77050 | Issues: 0

perftest

Infiniband Verbs Performance Tests

Language: C | License: NOASSERTION | Stargazers: 550 | Issues: 0

bbpe

BPE from byte, for real

Language: C++ | Stargazers: 1 | Issues: 0

nccl-rdma-sharp-plugins

RDMA and SHARP plugins for the NCCL library

Language: C | License: BSD-3-Clause | Stargazers: 149 | Issues: 0

minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.

Language: Python | License: MIT | Stargazers: 8766 | Issues: 0

depyf

depyf is a tool to help you understand and adapt to the PyTorch compiler, torch.compile.

Language: Python | License: MIT | Stargazers: 390 | Issues: 0

marlin

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

Language: Python | License: Apache-2.0 | Stargazers: 462 | Issues: 0

rfcs

PyTorch RFCs (experimental)

License: NOASSERTION | Stargazers: 114 | Issues: 0

python-bpe

Byte Pair Encoding for Python!

Language: Python | License: MIT | Stargazers: 221 | Issues: 0

dataloader-benchmarks

DL Dataloader Benchmarks

Language: Python | License: MIT | Stargazers: 18 | Issues: 0

tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Language: Rust | License: Apache-2.0 | Stargazers: 8722 | Issues: 0

CPU-Free-model

Source code for the CPU-Free model - a fully autonomous execution model for multi-GPU applications that completely excludes the involvement of the CPU beyond the initial kernel launch.

Language: Cuda | License: MIT | Stargazers: 15 | Issues: 0

multi-gpu-programming-models

Examples demonstrating available options to program multiple GPUs in a single node or a cluster

Language: Cuda | License: BSD-3-Clause | Stargazers: 490 | Issues: 0

tutorial-multi-gpu

Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial

Language: Cuda | License: MIT | Stargazers: 156 | Issues: 0

bcl

The Berkeley Container Library

Language: C++ | License: BSD-3-Clause | Stargazers: 117 | Issues: 0

gdrcopy

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology

Language: C++ | License: MIT | Stargazers: 821 | Issues: 0

monolith

ByteDance's Recommendation System

Language: Python | License: NOASSERTION | Stargazers: 838 | Issues: 0
Language: C++ | License: MIT | Stargazers: 71 | Issues: 0
Language: TypeScript | Stargazers: 7 | Issues: 0

cuda_scheduling_examiner_mirror

A tool for examining GPU scheduling behavior.

Language: Cuda | License: NOASSERTION | Stargazers: 63 | Issues: 0

awesome-courses

📚 List of awesome university courses for learning Computer Science!

Stargazers: 55592 | Issues: 0

NCCL

Examples of how to call collective operation functions in multi-GPU environments, including simple uses of the broadcast, reduce, allGather, reduceScatter and sendRecv operations.

Stargazers: 21 | Issues: 0

open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

Language: C | License: NOASSERTION | Stargazers: 14277 | Issues: 0

gpumembench

A GPU benchmark suite for assessing on-chip GPU memory bandwidth

Language: C++ | License: GPL-2.0 | Stargazers: 91 | Issues: 0

detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.

Language: Python | License: Apache-2.0 | Stargazers: 29452 | Issues: 0