Matt Jordan's repositories
geometric-certificates
Geometric Certifications of Neural Nets
minhash-rs
MinHashing done in Rust
Contrastive-Inversion
Using contrastive learning and OpenAI's CLIP to find good embeddings for images with lossy transformations
pytorch_unbg
Removes backgrounds for PyTorch settings
tokshuf-rust
Tokenize/Shuffle tooling written in Rust
bit-diffusion
Implementation of Bit Diffusion, Hinton's group's attempt at discrete denoising diffusion, in PyTorch
deduplicate-text-datasets
for decontamination
docshuffle-rs
Uses the local-cell mapper pattern to fully shuffle a collection of JSONL documents in Rust
fastargs
Python library for argument and configuration management
parquet-hf-rs
Converts zstd-compressed JSONL files to Parquet in Rust
ray
Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
reservoir-datastats-rs
Multithreaded reservoir sampling for doc-length (also counts tokens globally :D)
rust-exact-dedup
Exact deduplication in Rust, with an option to count presence
sa_decontamination
Suffix-array-based decontamination tools
swav-cifar100
PyTorch implementation of SwAV https://arxiv.org/abs/2006.09882
text-subsample-rs
Methods for subsampling text datasets (with emphasis on "duplicate aware subsampling")
token-counter-rs
Simple Rust utility to count tokens from tarfiles of contexts
wimbd
What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets