Marianna's repositories
urls2dataset
Convert webpage URLs to a dataset
doc2dataset
A tool to extract text (and images) from documents (like PDFs)
audio-dataset
Audio Dataset for training CLAP and other models
laion-datasets
Descriptions of and pointers to LAION datasets
openbioml-datasets
bio-datasets
tiktok_scraper
TikTok video URL scraper
any2dataset
Turn any collection of files into a dataset
audio2dataset
Easily turn large sets of audio URLs into an audio dataset. More info: TBD
audiolm-pytorch
Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in PyTorch
cc2dataset
Easily convert Common Crawl to a dataset of captions and documents: image/text, audio/text, video/text, ...
CLAP
Contrastive Language-Audio Pretraining
datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
heat
Distributed tensors and Machine Learning framework with GPU and MPI acceleration in Python
open_clip
An open source implementation of CLIP.
pii-data
Base data structures for PII management
pii-transform
Perform transformations on PII instances detected in documents
PointNeXt
[NeurIPS'22] PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies
riffusion
Stable diffusion for real-time music generation
transformers
🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
video2dataset
Easily create large video datasets from video URLs