A curated list of resources dedicated to Synthetic Data.
If you want to contribute to this list, read the contribution guidelines first. Please add your favorite synthetic data resource by raising a pull request.
A listed repository should be deprecated if either of the following holds:
- The repository's owner explicitly states that the library is not maintained.
- The repository has had no commits for a long time (2 to 3 years).
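The deprecation criteria above can be sketched as a small check. This is only an illustration, not part of the list's tooling: the function name `should_deprecate`, its parameters, and the choice of the 2-year lower bound as the cutoff are all assumptions made for the example.

```python
from datetime import date, timedelta
from typing import Optional

# Assumption: "a long time (2 to 3 years)" is read as the 2-year lower bound.
STALE_AFTER = timedelta(days=2 * 365)


def should_deprecate(
    last_commit: date,
    declared_unmaintained: bool,
    today: Optional[date] = None,
) -> bool:
    """Return True if a listed repository meets either deprecation criterion.

    `last_commit` and `declared_unmaintained` are hypothetical inputs that a
    maintainer would look up by hand (or via a hosting API) before calling.
    """
    today = today or date.today()
    is_stale = (today - last_commit) > STALE_AFTER
    return declared_unmaintained or is_stale
```

A maintainer reviewing the list could call `should_deprecate(date(2019, 1, 1), False)` for a repository whose last commit was in early 2019; tightening or loosening the cutoff only requires changing `STALE_AFTER`.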
- Research Summaries and Trends
- Tutorials
- Libraries
- Academic Papers
- Services
- Prominent Synthetic Data Research Labs
- Datasets
Introductions and Guides to Synthetic Data
Blogs and Newsletters
- The Unreasonable Effectiveness of Recurrent Neural Networks - Andrej Karpathy's intro to RNNs.
- Annotated Diffusion - Tutorial on the original diffusion model paper, with code.
Videos and Online Courses
- Learning to Generate Data by Estimating Gradients of the Data Distribution - Video by Yang Song from Stanford. Excellent theory and interesting applications.
Open Source Generative Synthetic Data Models, Libraries and Frameworks
- gretel-synthetics - Generative models for structured and unstructured text, tabular, and multivariate time-series data featuring differentially private learning.
- SDV - Synthetic Data Generator for tabular, relational, and time series data.
- Synthea - Synthetic Patient Population Simulator.
- ydata-synthetic - Synthetic structured data generators.
- synthpop - A tool for producing synthetic versions of microdata.
- Contrastive Unpaired Translation - Contrastive unpaired image-to-image translation, with faster and lighter training than CycleGAN.
- StyleGAN 3 - Official PyTorch implementation of StyleGAN3 from NeurIPS 2021.
- Denoising Diffusion Pytorch - Implementation of Denoising Diffusion Probabilistic Models (DDPM) in PyTorch.
- Jukebox - OpenAI's generative model for music.
- AirSim - A simulator for drones, cars, and more, built on the Unreal and Unity engines.
- Nvidia Dataset Synthesizer - NDDS is a UE4 plugin from NVIDIA to empower computer vision researchers to export high-quality synthetic images with metadata.
- OpenAI Gym - A toolkit for developing and comparing reinforcement learning algorithms.
- Unity Perception - Perception toolkit for sim2real training and validation in Unity.
- Video Diffusion Pytorch - Implementation of video diffusion models in PyTorch.
- Evaluating Large Language Models Trained on Code (2021) Mark Chen et al. [pdf]
- Modeling Tabular Data using Conditional GAN (2019) Xu et al. [pdf]
- Generating Long Videos of Dynamic Scenes (2022) Tim Brooks et al. [pdf]
- Generative Adversarial Networks (2014) Ian J. Goodfellow et al. [pdf]
- Conditional Generative Adversarial Nets (2014) Mehdi Mirza et al. [pdf]
- Wasserstein GAN (2017) Martin Arjovsky et al. [pdf]
- Improved Training of Wasserstein GANs (2017) Ishaan Gulrajani et al. [pdf]
- Time-series Generative Adversarial Networks (2019) Jinsung Yoon et al. [pdf]
- Generative Modeling by Estimating Gradients of the Data Distribution (2021) Yang Song [pdf]
- Diffusion Models are Autoencoders (2022) Sander Dieleman [pdf]
- Deep Unsupervised Learning using Nonequilibrium Thermodynamics (2015) J. Sohl-Dickstein et al. [pdf]
- KNN-Diffusion: Image Generation via Large-Scale Retrieval (2022) Oron Ashual et al. [pdf]
- A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle (2021) Harini Suresh, John Guttag [pdf]
- DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks (2021) Boris van Breugel et al. [pdf]
- On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? (2021) Emily M. Bender, et al. [pdf]
- A Survey on Bias and Fairness in Machine Learning (2022) Ninareh Mehrabi et al. [pdf]
- AI Fairness (Approaches & Mathematical Definitions) (2022) Jonathan Hui [blog]
- AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias (2018) Rachel K. E. Bellamy et al. [pdf]
- Deep Learning with Differential Privacy (2016) Abadi et al. [pdf]
- An Efficient DP-SGD Mechanism for Large Scale NLP Models (2021) Dupuy et al. [pdf]
- PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees (2018) Jordon et al. [pdf]
- Don't Generate Me: Training Differentially Private Generative Models with Sinkhorn Divergence (2021) Cao et al. [pdf]
- Differentially Private Fine-tuning of Language Models (2022) Yu et al. [pdf]
Synthetic Data as API with higher-level functionality such as model training, fine-tuning, and generation
- List of Synthetic Data Startups in 2021 - Not all of these necessarily have APIs.
- HuggingFace Datasets - Library for easily accessing and sharing datasets, and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks.
- Google Cloud Public Datasets - Publicly available and free machine learning and analytics datasets.
- Kaggle Datasets - Data science and machine learning datasets.
- /r/datasets - A place to share, find, and discuss Datasets.
- Papers with Code - Datasets - The mission of Papers with Code is to create a free and open resource with Machine Learning papers, code, datasets, methods and evaluation tables.
- Awesome Public Datasets - Topic-centric, high-quality public data sources.
- Data.gov - The U.S. Government's open data.
- UC Irvine Machine Learning Repository - Popular datasets used by the machine learning community.
- Google Research Dataset Search - Discover datasets hosted in thousands of repositories across the web.
License - CC0