awesome-synthetic-data

A curated list of resources dedicated to Synthetic Data

If you want to contribute to this list, read the contribution guidelines first. Please add your favorite synthetic data resource by raising a pull request.

Also, a listed repository should be deprecated if:

The repository's owner explicitly states that "this library is not maintained".
It has had no commits for a long time (2 to 3 years).

Research Summaries and Trends
Tutorials
- Reading Content
- Videos and Online Courses
Libraries
- Text, Tabular and Time-Series
- Image
- Audio
- Simulation
- Video
Academic Papers
Services
Prominent Synthetic Data Research Labs
Datasets

Research Summaries and Trends

Tutorials

Reading Content

Introductions and Guides to Synthetic Data

Blogs and Newsletters

The Unreasonable Effectiveness of Recurrent Neural Networks - Andrej Karpathy's intro to RNNs.
Annotated Diffusion - Tutorial on original diffusion model paper with code

Videos and Online Courses

Diffusion Models

Learning to Generate Data by Estimating Gradients of the Data Distribution - Video by Yang Song from Stanford. Excellent theory and interesting applications.

Libraries

Open Source Generative Synthetic Data Models, Libraries and Frameworks

Text, Tabular and Time-Series

gretel-synthetics - Generative models for structured and unstructured text, tabular, and multivariate time-series data featuring differentially private learning.
SDV - Synthetic Data Generator for tabular, relational, and time series data.
Synthea - Synthetic Patient Population Simulator.
ydata-synthetic - Synthetic structured data generators.
synthpop - A tool for producing synthetic versions of microdata.

Image

Contrastive Unpaired Translation - Contrastive unpaired image-to-image translation, with faster and lighter training than CycleGAN.
StyleGAN 3 - Official PyTorch implementation of StyleGAN3 from NeurIPS 2021.
Denoising Diffusion Pytorch - Implementation of denoising diffusion probabilistic models (DDPM) in PyTorch.

Audio

Jukebox - OpenAI's generative model for music.

Simulation

AirSim - A simulator for drones, cars and more, built on the Unreal and Unity engines.
Nvidia Dataset Synthesizer - NDDS is a UE4 plugin from NVIDIA that lets computer vision researchers export high-quality synthetic images with metadata.
OpenAI Gym - A toolkit for developing and comparing reinforcement learning algorithms.
Unity Perception - Perception toolkit for sim2real training and validation in Unity.

Video

Video Diffusion Pytorch - Implementation of video diffusion models in PyTorch.

Academic Papers

Language Models

Evaluating Large Language Models Trained on Code (2021) Mark Chen et al. [pdf]

Generative Adversarial Networks (GANs)

Modeling Tabular Data using Conditional GAN (2019) Xu et al. [pdf]
Generating Long Videos of Dynamic Scenes (2022) Tim Brooks et al. [pdf]
Generative Adversarial Networks (2014) Ian J. Goodfellow et al. [pdf]
Conditional Generative Adversarial Nets (2014) Mehdi Mirza et al. [pdf]
Wasserstein GAN (2017) Martin Arjovsky et al. [pdf]
Improved Training of Wasserstein GANs (2017) Ishaan Gulrajani et al. [pdf]
Time-series Generative Adversarial Networks (2019) Jinsung Yoon et al. [pdf]

Diffusion Models

Generative Modeling by Estimating Gradients of the Data Distribution (2021) Yang Song [pdf]
Diffusion Models are Autoencoders (2021) S. Dieleman [pdf]
Deep Unsupervised Learning using Nonequilibrium Thermodynamics (2015) J Sohl-Dickstein et al. [pdf]
KNN-Diffusion: Image Generation via Large-Scale Retrieval (2022) Shelly Sheynin et al. [pdf]

Fair AI

A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle (2021) Harini Suresh, John Guttag [pdf]
DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks (2021) Boris van Breugel et al. [pdf]
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? (2021) Emily M. Bender et al. [pdf]
A Survey on Bias and Fairness in Machine Learning (2022) Ninareh Mehrabi et al. [pdf]
AI Fairness (Approaches & Mathematical Definitions) (2022) Jonathan Hui [blog]
AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias (2018) Rachel K. E. Bellamy et al. [pdf]

Algorithmic Privacy

Deep Learning with Differential Privacy (2016) Abadi et al. [pdf]
An Efficient DP-SGD Mechanism for Large Scale NLP Models (2021) Dupuy et al. [pdf]
PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees (2018) Jordon et al. [pdf]
Don't Generate Me: Training Differentially Private Generative Models with Sinkhorn Divergence (2021) Cao et al. [pdf]
Differentially Private Fine-tuning of Language Models (2022) Yu et al. [pdf]

Services

Synthetic Data as API with higher-level functionality such as model training, fine-tuning, and generation

List of Synthetic Data Startups in 2021 - Not all of these necessarily have APIs.

Prominent Synthetic Data Research Labs

Datasets

HuggingFace Datasets - Library for easily accessing and sharing datasets, and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks.
Google Cloud Public Datasets - Publicly available and free machine learning and analytics datasets.
Kaggle Datasets - Data science and machine learning datasets.
/r/datasets - A place to share, find, and discuss Datasets.
Papers with Code - Datasets - The mission of Papers with Code is to create a free and open resource with Machine Learning papers, code, datasets, methods and evaluation tables.
Awesome Public Datasets - Topic-centric, high-quality public data sources.
Data.gov - The U.S. Government's open data.
UC Irvine Machine Learning Repository - Popular datasets used by the machine learning community.
Google Research Dataset Search - Discover datasets hosted in thousands of repositories across the web.

License

License - CC0
