A curated list of awesome synthetic data tools (open source and commercial).
Inspired by Awesome Synthetic Data.
- Copulas: a Python library for modeling multivariate distributions and sampling from them using copula functions.
- CTGAN: SDV’s collection of deep learning-based synthetic data generators for single table data.
- DataGene: a tool to train, test, and validate datasets, and to detect and compare similarity between real and synthetic datasets.
- DoppelGANger: a synthetic data generation framework based on generative adversarial networks (GANs).
- DP_WGAN-UCLANESL: a solution that trains a Wasserstein generative adversarial network (WGAN) on the real private dataset.
- DPSyn: an algorithm for synthesizing microdata while satisfying differential privacy.
- Faker: a Python package that generates fake data (Note: this tool does not generate synthetic data but offers dummy data).
- Generative adversarial nets for synthetic time series data: a repository that shows how to create synthetic time-series data using generative adversarial networks (GANs).
- Gretel.ai: commercial synthetic data vendor that offers open source functionality.
- mirrorGen: a Python tool that generates synthetic data based on user-specified causal relations among features in the data.
- Plait.py: a program for generating fake data from composable YAML templates.
- Pydbgen: a Python package that generates a random database table based on the user's choice of data types.
- Smart noise synthesizer: a differentially private open source synthesizer for tabular data.
- Synner: an open source tool to generate real-looking synthetic data by visually specifying the properties of the dataset.
- Synth: an open source data-as-code tool that provides a simple CLI workflow for generating consistent data in a scalable way.
- Synthea: an open source synthetic patient generator that models the medical history of synthetic patients.
- Synthetic Data Vault (SDV): one of the first open source synthetic data solutions, SDV provides tools for generating synthetic tabular, relational, and time series data.
- TGAN: generative adversarial training for generating synthetic tabular data.
- Tofu: a Python library for generating synthetic UK Biobank data.
- Twinify: a software package for privacy-preserving generation of a synthetic twin to a given sensitive data set.
- YData: synthetic structured data generator by YData, a commercial vendor.
- Betterdata: vendor of a privacy-preserving synthetic data solution for AI, data sharing, or product development.
- Datomize: vendor of a synthetic data solution for the development, training, and testing of AI/ML models and applications.
- Diveplane: vendor of Geminai, a solution to generate synthetic ‘twin’ datasets with the same statistical properties as the original data.
- Facteus: vendor of Mimic™, a synthetic data engine to synthesize data assets that protect consumer privacy.
- Gretel: vendor of a synthetic data generation library and APIs for developers and data practitioners.
- Hazy: vendor of a synthetic data platform for financial institutions that want to conduct data analysis.
- Instill AI: vendor of a solution for synthetic data generation leveraging generative adversarial networks (GANs) and differential privacy.
- Kymera Labs: vendor of Synthetic Data Fabrication Software, a solution that generates new data without relying on the ML/GAN approach.
- Mirry.ai: vendor of a synthetic data platform for generating synthetic data using GANs, available in Community, Cloud or Enterprise editions.
- Mostly AI: vendor of Mostly Generate, a synthetic data generator that provides as-good-as-real, yet fully anonymous data.
- Replica Analytics: vendor of Replica Synthesis, a software solution that ingests data and builds synthesis models to generate synthetic datasets.
- Sarus technologies: vendor of ML software to help data practitioners leverage sensitive data assets for innovation with privacy guarantees.
- Sogeti: vendor of Artificial Data Amplifier (ADA), a solution by the Sogeti Testing AI team that generates realistic data based on real data sets.
- Statice: vendor of a software solution that generates privacy-preserving synthetic data that can be used as a drop-in replacement for an original dataset.
- Syndata AB: vendor of a synthetic data generator to generate data sets that match the statistical attributes of real data but are entirely synthetic.
- Synthesized: vendor of a DataOps platform enabling data sharing and collaboration across internal groups, remote teams, and external partners.
- Syntheticus: Swiss vendor of a platform dedicated to generating synthetic data.
- Syntho: vendor of AI software for generating synthetic data.
- Tonic: vendor of a synthetic data generator to mimic production data.
- YData: vendor of a synthesizer that mimics the statistical information of real data in new datasets without transforming the original data.
- Open SDP: an open online community for sharing educational analytic tools and resources.
- OpenSynthetic: an open community for creating and using synthetic data in computer vision and machine learning (ML).
- GenRocket Community: a community from GenRocket for asking questions and exchanging ideas around test data and synthetic data.
- Synthetic Data Vault Slack channel: the Slack channel from the SDV team.