SynMeter

A principled library for tuning, training and evaluating tabular data synthesis on fidelity, privacy and utility.

Paper: https://arxiv.org/abs/2402.06806


Logo generated by DALL·E 3.


Why SynMeter:

  • 💫 Easy to add new synthesizers, which can then be seamlessly tuned, trained, and evaluated.
  • 🌀 Principled evaluation metrics for fidelity, privacy, and utility.
  • 🔥 Several SoTA synthesizers, by type:
    • Statistical methods: PGM, PrivSyn
    • GAN-based: CTGAN, PATE-GAN
    • VAE-based: TVAE
    • Diffusion-based: TabDDPM, TableDiffusion
    • LLM-based: GReaT

🚀 Installation

Create a new conda environment and set it up:

conda create -n synmeter python==3.9
conda activate synmeter
pip install -r requirements.txt # install dependencies
pip install -e . # package the library

Change the base directory in ./lib/info/ROOT_DIR:

ROOT_DIR = root_to_synmeter
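For example, if the repository were cloned to /home/alice/SynMeter (a hypothetical path used purely for illustration), the line would read:

ROOT_DIR = "/home/alice/SynMeter"  # hypothetical absolute path to the repository root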

💥 Usage

Datasets

  • SynMeter provides 12 standardized datasets with train/val/test splits for benchmarking, which can be downloaded from here: Google Drive
  • You can also use an additional dataset by placing it in ./dataset.
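As a quick sanity check after downloading, you can inspect a split. This is a minimal sketch that assumes CSV storage and a dataset folder named adult; neither is guaranteed, so adjust to the actual layout:

# Minimal sketch: inspect a downloaded split.
# Assumptions: CSV files and a dataset folder named "adult"; adjust to the actual layout.
import pandas as pd

train = pd.read_csv("./dataset/adult/train.csv")
print(train.shape)
print(train.head())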

Tune evaluators for utility evaluations

  • The machine learning affinity metric requires machine learning models with tuned hyperparameters; SynMeter provides 8 commonly used machine learning models and their configurations in ./exp/evaluators.
  • You can tune these evaluators on your own dataset:
python scripts/tune_evaluator.py -d [dataset] -c [cuda]
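For instance, assuming adult is one of the provided datasets and 0 is a valid CUDA device index (both values are illustrative guesses):

python scripts/tune_evaluator.py -d adult -c 0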

Tune synthesizer

We provide a unified tuning objective for model tuning, so every kind of synthesizer can be tuned with a single command:

python scripts/tune_synthesizer.py -d [dataset] -m [synthesizer] -s [seed] -c [cuda]
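For example, to tune TabDDPM on the adult dataset with seed 0 on GPU 0 (the exact identifiers accepted by -d and -m are assumptions here):

python scripts/tune_synthesizer.py -d adult -m tabddpm -s 0 -c 0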

Train synthesizer

After tuning, the resulting configuration is recorded under ./exp/[dataset]/[synthesizer]; SynMeter uses it to train and store the synthesizer:

python scripts/train_synthesizer.py -d [dataset] -m [synthesizer] -s [seed] -c [cuda]
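Continuing the hypothetical example from above:

python scripts/train_synthesizer.py -d adult -m tabddpm -s 0 -c 0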

Evaluate synthesizer

Assessing the fidelity of the synthetic data:

python scripts/eval_fidelity.py -d [dataset] -m [synthesizer] -s [seed] -t [target] 

Assessing the privacy of the synthetic data:

python scripts/eval_privacy.py -d [dataset] -m [synthesizer] -s [seed]

Assessing the utility of the synthetic data:

python scripts/eval_utility.py -d [dataset] -m [synthesizer] -s [seed]
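Putting the three evaluations together for the running hypothetical example (the fidelity flag -t is left as a placeholder, since its admissible values are not documented here):

python scripts/eval_fidelity.py -d adult -m tabddpm -s 0 -t [target]
python scripts/eval_privacy.py -d adult -m tabddpm -s 0
python scripts/eval_utility.py -d adult -m tabddpm -s 0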

The evaluation results are saved under the corresponding directory ./exp/[dataset]/[synthesizer].

📖 Customize your own synthesizer

One advantage of SynMeter is that it provides an easy way to add new synthesis algorithms; only three steps are needed:

  1. Write the new synthesis code as a module in ./synthesizer/my_synthesizer.
  2. Create a base configuration in ./exp/base_config.
  3. Create a calling Python module in ./synthesizer that contains three functions: train, sample, and tune (a sketch follows below).

Then, you are free to tune, run, and test the new synthesizer!
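A minimal sketch of such a calling module is shown below. Only the three entry-point names (train, sample, tune) come from the steps above; the signatures and argument names are assumptions for illustration, not SynMeter's actual interface:

# synthesizer/my_synthesizer.py -- hypothetical calling module for a new synthesizer.
# Only the three entry-point names are given by SynMeter; the signatures below
# are illustrative assumptions.

def train(config, train_data, seed=0, device="cpu"):
    # Fit the synthesizer on the training split with the (tuned) configuration
    # and return the trained model object.
    ...

def sample(model, num_samples, seed=0):
    # Draw num_samples synthetic records from the trained model,
    # e.g. as a pandas DataFrame.
    ...

def tune(config_space, train_data, val_data, seed=0, device="cpu"):
    # Search the hyperparameter space against the unified tuning objective
    # and return the best configuration found.
    ...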

🔑 Methods

Each method is labeled DP (differentially private) or HP (heuristic privacy, i.e., without a formal privacy guarantee).

Statistical methods

| Method | Type | Description | Reference |
| --- | --- | --- | --- |
| PGM | DP | Uses probabilistic graphical models to learn the dependence structure of low-dimensional marginals for data synthesis. | Paper |
| PrivSyn | DP | A non-parametric DP synthesizer that iteratively updates the synthetic dataset to match the noisy target marginals. | Paper |

Generative adversarial networks (GANs)

| Method | Type | Description | Reference |
| --- | --- | --- | --- |
| CTGAN | HP | A conditional generative adversarial network that can handle tabular data. | Paper |
| PATE-GAN | DP | Applies the Private Aggregation of Teacher Ensembles (PATE) framework to GANs. | Paper |

Variational autoencoders (VAE)

| Method | Type | Description | Reference |
| --- | --- | --- | --- |
| TVAE | HP | A conditional VAE network that can handle tabular data. | Paper |

Diffusion models

| Method | Type | Description | Reference |
| --- | --- | --- | --- |
| TabDDPM | HP | Models tabular data with denoising diffusion probabilistic models. | Paper |
| TableDiffusion | DP | Generates tabular datasets under differential privacy. | Paper |

Large Language Model (LLM)-based models

| Method | Type | Description | Reference |
| --- | --- | --- | --- |
| GReaT | HP | Fine-tunes a pretrained large language model to generate realistic tabular data. | Paper |

⚡ Evaluation metrics

  • Fidelity metrics: we consider the Wasserstein distance as a principled fidelity metric, computed over all one-way and two-way marginals (see the sketch after this list).

  • Privacy metrics: we devise a Membership Disclosure Score (MDS) to measure the membership privacy risks of synthesizers.

  • Utility metrics: we use machine learning affinity and query error to measure the utility of synthetic data.
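To make the fidelity metric concrete, the sketch below averages the one-dimensional Wasserstein distance over numeric columns, i.e., over one-way marginals. It illustrates the idea only and is not SynMeter's implementation; the numeric-only handling and the omission of two-way marginals are simplifications:

# Minimal sketch: one-way marginal fidelity via the Wasserstein distance.
# Not SynMeter's implementation; numeric-only columns are a simplifying assumption.
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance

def one_way_wasserstein(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    # Average the 1-Wasserstein distance over each numeric column
    # (each one-way marginal).
    cols = real.select_dtypes(include=np.number).columns
    dists = [wasserstein_distance(real[c], synth[c]) for c in cols]
    return float(np.mean(dists))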

Please see our paper for details and usage.

🌈 Acknowledgements

This project builds on many excellent synthesis algorithms and open-source libraries.

License: Apache License 2.0