Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space [Paper link]

Installing Dependencies

Python version: 3.10

Create environment

conda create -n tabsyn python=3.10
conda activate tabsyn

Install pytorch

conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia

Install other dependencies

pip install -r requirements.txt

Install dependencies for GOGGLE

pip install  dgl -f https://data.dgl.ai/wheels/cu117/repo.html

pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.0.1+cu117.html

Create another environment for the quality metric (package "synthcity")

conda create -n tabsyn python=3.10
conda activate tabsyn

pip install synthcity
pip install category_encoders

Preparing Datasets

Download raw dataset:

python download_dataset.py

Process dataset:

python process_dataset.py

Training Models

For baseline methods, use the following command for training:

python main.py --dataname [NAME_OF_DATASET] --method [NAME_OF_BASELINE_METHODS] --mode train

Options of [NAME_OF_DATASET]: adult, default, shoppers, magic, beijing, news Options of [NAME_OF_BASELINE_METHODS]: goggle, great, stasy, codi, tabddpm

For Tabsyn, use the following command for training:

# train VAE first
python main.py --dataname [NAME_OF_DATASET] --method vae --mode train

# after the VAE is trained, train the diffusion model
python main.py --dataname [NAME_OF_DATASET] --method tabsyn --mode train

Tabular Data Synthesis

For baseline methods, use the following command for synthesis:

python main.py --dataname [NAME_OF_DATASET] --method [NAME_OF_BASELINE_METHODS] --mode sample --save_path [PATH_TO_SAVE]

For Tabsyn, use the following command for synthesis:

python main.py --dataname [NAME_OF_DATASET] --method tabsyn --mode sample --save_path [PATH_TO_SAVE]

The default save path is "synthetic/[NAME_OF_DATASET]/[METHOD_NAME].csv"

Evaluation

Density estimation:

python eval/eval_density.py --dataname [NAME_OF_DATASET] --model [METHOD_NAME] --path [PATH_TO_SYNTHETIC_DATA]

Alpha Precision and Beta Recall:

python eval/eval_quality.py --dataname [NAME_OF_DATASET] --model [METHOD_NAME] --path [PATH_TO_SYNTHETIC_DATA]

Machine Learning Efficiency:

python eval/eval_mle.py --dataname [NAME_OF_DATASET] --model [METHOD_NAME] --path [PATH_TO_SYNTHETIC_DATA]

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Reference

We appreciate your citations if you find this repository useful to your research!

@article{zhang2023mixed,
  title={Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space},
  author={Zhang, Hengrui and Zhang, Jiani and Srinivasan, Balasubramaniam and Shen, Zhengyuan and Qin, Xiao and Faloutsos, Christos and Rangwala, Huzefa and Karypis, George},
  journal={arXiv preprint arXiv:2310.09656},
  year={2023}
}

Yidadaa / tabsyn