sdv-dev / CTGAN

Conditional GAN for generating synthetic tabular data.

Deployment requirements based on libtorch or ONNX

wangrui9720 opened this issue

After training CTGAN, we would like to call the model from C++ so that it can run in real time. In our attempts, CTGAN could not be exported to TorchScript or similar formats, because its inputs and outputs are pandas DataFrames, whereas libtorch requires tensors as inputs and outputs. We really need a C++-based deployment method, which would improve the runtime efficiency of our software. We look forward to your suggestions!

Hi @wangrui9720! It’s great to see your interest in the SDV ecosystem. This comment is a reminder to consult your legal team before adopting SDV in your project, as SDV (and most related libraries, such as CTGAN) has a source-available BSL license.

For more information, you can read through our license FAQs (not legal advice) or our blog. For any other questions, please refer to our Support Page. You can also inquire about a commercial license to allow additional use.

Hi there @wangrui9720, do you mind sharing a bit more about your use case? A few suggestions to consider:

  • GaussianCopulaSynthesizer, from SDV, is an alternative model that is significantly faster than our GAN-based models like CTGAN. SDV is our batteries-included framework that sits one level above CTGAN and offers a better user experience.
  • To reduce CTGAN training time, you can often get very good synthetic data quality with fewer rows than you think. You can read more about our thinking and advice here.

This is the code I use to call the trained CTGAN model.

from ctgan import CTGAN
import pandas as pd


def load_ctgan_model():
    # Load the pickled CTGAN synthesizer from disk.
    model_path = 'Z:/project/pkl/ctgan-test.pkl'
    ctgan = CTGAN.load(model_path)
    return ctgan


def get_welding_parameters(ctgan, NG_piece, desired_rows=500, batch_size=100):
    conditioned_data_list = []

    # Rejection sampling: keep drawing batches until enough rows match NG_piece.
    # Note: this loops forever if the model never produces a matching row.
    while len(conditioned_data_list) < desired_rows:
        generated_data = ctgan.sample(batch_size)
        new_data = generated_data[generated_data['slice'] == NG_piece]
        conditioned_data_list.extend(new_data.values)

    conditioned_data = pd.DataFrame(conditioned_data_list, columns=generated_data.columns)

    # Trim any surplus rows collected from the last batch.
    if len(conditioned_data) > desired_rows:
        conditioned_data = conditioned_data.iloc[:desired_rows]

    average_welding_time = conditioned_data['time(ms)'].mean()
    average_welding_temp = conditioned_data['temp(℃)'].mean()

    return average_welding_time, average_welding_temp
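
One caveat with the sampling loop in this code: it keeps drawing batches until enough rows match, so it would never finish if the model stopped producing the requested `slice` value. The following is only a sketch of a bounded variant, not part of CTGAN. The names `sample_matching_rows`, `fake_sampler`, and `max_batches` are hypothetical, and `fake_sampler` is a random stand-in so the example runs without a trained model. With a real model, something like `lambda n: ctgan.sample(n).to_dict('records')` could plausibly be passed as `sample_fn`.

```python
import random
from statistics import mean


def sample_matching_rows(sample_fn, column, value, desired_rows=500,
                         batch_size=100, max_batches=1000):
    """Collect rows where row[column] == value, giving up after max_batches."""
    matched = []
    for _ in range(max_batches):
        batch = sample_fn(batch_size)  # expected: a list of dict rows
        matched.extend(row for row in batch if row[column] == value)
        if len(matched) >= desired_rows:
            return matched[:desired_rows]
    raise RuntimeError(
        f"only {len(matched)} of {desired_rows} matching rows "
        f"after {max_batches} batches"
    )


def fake_sampler(batch_size, rng=random.Random(0)):
    # Hypothetical stand-in for ctgan.sample(); it just returns random rows.
    return [
        {"slice": rng.choice([1, 2, 3]),
         "time(ms)": rng.uniform(10, 20),
         "temp(℃)": rng.uniform(200, 300)}
        for _ in range(batch_size)
    ]


rows = sample_matching_rows(fake_sampler, "slice", 2, desired_rows=50)
average_welding_time = mean(row["time(ms)"] for row in rows)
average_welding_temp = mean(row["temp(℃)"] for row in rows)
```

Raising an error after `max_batches` lets a real-time caller fall back or report the failure instead of blocking indefinitely.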

When I want to deploy the trained CTGAN model for real-time output, my only option is to call this Python code from C++. The GaussianCopulaSynthesizer you mentioned is also Python code, so I would still need to call it from C++ to train it, right? Looking forward to your reply!

Ah, now I understand, @wangrui9720. You're correct that CTGAN and SDV don't currently support porting just the machine learning model. The pkl file also contains a lot of Python library context, because that context is usually needed to run the Synthesizer and generate synthetic data.
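
As a rough illustration of why a pickle is tied to Python: pickle records each object's class by module and name, so the file can only be loaded where that module is importable. The sketch below uses the standard library's `Fraction` as a stand-in for a synthesizer object. The `weights` dictionary and its `layer0` key are hypothetical and do not reflect CTGAN's actual internals or any existing export feature.

```python
import json
import pickle
from fractions import Fraction

# Pickling an object stores a reference to its class (module + name),
# so the bytes only make sense where that module can be imported.
blob = pickle.dumps(Fraction(1, 3))
print(b"fractions" in blob, b"Fraction" in blob)

# By contrast, plain numeric weights can be written in a neutral format
# that a C++ program could parse without any Python runtime.
weights = {"layer0": [[0.1, -0.2], [0.3, 0.4]]}  # hypothetical values
exported = json.dumps(weights)
restored = json.loads(exported)
```

A weights-only export along these lines would still leave the data transformations to be reimplemented on the C++ side.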

We have a feature request issue in SDV to enable the exporting of just the model weights: sdv-dev/SDV#1970

I'll close this issue off and will add your use case over there so we can collect more examples for the team to prioritize! Thanks!