sdv-dev / CTGAN

Conditional GAN for generating synthetic tabular data.

Question about large amount of training dataset in TVAE -- is there max?

koseoyoung opened this issue · comments

Environment details

If you are already running CTGAN, please indicate the following details about the environment in
which you are running it:

  • CTGAN version: SDV 1.2.0
  • Python version: 3.10.11
  • Operating System: Mac M1 (CPU)

Problem description

I'm wondering if there is any max length of the training dataset for TVAE (for dataset fitting).
I've tried a large dataset, but fitting seems to take too long even though the number of epochs is set to 1.
The dataset size was around 80 MB, and I was running the code with CPU. (It kept running for more than 1 hr, and I was not able to see any logs.)
Is this expected behavior? Since there is no verbose option, it's hard to tell whether it is still training or has hit an error.

Thank you! : )

Hi @koseoyoung, nice to meet you.

As long as your code isn't crashing, it should still be running as intended. Even though you have only specified 1 epoch, within that epoch TVAE still iterates over your data in chunks of batch_size rows, so a large dataset means many update steps. This may be taking some time.
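
To make the batch_size point concrete, here is a rough sketch of how many update steps one epoch involves. The row count is a hypothetical stand-in (the issue only gives the file size, about 80 MB), and keeping the last partial batch is an assumption about how the data loader is configured, not something confirmed in this thread.

```python
import math


def steps_per_epoch(n_rows: int, batch_size: int) -> int:
    """Number of mini-batch updates in one full pass over the data.

    Assumes the final, smaller batch is kept rather than dropped.
    """
    if n_rows < 0 or batch_size <= 0:
        raise ValueError("n_rows must be >= 0 and batch_size must be > 0")
    return math.ceil(n_rows / batch_size)


# Hypothetical row count; substitute len(your_dataframe).
n_rows = 1_000_000

for batch_size in (500, 5_000):
    steps = steps_per_epoch(n_rows, batch_size)
    print(f"batch_size={batch_size}: {steps} update steps in 1 epoch")
```

So even with 1 epoch, the model may perform thousands of updates before fit returns, and the count grows linearly with the number of rows.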

I understand the frustration of not having a verbose option, so I've added #300 as a proposed feature request.

> I'm wondering if there is any max length of the training dataset for TVAE (for dataset fitting).

While there is no theoretical max length, you may find certain dataset sizes infeasible for the computational power that you have. For GAN-based synthesizers, many users report needing a few hours.

> The dataset size was around 80 MB, and I was running the code with CPU.

If possible, running on a GPU might be a good option. Alternatively, you can subsample your data for training purposes. The important thing is to make sure your subsample contains the patterns you are trying to learn. For example, all the possible categories, a large range of numerical values, etc.
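
One way to build such a subsample is sketched below, using only the standard library and rows stored as dictionaries. The function name `coverage_subsample`, the column names, and the toy data are all hypothetical. The idea is to first force in one row per category value and the rows holding each numeric column's minimum and maximum, then fill the remainder at random. With a pandas DataFrame the same approach applies.

```python
import random


def coverage_subsample(rows, n, categorical_cols=(), numeric_cols=(), seed=0):
    """Return about n rows that cover every category and each numeric min/max.

    If the forced rows alone exceed n, the result is larger than n.
    """
    if not rows:
        return []
    keep = set()
    # Force in the first row seen for every distinct category value.
    for col in categorical_cols:
        seen = set()
        for i, row in enumerate(rows):
            if row[col] not in seen:
                seen.add(row[col])
                keep.add(i)
    # Force in the rows holding each numeric column's extremes.
    for col in numeric_cols:
        keep.add(min(range(len(rows)), key=lambda i: rows[i][col]))
        keep.add(max(range(len(rows)), key=lambda i: rows[i][col]))
    # Fill the rest of the budget with a seeded random draw.
    rng = random.Random(seed)
    remaining = [i for i in range(len(rows)) if i not in keep]
    extra = max(0, min(n - len(keep), len(remaining)))
    keep.update(rng.sample(remaining, extra))
    return [rows[i] for i in sorted(keep)]


# Toy data: "enterprise" is a rare category (10 of 1000 rows).
rows = [
    {"plan": "enterprise" if i % 100 == 0 else "basic", "amount": float(i)}
    for i in range(1000)
]
sample = coverage_subsample(rows, 50, ["plan"], ["amount"])
print(len(sample), {r["plan"] for r in sample})
```

Forcing rows in like this slightly over-represents rare categories and extreme values compared with a plain random sample, so it is worth checking the synthetic data's distributions afterwards.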

Marking this issue as resolved since it has been inactive for some time. The good news is that the feature in #300 has been added, so you can now view the progress bar to track estimated time.

If you have additional questions, please feel free to file a new issue. Thanks.