Memory consumption in parallel execution using CPU
rafamarquesi opened this issue · comments
Hello! I don't have much experience, especially with deep learning algorithms, and I need help running TabNet. I'm using pytorch-tabnet==4.0.
The dataset:
x_train shape: (2378460, 30)
y_train shape: (2378460,)
x_test shape: (594616, 30)
y_test shape: (594616,)
I'm using scikit-learn to perform the grid search, since I'm also testing algorithms other than TabNet. For that, I followed the suggestion found in this issue.
The parameters to be adjusted in grid search are:
```python
{
    'classifier__estimator': [
        TabNetClassifierTuner(
            device_name='cpu',
            use_embeddings=True,
            threshold_categorical_features=150,
            use_cat_emb_dim=True
        )
    ],
    'classifier__estimator__seed': [settings.random_seed],
    'classifier__estimator__clip_value': [1],
    'classifier__estimator__verbose': [1],
    'classifier__estimator__optimizer_fn': [torch.optim.Adam],
    'classifier__estimator__optimizer_params': [
        {'lr': 0.02},
        {'lr': 0.01},
        {'lr': 0.001}
    ],
    'classifier__estimator__scheduler_fn': [torch.optim.lr_scheduler.StepLR],
    'classifier__estimator__scheduler_params': [{
        'step_size': 10,  # decay the learning rate every 10 epochs
        'gamma': 0.95
    }],
    'classifier__estimator__mask_type': ['sparsemax'],
    'classifier__estimator__n_a': [8, 64],
    'classifier__estimator__n_steps': [3, 10],
    'classifier__estimator__gamma': [1.3, 2.0],
    'classifier__estimator__cat_emb_dim': [10, 20],
    'classifier__estimator__n_independent': [2, 5],
    'classifier__estimator__n_shared': [2, 5],
    'classifier__estimator__momentum': [0.02, 0.4],
    'classifier__estimator__lambda_sparse': [0.001, 0.1]
}
```
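For scale, it is worth multiplying out how many candidates this grid produces. A quick sanity check, using only the parameters that take more than one value (the single-value entries add no combinations):

```python
from itertools import product

# Only the grid entries with more than one value contribute combinations.
search_space = {
    'optimizer_params': [{'lr': 0.02}, {'lr': 0.01}, {'lr': 0.001}],
    'n_a': [8, 64],
    'n_steps': [3, 10],
    'gamma': [1.3, 2.0],
    'cat_emb_dim': [10, 20],
    'n_independent': [2, 5],
    'n_shared': [2, 5],
    'momentum': [0.02, 0.4],
    'lambda_sparse': [0.001, 0.1],
}

n_candidates = len(list(product(*search_space.values())))
print(n_candidates)  # 3 * 2**8 = 768 candidate models, before CV folds
```

With k-fold cross-validation each candidate is fit k times, so the total number of TabNet trainings is k × 768. Sampling a small `n_iter` with `RandomizedSearchCV` instead of exhaustively enumerating the grid is a common way to shrink that number.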
The fit parameters:
```python
super().fit(
    X_train=X_train,
    y_train=y_train,
    eval_set=[(X_train, y_train), (X_valid, y_valid)],
    eval_name=['train', 'valid'],
    eval_metric=['auc', 'balanced_accuracy', 'accuracy'],
    weights=0,
    max_epochs=1000,
    patience=20,
    batch_size=16384,
    virtual_batch_size=2048,
    num_workers=0,
    drop_last=False,
    augmentations=None
)
```
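One memory-relevant detail in this fit call: `eval_set` includes the full training set, so every epoch's metric pass runs over all ~2.4M training rows on top of the validation set. A hedged sketch of monitoring train metrics on a fixed subsample instead (the 100k size is an arbitrary illustration, not a recommendation):

```python
import numpy as np

rng = np.random.default_rng(42)

# Number of training rows in the dataset described above.
n_train = 2_378_460

# Draw a fixed 100k-row subsample of the training set once, and use it
# for the per-epoch train metrics instead of the full matrix.
idx = rng.choice(n_train, size=100_000, replace=False)
print(idx.shape)  # (100000,)

# Then pass the subsample to fit, e.g.:
# eval_set=[(X_train[idx], y_train[idx]), (X_valid, y_valid)]
```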
These settings are for running the experiment for binary classification. I also intend to run an experiment for multiclass classification.
I created a smaller sample of the dataset, with almost 50k records, for binary classification, and it worked great: I got the grid-search result. On CPU it took more than a day, with 37 CPUs and 45 GB of memory. During this experiment, TabNet's grid search started out consuming just over 15 GB of memory; over time it climbed, held steady for a long while, and finished at just over 40 GB. The grid search was configured with n_jobs=37 and pre_dispatch=37, and fit used batch_size=1024 and virtual_batch_size=128.
Since the full dataset detailed above is larger, I'm using another machine with 120 CPUs and 115 GB of memory. However, after a few days the run dies for lack of memory. I've already tried running the grid search with n_jobs and pre_dispatch set to 20, 15, and 10; the last attempt (n_jobs and pre_dispatch equal to 10) ran out of memory after a week of execution. I'm now trying n_jobs and pre_dispatch equal to 5.
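For a rough sense of the baseline footprint: the raw training matrix alone is sizable, and every grid-search worker that materializes its own float64 copy (plus the duplicated `eval_set`) multiplies it. A sketch of the arithmetic and of two mitigations worth trying, casting to float32 and reopening the array as a read-only memory map so worker processes can share pages (joblib, which sklearn uses under `n_jobs`, can also memmap large inputs automatically):

```python
import os
import tempfile

import numpy as np

rows, cols = 2_378_460, 30
gb_f64 = rows * cols * 8 / 1e9   # float64 footprint of x_train alone
gb_f32 = rows * cols * 4 / 1e9   # float32 halves it
print(f"{gb_f64:.2f} GB as float64, {gb_f32:.2f} GB as float32")

# Persist the features once and reopen them as a read-only memory map,
# so parallel workers share the OS page cache instead of each holding
# a private in-memory copy. Toy-sized stand-in array for illustration.
path = os.path.join(tempfile.mkdtemp(), "x_train.npy")
np.save(path, np.zeros((1000, cols), dtype=np.float32))
X = np.load(path, mmap_mode="r")
print(X.dtype, X.shape)  # float32 (1000, 30)
```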
For now I don't have access to a GPU; I'm looking into that option.
The parallel grid search always starts out consuming little memory and, as reported above, grows over time. I don't know if I'm doing something wrong in the settings. Is there any way to control this? Any tips?
Thanks in advance for your attention.
I would advise you to get a GPU; it will make training much, much faster. Then you won't have to parallelize training: you can run your grid search one point at a time and it will still be much faster than training on CPU.
I thought I was doing some configuration wrong. I am trying to acquire access to GPUs. Thank you, @Optimox.