Batch size issue when training on a custom dataset

Question

sungmin9939 opened this issue 2 months ago

I'm trying to train the model on a custom dataset using 4 A6000 GPUs (49 GB each), but training takes about 27 GB on each GPU even with a batch size of 1.
Here are my config file and GPU status:
```yaml
model:
  base_learning_rate: 1.0e-04
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.0120
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: "image_target"
    cond_stage_key: "image_cond"
    image_size: 32
    channels: 4
    cond_stage_trainable: false # Note: different from the one we trained before
    conditioning_key: hybrid
    monitor: val/loss_simple_ema
    scale_factor: 0.18215

    scheduler_config: # 10000 warmup steps
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps: [ 100 ]
        cycle_lengths: [ 10000000000000 ] # incredibly large number to prevent corner cases
        f_start: [ 1.e-6 ]
        f_max: [ 1. ]
        f_min: [ 1. ]

    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 32 # unused
        in_channels: 8
        out_channels: 4
        model_channels: 320
        attention_resolutions: [ 4, 2, 1 ]
        num_res_blocks: 2
        channel_mult: [ 1, 2, 4, 4 ]
        num_heads: 8
        use_spatial_transformer: True
        transformer_depth: 1
        context_dim: 768
        use_checkpoint: True
        legacy: False

    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity

    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenCLIPImageEmbedder

data:
  target: ldm.data.simple.ObjaverseDataModuleFromConfig
  params:
    root_dir: my_path
    batch_size: 1
    num_workers: 8
    total_view: 4
    train:
      validation: False
      image_transforms:
        size: 256
    validation:
      validation: True
      image_transforms:
        size: 256

lightning:
  find_unused_parameters: false
  metrics_over_trainsteps_checkpoint: True
  modelcheckpoint:
    params:
      every_n_train_steps: 5000
  callbacks:
    image_logger:
      target: main.ImageLogger
      params:
        batch_frequency: 500
        max_images: 32
        increase_log_steps: False
        log_first_step: True
        log_images_kwargs:
          use_ema_scope: False
          inpaint: False
          plot_progressive_rows: False
          plot_diffusion_rows: False
          N: 32
          unconditional_guidance_scale: 3.0
          unconditional_guidance_label: [""]

  trainer:
    benchmark: True
    val_check_interval: 5000000 # really sorry
    num_sanity_val_steps: 0
    accumulate_grad_batches: 5
```

```
Wed Apr 24 06:47:00 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1D:00.0 Off |                  Off |
| 48%   71C    P2             203W / 300W |  27238MiB / 49140MiB |     92%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:1E:00.0 Off |                  Off |
| 46%   70C    P2             204W / 300W |  27242MiB / 49140MiB |     93%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 49%   73C    P2             202W / 300W |  27242MiB / 49140MiB |     94%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000               Off | 00000000:20:00.0 Off |                  Off |
| 47%   70C    P2             194W / 300W |  27222MiB / 49140MiB |     94%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
Is it normal for a batch size of 1 to consume this much GPU memory?
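
For context, here is a rough back-of-the-envelope sketch of the memory that does not depend on batch size at all. It rests on several assumptions that are not confirmed by the config above: that the UNet has roughly 860M trainable parameters (the figure commonly quoted for the Stable Diffusion v1 UNet), that training runs in fp32 with an Adam-style optimizer, and that an EMA copy of the weights is kept (the `val/loss_simple_ema` monitor hints at this). The function name `fixed_training_memory_gib` is made up for illustration.

```python
# Rough estimate of the batch-size-independent GPU memory for fp32 training.
# ASSUMPTIONS (not taken from the config): ~860M trainable UNet parameters,
# an Adam-style optimizer with two moment buffers, and an EMA weight copy.

BYTES_PER_FP32 = 4


def fixed_training_memory_gib(n_params: int, use_ema: bool = True) -> float:
    """Memory for weights, gradients, Adam moments, and an optional EMA copy."""
    copies = 1 + 1 + 2  # weights + gradients + two Adam moment buffers
    if use_ema:
        copies += 1  # extra full copy of the weights for EMA
    return n_params * BYTES_PER_FP32 * copies / 2**30


assumed_unet_params = 860_000_000  # hypothetical round figure
print(f"with EMA:    {fixed_training_memory_gib(assumed_unet_params):.1f} GiB")
print(f"without EMA: {fixed_training_memory_gib(assumed_unet_params, use_ema=False):.1f} GiB")
```

Under these assumptions the fixed cost alone is on the order of 16 GiB, before counting activations, the frozen autoencoder and CLIP image encoder, or the CUDA context. If the assumptions hold, a large share of the 27 GB would be there regardless of batch size.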