csslc / CCSR

Official codes of CCSR: Improving the Stability of Diffusion Models for Content Consistent Super-Resolution

Home Page: https://csslc.github.io/project-CCSR/


Training the model: CUDA out of memory

aoyang-hd opened this issue · comments

Is there any way to train on a 24 GB RTX 3090, even with a batch size of one?

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 3; 23.69 GiB total capacity; 23.03 GiB already allocated; 21.69 MiB free; 23.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epoch 0: 0%| | 2/35135 [00:29<144:16:07, 14.78s/it, loss=0.389, v_num=0, train/loss_simple_step=0.131, train/loss_vlb_step=0.000475, train/loss_step=0.131, global_step=0.000, train/loss_x0_step=0.335, train/loss_x0_from_tao_step=0.366, train/loss_noise_from_tao_step=0.00291, train/loss_net_step=0.704]
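The error message itself points at one mitigation: when reserved memory is much larger than allocated memory, fragmentation may be the problem, and `PYTORCH_CUDA_ALLOC_CONF` can cap the allocator's split size. A minimal sketch (the value `128` is a hypothetical starting point, not a CCSR recommendation):

```python
import os

# Must be set before the first CUDA allocation, i.e. before torch touches
# the GPU -- safest is before "import torch" in the training entry point.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # 128 is illustrative

# ... then: import torch and launch training as usual
```

This only reduces fragmentation overhead; it cannot help if the model genuinely needs more than 24 GB at the chosen batch size.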

Hello, you can try fp16 (half-precision) training.
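As a rough sanity check on why fp16 helps: halving the bytes per element roughly halves the memory taken by activations and weights stored in that precision. Illustrative arithmetic only, with a hypothetical element count, not CCSR's actual footprint:

```python
# Rough memory estimate for the same tensor payload at fp32 vs fp16.
num_elements = 2_000_000_000             # hypothetical activation/weight count
fp32_gib = num_elements * 4 / 2**30      # 4 bytes per fp32 element
fp16_gib = num_elements * 2 / 2**30      # 2 bytes per fp16 element

print(f"fp32: {fp32_gib:.2f} GiB, fp16: {fp16_gib:.2f} GiB")
```

In practice mixed-precision training keeps some state (e.g. master weights, optimizer moments) in fp32, so the real saving is less than a clean 2x.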

@aoyang-hd @cswry @jfischoff I wanted to ask whether you managed to run it successfully on a single GPU. I'd appreciate a reply.

Yes, I just had to reduce the batch size.
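If reducing the per-step batch size hurts training dynamics, gradient accumulation can keep the effective batch size while cutting per-step memory. A sketch of the arithmetic, with hypothetical numbers (CCSR's actual recipe values may differ):

```python
# Keep the effective batch size while shrinking what must fit in memory
# per step, by accumulating gradients over several micro-batches.
target_batch_size = 16   # batch size the original recipe assumes (assumption)
micro_batch_size = 2     # what fits on a 24 GB card (assumption)

# Number of micro-batches to accumulate before each optimizer step.
accumulate_grad_batches = target_batch_size // micro_batch_size

print(accumulate_grad_batches)
```

In PyTorch Lightning (which the progress-bar format above suggests the training script uses), this maps to the Trainer's `accumulate_grad_batches` argument.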

@jfischoff How long did it take you to complete the training? (●'◡'●)

I didn't run the complete training; I just did a test run. I think it took about 2 days on 8x A100s.

Thank you for responding.😊