Train model: CUDA out of memory
aoyang-hd opened this issue
Is there any way to train on a 24 GB RTX 3090, even with a batch size of one?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 3; 23.69 GiB total capacity; 23.03 GiB already allocated; 21.69 MiB free; 23.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epoch 0: 0%| | 2/35135 [00:29<144:16:07, 14.78s/it, loss=0.389, v_num=0, train/loss_simple_step=0.131, train/loss_vlb_step=0.000475, train/loss_step=0.131, global_step=0.000, train/loss_x0_step=0.335, train/loss_x0_from_tao_step=0.366, train/loss_noise_from_tao_step=0.00291, train/loss_net_step=0.704]
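The error message itself suggests trying `max_split_size_mb` when reserved memory far exceeds allocated memory. A minimal sketch of setting it via the environment variable the message names; the value `128` and the `train.py` entry point are assumptions, not from this repo:

```shell
# Reduce allocator fragmentation by capping the split size (value is an example).
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# Placeholder launch command; substitute the repo's actual training script.
python train.py
```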
Hello, you can try fp16 (mixed-precision) training.
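A minimal sketch of what an fp16 training step looks like with PyTorch's automatic mixed precision; the model, data, and hyperparameters here are placeholders, not this repo's actual training code, and the step falls back to an unscaled bf16/fp32 path when no GPU is present:

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 1).to(device)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# GradScaler rescales the loss so fp16 gradients do not underflow (CUDA only).
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(4, 16, device=device)  # placeholder batch
y = torch.randn(4, 1, device=device)

optimizer.zero_grad(set_to_none=True)
# Autocast runs the forward pass in half precision where it is numerically safe.
with torch.autocast(
    device_type=device,
    dtype=torch.float16 if device == "cuda" else torch.bfloat16,
):
    loss = nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

In a PyTorch Lightning setup (which the progress bar above suggests), the equivalent is usually just passing `precision=16` to the `Trainer`, so activations and the backward pass use roughly half the memory.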
@aoyang-hd @cswry @jfischoff I wanted to ask if you ran it successfully on a single GPU. I'd appreciate it if you could reply to me.
Yes, I just had to reduce the batch size.
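When the batch size is reduced to fit memory, gradient accumulation can recover the original effective batch size at no extra memory cost. A sketch under assumed placeholder model and data, not this repo's training loop:

```python
import torch
from torch import nn

model = nn.Linear(8, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # effective batch = micro-batch size * accum_steps

optimizer.zero_grad(set_to_none=True)
for _ in range(accum_steps):
    x = torch.randn(2, 8)  # small micro-batch that fits in memory
    y = torch.randn(2, 1)
    loss = nn.functional.mse_loss(model(x), y)
    # Divide so the accumulated gradient averages over all micro-batches.
    (loss / accum_steps).backward()
optimizer.step()  # one optimizer step per accumulated group
```

With PyTorch Lightning this corresponds to the `accumulate_grad_batches` argument of the `Trainer`.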
@jfischoff How long did it take you to complete the training?(●'◡'●)
I didn't run the complete training like that; I just did a quick test. I think it took 2 days on 8x A100s.
Thank you for responding.😊