Train model: CUDA out of memory
aoyang-hd opened this issue
Is there any way to train on a 24 GB RTX 3090, even with a batch size of one?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 3; 23.69 GiB total capacity; 23.03 GiB already allocated; 21.69 MiB free; 23.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epoch 0: 0%| | 2/35135 [00:29<144:16:07, 14.78s/it, loss=0.389, v_num=0, train/loss_simple_step=0.131, train/loss_vlb_step=0.000475, train/loss_step=0.131, global_step=0.000, train/loss_x0_step=0.335, train/loss_x0_from_tao_step=0.366, train/loss_noise_from_tao_step=0.00291, train/loss_net_step=0.704]
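The error message itself suggests trying `max_split_size_mb` when reserved memory far exceeds allocated memory. A minimal sketch of setting it via the environment variable the message names; the value `128` and the `train.py` entry point are assumptions, not from this repo:

```shell
# Reduce allocator fragmentation by capping the split size (value is an example).
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# Placeholder launch command; substitute the repo's actual training script.
python train.py
```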
Hello, you can try fp16 (mixed-precision) training.
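A minimal sketch of what an fp16 training step looks like with PyTorch's automatic mixed precision; the model, data, and hyperparameters here are placeholders, not this repo's actual training code, and the step falls back to an unscaled bf16/fp32 path when no GPU is present:

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 1).to(device)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# GradScaler rescales the loss so fp16 gradients do not underflow (CUDA only).
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(4, 16, device=device)  # placeholder batch
y = torch.randn(4, 1, device=device)

optimizer.zero_grad(set_to_none=True)
# Autocast runs the forward pass in half precision where it is numerically safe.
with torch.autocast(
    device_type=device,
    dtype=torch.float16 if device == "cuda" else torch.bfloat16,
):
    loss = nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

In a PyTorch Lightning setup (which the progress bar above suggests), the equivalent is usually just passing `precision=16` to the `Trainer`, so activations and the backward pass use roughly half the memory.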
@aoyang-hd @cswry @jfischoff I wanted to ask if you ran it successfully on a single GPU. I'd appreciate it if you could reply to me.
Yes, I just had to reduce the batch size.
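When the batch size is reduced to fit memory, gradient accumulation can recover the original effective batch size at no extra memory cost. A sketch under assumed placeholder model and data, not this repo's training loop:

```python
import torch
from torch import nn

model = nn.Linear(8, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # effective batch = micro-batch size * accum_steps

optimizer.zero_grad(set_to_none=True)
for _ in range(accum_steps):
    x = torch.randn(2, 8)  # small micro-batch that fits in memory
    y = torch.randn(2, 1)
    loss = nn.functional.mse_loss(model(x), y)
    # Divide so the accumulated gradient averages over all micro-batches.
    (loss / accum_steps).backward()
optimizer.step()  # one optimizer step per accumulated group
```

With PyTorch Lightning this corresponds to the `accumulate_grad_batches` argument of the `Trainer`.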
@jfischoff How long did it take you to complete the training?(●'◡'●)
I didn't run the complete training like that; I just did a quick test. I think it took 2 days on 8x A100s.
Thank you for responding.😊