GPU memory usage issue

Question

lelegogo26 opened this issue a year ago · comments

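With https://github.com/mymusise/ChatGLM-Tuning, training runs fine on a single GPU with batchsize=16 and gradient_accumulation_steps=1.
With this project, however, using per_device_train_batch_size=16 and gradient_accumulation_steps=1, running multi_gpu_fintune_belle.py on two GPUs fails with an out-of-memory error.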

calebgithub · Answer 1 · Tue Apr 25 2023 15:43:45 GMT+0800 (China Standard Time)

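Same problem here. With multiple GPUs, memory usage seems to be multiplied by N, so I can only set the maximum sequence length to 512. I haven't found the cause yet.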

liangwq · Answer 2 · Tue Apr 25 2023 18:24:47 GMT+0800 (China Standard Time)

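The project currently uses DeepSpeed stage 2, so intermediate results are cached. Each GPU trains on 16 samples and has to wait for the other GPU before data and parameters are synchronized, so you can roughly think of the intermediate cache on one GPU as being much larger than what 16 samples alone would need. In effect each card still processes 16 samples per step. With multiple GPUs you can lower the batch size on each device and training will not get slower: two GPUs work through the data at roughly twice the speed.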
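To make the batch-size arithmetic concrete, here is a minimal sketch. It assumes ordinary data-parallel training, where per_device_train_batch_size is counted per GPU, so the number of samples per optimizer step is the product of the per-device batch, the GPU count, and gradient_accumulation_steps. The helper names effective_batch_size and per_device_batch_for are made up for illustration and are not part of this project or of DeepSpeed.

```python
def effective_batch_size(per_device_train_batch_size: int,
                         num_gpus: int,
                         gradient_accumulation_steps: int = 1) -> int:
    """Samples consumed per optimizer step under plain data parallelism (assumed)."""
    return per_device_train_batch_size * num_gpus * gradient_accumulation_steps


def per_device_batch_for(target_global_batch: int,
                         num_gpus: int,
                         gradient_accumulation_steps: int = 1) -> int:
    """Per-GPU batch size that reproduces a given global batch size."""
    denom = num_gpus * gradient_accumulation_steps
    if target_global_batch % denom != 0:
        raise ValueError("target_global_batch must be divisible by "
                         "num_gpus * gradient_accumulation_steps")
    return target_global_batch // denom


if __name__ == "__main__":
    # Single-GPU setup from the question: 16 samples per step.
    print(effective_batch_size(16, num_gpus=1))
    # Same per-device setting on two GPUs: 32 samples per step.
    print(effective_batch_size(16, num_gpus=2))
    # Per-device batch that keeps the global batch at 16 on two GPUs.
    print(per_device_batch_for(16, num_gpus=2))
```

Under these assumptions, setting per_device_train_batch_size=8 on two GPUs matches the global batch of the single-GPU run while roughly halving the per-card activation memory. Whether that is enough to avoid the out-of-memory error also depends on sequence length and on the optimizer and gradient state held on each card.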