GLM-10B model efficiency issue

Question

tqjack opened this issue a year ago · comments

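On the same task and the same hardware (eight 3090 GPUs), the Hugging Face glm-10b run with accelerate (DeepSpeed) in ZeRO-3 mode with CPU offload reaches a maximum batch size of 4 at a max token length of 384. Running this repository's source code with ZeRO-3 + CPU offload and no tensor parallelism (mp=1) reaches a batch size of 24 at the same max token length of 384. What could be causing this difference?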
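For reference, here is a minimal sketch of what a ZeRO-3 + CPU offload setting like the one described above might look like as a DeepSpeed-style config dict. The actual config files used in the two runs are not shown in this issue, so the helper name `zero3_cpu_offload_config`, the fp16 setting, the pinned-memory flags, and the reading of "batch size" as a per-GPU micro batch size are all assumptions made for illustration.

```python
import json

# Max token length stated in the issue.
MAX_TOKEN_LENGTH = 384


def zero3_cpu_offload_config(micro_batch_size: int) -> dict:
    """Build a DeepSpeed-style config dict with ZeRO stage 3 and CPU offload.

    Hypothetical helper: the real configs behind the two runs are unknown.
    """
    return {
        # Assumption: the reported batch size is per GPU.
        "train_micro_batch_size_per_gpu": micro_batch_size,
        # Assumption: mixed precision is enabled; the issue does not say.
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            # Offload optimizer states and parameters to host memory.
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            "offload_param": {"device": "cpu", "pin_memory": True},
        },
    }


def tokens_per_batch(config: dict) -> int:
    """Tokens in one batch at the stated max token length."""
    return config["train_micro_batch_size_per_gpu"] * MAX_TOKEN_LENGTH


hf_config = zero3_cpu_offload_config(4)     # Hugging Face + accelerate run
repo_config = zero3_cpu_offload_config(24)  # this repository's code, mp=1

print(json.dumps(hf_config, indent=2))
print(tokens_per_batch(hf_config), tokens_per_batch(repo_config))
```

With the ZeRO settings held equal like this, the two runs differ only in batch size (a factor of 6 in tokens per batch), which suggests comparing the parts the config does not cover, such as activation checkpointing, precision, and the attention implementation.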