A mini demo training run requires only 1.62 GB of GPU memory (any consumer-grade GPU will do), and the model size that can be fine-tuned on a single GPU is increased by up to 3.7x.
LLaMA-33B (LoRA) performance is achieved with only ~16 hours of fine-tuning on the training splits of PubMedQA and MedMCQA using a single 8 x A100 server. The GPT-Neo-2.7B model can be trained on a single RTX 3090 (24 GB), but it is a rather weak model that only supports English and may sometimes generate unsatisfactory responses.
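As a rough illustration of this kind of memory-efficient LoRA fine-tuning, the sketch below attaches low-rank adapters with Hugging Face peft; the checkpoint name, target module names, and hyperparameters are illustrative assumptions, not the project's actual configuration.

```python
# Minimal LoRA sketch: only the adapter weights are trained, which is what
# keeps single-GPU memory usage low. Names/values below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "EleutherAI/gpt-neo-2.7B"  # assumption: any causal LM works here
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumption: attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows the small fraction of trainable weights
```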
A10 GPUs: the 6.9B and 2.8B parameter models should work as-is. V100 GPUs (e.g. p3.2xlarge with 1 x V100 16GB, or NC6s_v3): in all cases, set torch_dtype=torch.float16 in pipeline() instead.
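A minimal sketch of what that looks like with the transformers pipeline() on a V100; the checkpoint name and prompt are illustrative assumptions.

```python
# Load a Dolly-style instruction model in float16 (V100s do not support bfloat16).
import torch
from transformers import pipeline

generate_text = pipeline(
    model="databricks/dolly-v2-7b",   # assumption: any of the smaller checkpoints
    torch_dtype=torch.float16,        # float16 instead of bfloat16 on V100s
    trust_remote_code=True,
    device_map="auto",
)
res = generate_text("Explain what LoRA fine-tuning is.")
print(res[0]["generated_text"])
```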
Vision CAIR Group, KAUST, supported by Mohamed Elhoseiny
To save GPU memory, Vicuna is loaded in 8-bit by default, with a beam search width of 1. This configuration requires about 23 GB of GPU memory for Vicuna-13B and 11.5 GB for Vicuna-7B.
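A minimal sketch (not the project's exact code) of loading a Vicuna-style model in 8-bit and generating with a beam width of 1; the model identifier is an assumption for illustration.

```python
# 8-bit weights roughly halve memory versus fp16; beam width 1 avoids the
# extra memory that wider beam search would need.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-7b-v1.5"  # illustrative checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # requires bitsandbytes
    device_map="auto",
)

inputs = tokenizer("Describe this image in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, num_beams=1, max_new_tokens=64)  # beam search width of 1
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```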
DeepSpeed empowers ChatGPT-like model training with a single click, offering 15x speedup over SOTA RLHF systems with unprecedented cost reduction at all scales.
Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales. End to end, a 1.3-billion-parameter ChatGPT-like model can be trained via DeepSpeed-Chat on a single commodity NVIDIA A6000 GPU with 48 GB memory in about 2.2 hours.
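The full RLHF pipeline lives in the DeepSpeed-Chat examples; as a hedged sketch of just the core engine setup (not the one-click pipeline itself), the snippet below wraps a model with DeepSpeed ZeRO stage 2 and fp16. The model choice and config values are illustrative assumptions.

```python
# Wrap a causal LM with the DeepSpeed engine; ZeRO stage 2 partitions optimizer
# state and gradients across GPUs, which is part of how memory costs stay low.
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")  # illustrative model

ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}

# deepspeed.initialize returns the engine plus the optimizer it builds from the config.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```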
Microsoft
Summary of LLM models from major companies
Google's LaMDA (137B) and PaLM (540B), DeepMind's Gopher (280B), BigScience's BLOOM (175B), Meta's OPT (175B), Nvidia's TNLG v2 (530B), Tsinghua University's GLM-130B (130B), OpenAI's GPT-1 (0.117B) / GPT-2 (1.5B) / GPT-3 (175B) / GPT-3.5 (175B) / GPT-4 (undisclosed; speculated to be around 100 x 1000B), and OpenAI's fine-tunable models Ada (0.35B) / Babbage (1.3B) / Curie (6.7B) / Davinci (175B).