Two A100 GPUs for fine-tuning, insufficient memory?
hekaijie123 opened this issue
Hello author, I am running full-parameter fine-tuning on two A100 GPUs (40 GB each). Whether I use the ZeRO-2 or the ZeRO-3 configuration, I always run out of GPU memory. However, according to the "Model Fine-tuning Memory Usage Statistics" table you provided, the 2-GPU setup should use only 16 GB. How can this be resolved?
Have you set the model's parameters and optimizer to offload to the CPU?
You need to offload the model parameters and the optimizer states to the CPU, which further reduces GPU memory usage:
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
}
}