Resuming pretraining from checkpoint
lathashree01 opened this issue · comments
Check before submitting issues
- Make sure to pull the latest code, as some issues and bugs have been fixed.
- Due to frequent dependency updates, please ensure you have followed the steps in our Wiki
- I have read the FAQ section AND searched for similar issues and did not find a similar problem or solution
- For third-party plugin issues (e.g., llama.cpp, text-generation-webui, LlamaChat), we recommend checking the corresponding project for solutions
- Model validity check - Be sure to check the model's SHA256.md. If the model is incorrect, we cannot guarantee its performance
Type of Issue
Model training and fine-tuning
Base Model
LLaMA-7B
Operating System
Linux
Describe your issue in detail
I am pretraining the LLaMA-7B model, but the job was stopped by the cluster's time limit. When I restarted training from the checkpoint, I got this error:

"Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run."

In my DeepSpeed config, min_loss_scale is 1e-10, and the training loss before the interruption was 1.092. I have attached a snapshot. I am training in fp16; due to hardware limitations I cannot use bf16 (the cluster job fails with an error).

What can I do to solve this issue? Any help would be greatly appreciated.
Thanks.
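For context on why this exception fires, here is a minimal sketch (hypothetical, not DeepSpeed's actual implementation) of how a dynamic fp16 loss scaler typically behaves: halve the scale whenever gradients overflow, grow it after a window of clean steps, and abort once the scale has already bottomed out at the configured minimum.

```python
def step_scaler(scale, overflow, good_steps,
                min_scale=1.0, window=100, growth=2.0):
    """Advance a dynamic loss scaler by one step.

    Returns (new_scale, new_good_steps); raises when the scale cannot
    be lowered any further. Illustrative only - parameter names mirror
    the DeepSpeed fp16 config but the logic is a simplified sketch.
    """
    if overflow:
        if scale <= min_scale:
            # This is the condition behind the reported exception.
            raise RuntimeError(
                "Current loss scale already at minimum - "
                "cannot decrease scale anymore.")
        # Back off: halve the scale and reset the clean-step counter.
        return max(scale / growth, min_scale), 0
    good_steps += 1
    if good_steps >= window:
        # A full window of overflow-free steps: try a larger scale.
        return scale * growth, 0
    return scale, good_steps
```

If every step overflows right after resuming (for example, because the optimizer state and model weights in the checkpoint are inconsistent), the scale collapses from its initial value down to the minimum within a few dozen steps, which matches the symptom described above.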
Dependencies (must be provided for code-related issues)
my ds config:
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 100,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1e-10
  },
  "bf16": {
    "enabled": false
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 1e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 1e8,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
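One workaround some users try when dynamic scaling keeps collapsing is to pin the loss scale: in DeepSpeed's fp16 section, `loss_scale: 0` means dynamic scaling, while any nonzero value fixes the scale. A sketch of such an `fp16` block follows; the value 128 is an illustrative assumption, not a recommendation, and a fixed scale trades the overflow-abort behavior for the risk of silently under- or overflowing gradients.

```json
{
  "fp16": {
    "enabled": true,
    "loss_scale": 128,
    "loss_scale_window": 100,
    "hysteresis": 2
  }
}
```

If the scale collapse only happens on resume, it is also worth verifying that the checkpoint (including optimizer and loss-scaler state) loaded cleanly before changing the scaling strategy.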
Execution logs or screenshots
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.
Closing the issue since no updates were observed. Feel free to reopen if you need any further assistance.