how to continue model training?

phamkhactu opened this issue 8 months ago · comments

phamkhactu commented 8 months ago

Check before submitting issues

Make sure to pull the latest code, as some issues and bugs have been fixed.
Due to frequent dependency updates, please ensure you have followed the steps in our Wiki
I have read the FAQ section AND searched for similar issues and did not find a similar problem or solution
Third-party plugin issues - e.g., llama.cpp, text-generation-webui, LlamaChat, we recommend checking the corresponding project for solutions
Model validity check - Be sure to check the model's SHA256.md. If the model is incorrect, we cannot guarantee its performance

Type of Issue

Model training and fine-tuning

Base Model

LLaMA-7B

Operating System

Linux

Describe your issue in detail

I was training a model with run_clm_pt_with_peft.py, but my machine shut down suddenly after the model had trained for some steps. Now I want to resume training from the saved LoRA checkpoint. I've read the README but did not find anything about this.

Many thanks for your help.

Dependencies (must be provided for code-related issues)

No response

Execution logs or screenshots

No response

Gokul NC (Sarvam.AI) · Answer 1 · Thu Nov 09 2023 13:58:13 GMT+0800 (China Standard Time)

Hi @phamkhactu, how did you solve the problem?

phamkhactu · Answer 2 · Mon Nov 13 2023 10:12:43 GMT+0800 (China Standard Time)

Hi @GokulNC-Sarvam, I use the trainer and resume from the saved checkpoint:

    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
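For anyone wondering where the `resume_from_checkpoint` value might come from, here is a minimal sketch. It assumes the training run saved checkpoints under the output directory using the Trainer's default `checkpoint-<step>` folder naming. The helper name `find_last_checkpoint` is hypothetical (not part of the script or of any library), and `trainer` / `training_args` in the usage comment are assumed to be already constructed as in the training script.

```python
import os
import re
from typing import Optional

# Assumption: checkpoints are saved as "<output_dir>/checkpoint-<global_step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint directory with the highest step, or None if there is none."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        # Only consider directories whose name ends in a numeric step.
        if match and os.path.isdir(path):
            step = int(match.group(1))
            if step > best_step:
                best_step, best_path = step, path
    return best_path


# Hypothetical usage with an existing `trainer` and `training_args`:
# resume_from_checkpoint = find_last_checkpoint(training_args.output_dir)
# trainer.train(resume_from_checkpoint=resume_from_checkpoint)
```

If no checkpoint is found, the helper returns `None`, and passing `None` simply starts training from the beginning. Passing `resume_from_checkpoint=True` should also make the Trainer pick the latest checkpoint in its output directory by itself.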