BAAI-DCAI / Bunny

A family of lightweight multimodal models.

How to Evaluate a Fine-Tuned Model

HuBocheng opened this issue

I modified LazySupervisedDataset and fine-tuned the model on a new dataset. The final training output looks normal:

{'loss': 0.6775, 'grad_norm': 1.409849206533307, 'learning_rate': 5.2443095448506674e-08, 'epoch': 0.99}
{'loss': 0.8663, 'grad_norm': 1.7525256828314943, 'learning_rate': 2.9500369369195312e-08, 'epoch': 0.9925}
{'loss': 1.0061, 'grad_norm': 1.4918140105277027, 'learning_rate': 1.3111633436779791e-08, 'epoch': 0.995}
{'loss': 1.0717, 'grad_norm': 2.0363632906847373, 'learning_rate': 3.2779620843692572e-09, 'epoch': 0.9975}
{'loss': 1.2656, 'grad_norm': 1.821970205229175, 'learning_rate': 0.0, 'epoch': 1.0}
{'train_runtime': 5472.9931, 'train_samples_per_second': 2.337, 'train_steps_per_second': 0.073, 'train_loss': 1.1184636522829532, 'epoch': 1.0}
[2024-06-02 18:23:00,818] [INFO] [launch.py:351:main] Process 206306 exits successfully.
[2024-06-02 18:23:01,820] [INFO] [launch.py:351:main] Process 206305 exits successfully.

Afterwards, I obtained the following checkpoint files at Bunny/checkpoints-phi-2/bunny-lora-phi-2:

ls
README.md            adapter_model.safetensors  log.txt                  trainer_state.json
adapter_config.json  config.json                non_lora_trainables.bin

I would like to run some benchmarks on Bunny with the fine-tuned model, using scripts such as script/eval/full/mmbench.sh and bunny/eval/model_vqa_mmbench.py. However, the checkpoint I obtained only contains adapter_model.safetensors, which holds the LoRA adapter weights rather than the full model parameters.

Could you please advise how to save the full model after training (similar to BAAI/Bunny-v1_0-3B on Hugging Face), so that I can evaluate the fine-tuned model's performance on the various benchmarks?
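For context, I assume merging would follow the standard Hugging Face PEFT pattern, roughly like the sketch below. The base-model name, paths, and loading details are placeholders rather than the repo's actual merge script, and this only covers the LoRA part (non_lora_trainables.bin, e.g. the projector weights, would still need to be handled separately):

# Hedged sketch: fold LoRA adapter weights back into the base model with PEFT.
# Model name and paths are placeholders, not Bunny's actual arguments.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.float16, trust_remote_code=True)
model = PeftModel.from_pretrained(base, "checkpoints-phi-2/bunny-lora-phi-2")
model = model.merge_and_unload()              # merge LoRA deltas into the base weights
model.save_pretrained("bunny-phi-2-merged")   # full-weight checkpoint for evaluation

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.save_pretrained("bunny-phi-2-merged")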

Thank you very much for your assistance!

I have figured out how to run the evaluation. I just need to use script/merge_lora_weights.py first to merge the LoRA weights into the base model. It seems I missed that information in evaluate.md. I apologize for wasting your time. 🫣
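For anyone who finds this later, the invocation is roughly of the shape below. The flag names are my guess based on similar LLaVA-style merge scripts, so check python script/merge_lora_weights.py --help (or evaluate.md) for the actual arguments:

# Hypothetical flags; verify against the script before running.
python script/merge_lora_weights.py \
    --model-path ./checkpoints-phi-2/bunny-lora-phi-2 \
    --model-base microsoft/phi-2 \
    --save-model-path ./checkpoints-phi-2/bunny-phi-2-merged

The merged output directory can then be passed as the model path to the evaluation scripts mentioned above.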