jerryji1993 / DNABERT

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Home Page: https://doi.org/10.1093/bioinformatics/btab083

How can I track model loss and accuracy for each epoch during fine-tuning, to make sure the model is stable?

XuanrZhang opened this issue · comments

hi @Zhihan1996

Thanks for developing this useful tool.

I used your pre-trained DNA6M model to fine-tune on my dataset for binary classification; it takes 16 hours even for a very small dataset. I tried to use a GPU, but it didn't work. Any suggestions for running on GPU?
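For context on the GPU question: DNABERT's fine-tuning script is based on the HuggingFace Transformers example scripts, which use the GPU automatically whenever PyTorch can see one, so a 16-hour run on a small dataset usually means training silently fell back to CPU. A quick diagnostic sketch (assuming a standard PyTorch install; a CPU-only torch build or an unset/empty `CUDA_VISIBLE_DEVICES` are the usual culprits):

```shell
# Confirm PyTorch was built with CUDA support and can see a device.
# A CPU-only wheel prints "False 0" here and is the most common cause
# of silent CPU fallback.
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

# Check the driver/devices at the system level.
nvidia-smi

# Make a specific GPU visible to the fine-tuning run.
export CUDA_VISIBLE_DEVICES=0
```

If `torch.cuda.is_available()` returns False despite `nvidia-smi` working, reinstalling PyTorch with a CUDA build matching your driver version typically fixes it.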

Also, I would like to know where I can find the training log file, so I can track loss and accuracy during fine-tuning.
After training, these are all the files I got. I need to plot the model loss to see whether the model is sufficiently trained and stable.
├── config.json
├── eval_results.txt
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
├── training_args.bin
└── vocab.txt
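One workaround, if the script prints loss/metric lines to stdout at every `--logging_steps` interval (as the Transformers example scripts it is derived from do), is to redirect that output to a file and parse it afterwards. A minimal sketch, assuming a hypothetical log format of `step = N, loss = X, ...` (the exact format depends on the script's logger, so adjust the regex to match your actual output):

```python
import re

# Hypothetical log excerpt; replace with the redirected stdout of
# the fine-tuning run, e.g. open("train.log").read().
log_text = """\
step = 100, loss = 0.6931, eval_acc = 0.52
step = 200, loss = 0.5214, eval_acc = 0.71
step = 300, loss = 0.4102, eval_acc = 0.78
"""

# Extract (step, loss) pairs; adapt the pattern to the real log lines.
pattern = re.compile(r"step = (\d+), loss = ([\d.]+)")
points = [(int(step), float(loss)) for step, loss in pattern.findall(log_text)]
print(points)  # [(100, 0.6931), (200, 0.5214), (300, 0.4102)]
```

The resulting `points` list can then be plotted with matplotlib to check that the loss curve has flattened out.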

I ran fine-tuning with the parameters below.

python3 /g/data/zk16/xzhang/DNABERT/examples/run_finetune.py \
--model_type dnalongcat \
--tokenizer_name=/g/data/zk16/xzhang/DNABERT/pre_model/6-new-12w-0/vocab.txt \
--model_name_or_path $MODEL_PATH \
--task_name dnaprom \
--do_train \
--do_eval \
--data_dir $DATA_PATH \
--max_seq_length 1536 \
--per_gpu_train_batch_size=32   \
--per_gpu_eval_batch_size=32  \
--learning_rate 2e-4 \
--num_train_epochs 5.0 \
--output_dir $OUTPUT_PATH \
--evaluate_during_training \
--logging_steps 100 \
--save_steps 4000 \
--warmup_percent 0.1 \
--hidden_dropout_prob 0.1 \
--overwrite_output \
--weight_decay 0.01 \
--n_process 8
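Worth noting about the command above: with `--evaluate_during_training` and `--logging_steps 100` set, Transformers-style scripts typically also write TensorBoard event files (via `SummaryWriter`) during training, usually under the output directory or a `runs/` subdirectory. If such files exist, the loss and eval curves can be inspected directly, without parsing text logs (a sketch, assuming TensorBoard is installed and the event files landed in `$OUTPUT_PATH`):

```shell
# Look for TensorBoard event files produced during fine-tuning.
find $OUTPUT_PATH -name "events.out.tfevents.*"

# If found, browse the loss/accuracy curves interactively.
tensorboard --logdir $OUTPUT_PATH
```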

I really appreciate any help you can provide.

Best,
Xuan