jerryji1993 / DNABERT

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Home Page: https://doi.org/10.1093/bioinformatics/btab083


Finetuning Issue with Example Data

mosala777 opened this issue

Hello,

I tried finetuning the 6-mer model with the provided example data. Performance seems good at first, but then it suddenly drops significantly and raises a warning. The first few evaluations start normally and improve like this:

05/20/2022 19:32:46 - INFO - main - ***** Eval results *****
05/20/2022 19:32:46 - INFO - main - acc = 0.944
05/20/2022 19:32:46 - INFO - main - auc = 0.988568
05/20/2022 19:32:46 - INFO - main - f1 = 0.9439997759991039
05/20/2022 19:32:46 - INFO - main - mcc = 0.8880071040852491
05/20/2022 19:32:46 - INFO - main - precision = 0.9440071041136658
05/20/2022 19:32:46 - INFO - main - recall = 0.944

Then the metrics drop significantly and the following warning appears:

/home/maborageh/dnabert/lib/python3.6/site-packages/sklearn/metrics/_classification.py:1248: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/home/maborageh/dnabert/lib/python3.6/site-packages/sklearn/metrics/_classification.py:873: RuntimeWarning: invalid value encountered in double_scalars
mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
05/20/2022 21:09:44 - INFO - main - ***** Eval results *****
05/20/2022 21:09:44 - INFO - main - acc = 0.5
05/20/2022 21:09:44 - INFO - main - auc = 0.59788
05/20/2022 21:09:44 - INFO - main - f1 = 0.3333333333333333
05/20/2022 21:09:44 - INFO - main - mcc = 0.0
05/20/2022 21:09:44 - INFO - main - precision = 0.25
05/20/2022 21:09:44 - INFO - main - recall = 0.5
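
For context, these degenerate values are exactly what the metrics collapse to once the model predicts only one class: precision for the never-predicted label is undefined (sklearn sets it to 0.0 and warns), and the MCC denominator np.sqrt(cov_ytyt * cov_ypyp) becomes zero because the predictions have no variance. A minimal sketch that reproduces the numbers above, assuming macro-averaged metrics on a balanced dev set:

```python
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef, precision_score, recall_score

# Balanced binary dev set; a collapsed model predicts only class 0.
y_true = np.array([0] * 500 + [1] * 500)
y_pred = np.zeros(1000, dtype=int)

print(precision_score(y_true, y_pred, average="macro"))  # 0.25, with UndefinedMetricWarning
print(recall_score(y_true, y_pred, average="macro"))     # 0.5
print(f1_score(y_true, y_pred, average="macro"))         # 0.3333...
print(matthews_corrcoef(y_true, y_pred))                 # 0.0 (zero denominator)
```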

I saw that some updates were made, as mentioned in #10, but I am still facing this issue. I would appreciate any feedback from you.

Kind regards,
Salah

commented

Hi developers

I am facing the same issue as @mosala777. Here is part of the stdout I observed:

    INFO - main - Loading features from cached file /home/weiyuan/Desktop/rbp/model/HNRNPA1/data_HNRNPA1/cached_dev_DNABERT3_101_dnaprom
    12/20/2022 14:46:07 - INFO - main - ***** Running evaluation *****
    12/20/2022 14:46:07 - INFO - main - Num examples = 5350
    12/20/2022 14:46:07 - INFO - main - Batch size = 32
    Evaluating: 100%|█████████████| 168/168 [03:38<00:00, 1.30s/it]
    12/20/2022 14:49:46 - INFO - main - ***** Eval results *****
    12/20/2022 14:49:46 - INFO - main - acc = 0.7816822429906543
    12/20/2022 14:49:46 - INFO - main - auc = 0.8839985990874502
    12/20/2022 14:49:46 - INFO - main - f1 = 0.7782037084362665
    12/20/2022 14:49:46 - INFO - main - mcc = 0.5846169035386265
    12/20/2022 14:49:46 - INFO - main - precision = 0.8024470439531897
    12/20/2022 14:49:46 - INFO - main - recall = 0.7825097242114136
    /home/weiyuan/mambaforge/envs/dnabert/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:247: UserWarning: To get the last learning rate computed by the scheduler, please use get_last_lr().
    warnings.warn("To get the last learning rate computed by the scheduler, "
    {"eval_acc": 0.7816822429906543, "eval_f1": 0.7782037084362665, "eval_mcc": 0.5846169035386265, "eval_auc": 0.8839985990874502, "eval_precision": 0.8024470439531897, "eval_recall": 0.7825097242114136, "learning_rate": 0.0001956160743938891, "loss": 0.4473942193388939, "step": 400}
    12/20/2022 14:57:46 - INFO - main - Loading features from cached file /home/weiyuan/Desktop/rbp/model/HNRNPA1/data_HNRNPA1/cached_dev_DNABERT3_101_dnaprom
    12/20/2022 14:57:46 - INFO - main - ***** Running evaluation *****
    12/20/2022 14:57:46 - INFO - main - Num examples = 5350
    12/20/2022 14:57:46 - INFO - main - Batch size = 32
    Evaluating: 100%|█████████████| 168/168 [03:16<00:00, 1.17s/it]
    /home/weiyuan/mambaforge/envs/dnabert/lib/python3.6/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use zero_division parameter to control this behavior.
    _warn_prf(average, modifier, msg_start, len(result))
    /home/weiyuan/mambaforge/envs/dnabert/lib/python3.6/site-packages/sklearn/metrics/_classification.py:900: RuntimeWarning: invalid value encountered in double_scalars
    mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
    12/20/2022 15:01:03 - INFO - main - ***** Eval results *****
    12/20/2022 15:01:03 - INFO - main - acc = 0.4968224299065421
    12/20/2022 15:01:03 - INFO - main - auc = 0.4861425095900458
    12/20/2022 15:01:03 - INFO - main - f1 = 0.3319180819180819
    12/20/2022 15:01:03 - INFO - main - mcc = 0.0
    12/20/2022 15:01:03 - INFO - main - precision = 0.24841121495327104
    12/20/2022 15:01:03 - INFO - main - recall = 0.5
    /home/weiyuan/mambaforge/envs/dnabert/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:247: UserWarning: To get the last learning rate computed by the scheduler, please use get_last_lr().
    warnings.warn("To get the last learning rate computed by the scheduler, "
    {"eval_acc": 0.4968224299065421, "eval_f1": 0.3319180819180819, "eval_mcc": 0.0, "eval_auc": 0.4861425095900458, "eval_precision": 0.24841121495327104, "eval_recall": 0.5, "learning_rate": 0.0001889737628694786, "loss": 0.5985230031609535, "step": 500}

I would appreciate any feedback too, thank you.

Best Regards
WY

commented

The same problem is discussed here: ThilinaRajapakse/simpletransformers#234

The solutions seem to be:

  • lower the learning rate (--learning_rate 2e-5)
  • use a smaller batch size (--per_gpu_train_batch_size 64)
  • perhaps delete the cache (--overwrite_cache)

Edited: the parameters above are the ones I used to get a smooth run (a full example command is sketched below).
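
For reference, a hedged sketch of how those flags slot into a finetuning command. Only the three flags in the list above come from this thread; the script name, the remaining flags, and the paths are placeholders based on the standard DNABERT/transformers finetuning setup and may differ in your checkout:

```bash
# Hypothetical invocation; replace the placeholder paths with your own.
python run_finetune.py \
    --model_type dna \
    --model_name_or_path $MODEL_PATH \
    --data_dir $DATA_PATH \
    --output_dir $OUTPUT_PATH \
    --do_train --do_eval \
    --learning_rate 2e-5 \
    --per_gpu_train_batch_size 64 \
    --overwrite_cache
```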

commented

Hi @CherWeiYuan, I am facing a similar issue to the one mentioned above. I tried your solution, but I am still not seeing any improvement in the results.

commented

Hi @NikitaBhandare! This solution worked for me, but I had to manually delete all models from ~/.cache/huggingface/hub/ and set seeds for numpy, torch, cuda, etc. before loading a new model. Did you try this?

P.S. It looks like lr=2e-4 is an error; it should definitely be 2e-5.
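
For anyone reproducing this, a minimal sketch of the seeding step described above; the helper name is mine, and it should be called once before loading the model:

```python
import random

import numpy as np
import torch

def set_all_seeds(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch (CPU and CUDA) RNGs so repeated runs are comparable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # safe no-op when no GPU is present

set_all_seeds(42)  # call before loading/finetuning the model
```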