yangheng95 / PyABSA

Sentiment Analysis, Text Classification, Text Augmentation, Text Adversarial defense, etc.;

Home Page: https://pyabsa.readthedocs.io


Different performance between model saved as fine-tuned PLM and state_dict

zedavid opened this issue · comments

Version
PyABSA = 2.3.4rc0
Torch = 2.1.1
Transformers = 4.35.2

Describe the bug
I've fine-tuned a model twice with the FAST_LSA_S_V2 config on the same dataset using the APCTrainer. In one run I saved it as a state_dict file, and in the other I saved it as a fine-tuned PLM. I've then run the model on sample data using the APC.SentimentClassifier and the HF text-classification pipeline, but I get different results despite the model being trained the same way on the same data.
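
For reference, this is roughly how the two training runs were set up (a minimal sketch, since the training script isn't included here; the dataset name and the exact save-mode constants are assumptions based on PyABSA 2.x's ModelSaveOption):

from pyabsa import AspectPolarityClassification as APC, ModelSaveOption, DeviceTypeOption

config = APC.APCConfigManager.get_apc_config_english()
config.model = APC.APCModelList.FAST_LSA_S_V2

trainer = APC.APCTrainer(
    config=config,
    dataset='uber',  # hypothetical custom dataset name
    # run 1: checkpoint_save_mode=ModelSaveOption.SAVE_MODEL_STATE_DICT
    # run 2: checkpoint_save_mode=ModelSaveOption.SAVE_FINE_TUNED_PLM
    checkpoint_save_mode=ModelSaveOption.SAVE_MODEL_STATE_DICT,
    auto_device=DeviceTypeOption.AUTO,
)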

Code To Reproduce

Loading and testing the state_dict version:

from pyabsa import AspectPolarityClassification as APC

sentiment_model = APC.SentimentClassifier('checkpoints/fast_lsa_s_v2_uber_acc_98.2_f1_98.14/')
examples = [
    "ty images city officials are standing their ground in defense of the law delivery workers like all workers deserve fair pay for their labor and we are disappointed that [B-ASP]Uber[E-ASP] doordash grubhub and relay disagree vilda vera mayuga head of the city's department of consumer and worker protection said in",
    "as prepared just a mile away for it to be delivered to me the food arrived stone cold i complained to [B-ASP]Uber[E-ASP] and [B-ASP]Uber[E-ASP] told me to go pound sand readers purporting to work for [B-ASP]Uber[E-ASP] left dozens of comments castigating me and others for not preemptively tipping arguing that such deliveries are not worth"
]

sentiment_model.predict(
    text=examples,
    eval_batch_size=32,
)

output:

[{'text': "ty images city officials are standing their ground in defense of the law delivery workers like all workers deserve fair pay for their labor and we are disappointed that Uber doordash grubhub and relay disagree vilda vera mayuga head of the city's department of consumer and worker protection said in",
  'aspect': ['Uber'],
  'sentiment': ['Negative'],
  'confidence': [0.9339152574539185],
  'probs': [array([0.93391526, 0.05876274, 0.007322  ], dtype=float32)],
  'ref_sentiment': ['-100'],
  'ref_check': [''],
  'perplexity': 'N.A.'},
 {'text': 'as prepared just a mile away for it to be delivered to me the food arrived stone cold i complained to Uber and Uber told me to go pound sand readers purporting to work for Uber left dozens of comments castigating me and others for not preemptively tipping arguing that such deliveries are not worthwh',
  'aspect': ['Uber', 'Uber', 'Uber'],
  'sentiment': ['Negative', 'Negative', 'Negative'],
  'confidence': [0.9557020664215088, 0.9557020664215088, 0.9557020664215088],
  'probs': [array([0.95570207, 0.03284235, 0.01145565], dtype=float32),
   array([0.95570207, 0.03284235, 0.01145565], dtype=float32),
   array([0.95570207, 0.03284235, 0.01145565], dtype=float32)],
  'ref_sentiment': ['-100', '-100', '-100'],
  'ref_check': ['', '', ''],
  'perplexity': 'N.A.'}]

With the HF text-classification pipeline:

import re
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_tokenizer = AutoTokenizer.from_pretrained('checkpoints/fast_lsa_s_v2_uber_acc_98.2_f1_98.14/fine-tuned-pretrained-model/')
model = AutoModelForSequenceClassification.from_pretrained('checkpoints/fast_lsa_s_v2_uber_acc_98.2_f1_98.14/fine-tuned-pretrained-model/')
sentiment_pipeline = pipeline('text-classification', model=model, tokenizer=model_tokenizer, device=1)
examples_no_tag = [{'text': re.sub(r"\[B-ASP\](.+?)\[E-ASP\]", r"\1", ex), 'text_pair': 'Uber'} for ex in examples]
sentiment_pipeline(examples_no_tag, top_k=3)

Output:

[[{'label': 'Neutral', 'score': 0.38777175545692444},
  {'label': 'Positive', 'score': 0.3418353199958801},
  {'label': 'Negative', 'score': 0.27039292454719543}],
 [{'label': 'Neutral', 'score': 0.3863997459411621},
  {'label': 'Positive', 'score': 0.3450266420841217},
  {'label': 'Negative', 'score': 0.2685735821723938}]]

Expected behavior
I would expect some correspondence between the output probabilities of the two versions of the model.

Thanks!

The model saved in huggingface format is not intended for direct inference but for further fine-tuning; the state_dict is the recommended save mode. If you want to run a model with the pipeline, there is a model that has been released at: https://huggingface.co/yangheng/deberta-v3-base-absa-v1.1
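
For instance, a minimal sketch of using that released checkpoint with the standard text-classification pipeline (the example text is shortened from the reproduction above; top_k is illustrative):

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = 'yangheng/deberta-v3-base-absa-v1.1'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
absa_pipeline = pipeline('text-classification', model=model, tokenizer=tokenizer)

# pass the review text as `text` and the aspect term as `text_pair`
absa_pipeline({'text': 'the food arrived stone cold i complained to Uber', 'text_pair': 'Uber'}, top_k=3)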

I see. What is required to make a model runnable with the huggingface pipeline?
Also, is there a PyABSA checkpoint for the huggingface model? I would like to replicate the results I get with the pipeline using PyABSA.

I am sorry about that. It is tricky to train models compatible with the huggingface pipeline, and I have since cleaned up the original materials such as the code, so I am afraid I cannot provide detailed help with that.