Different Results between FARM and Huggingface pretrained

Question

DHOFM opened this issue 3 years ago · comments

Warning: This issue is about German hate speech detection, so it contains examples that some readers may find offensive.

FARM and the deepset/bert-base-german-cased-hatespeech-GermEval18Coarse model from Hugging Face give different results. You can try the Hugging Face API or simply use Transformers like this:

from transformers import pipeline

sentiment_analysis = pipeline("text-classification", model="deepset/bert-base-german-cased-hatespeech-GermEval18Coarse")
If I use doc_classification.py from FARM, which uses the original dataset and German BERT, I get different predictions, like this:

Pretrained from huggingface:
"Da haben die Goldstücke wieder Folklore gespielt" (German hatespeech against refugees)
Label: OFFENSE
Confidence Score: 0.663702666759491

FARM:
'context': 'Da haben die Goldstücke wieder Folklore gespielt', 'label': 'OTHER', 'probability': 0.69867057}

So in this case the FARM prediction would be wrong, but compare these two examples:

"Was hier passiert, ist eine Islamisierung"
Label: OFFENSE
Confidence Score: 0.887525737285614
'context': 'Was hier passiert, ist eine Islamisierung', 'label': 'OFFENSE', 'probability': 0.53441}]} (FARM)

And
"Was hier passiert, ist keine Islamisierung"
Label: OFFENSE
Confidence Score: 0.8187811970710754
'context': 'Was hier passiert, ist keine Islamisierung', 'label': 'OTHER', 'probability': 0.509054}] (FARM)

Here FARM works better, because its prediction takes the negation into account.
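The comparison above can be written down as a small check. This is only a sketch: the labels and scores are copied from the examples in this post, while the dictionary layout and the helper names `disagreements` and `label_flips` are made up for illustration and are not part of FARM or Transformers.

```python
# Predictions copied from the examples above.
# "hf" is the Hugging Face model, "farm" is the model trained with FARM.
hf = {
    "Da haben die Goldstücke wieder Folklore gespielt": ("OFFENSE", 0.663702666759491),
    "Was hier passiert, ist eine Islamisierung": ("OFFENSE", 0.887525737285614),
    "Was hier passiert, ist keine Islamisierung": ("OFFENSE", 0.8187811970710754),
}
farm = {
    "Da haben die Goldstücke wieder Folklore gespielt": ("OTHER", 0.69867057),
    "Was hier passiert, ist eine Islamisierung": ("OFFENSE", 0.53441),
    "Was hier passiert, ist keine Islamisierung": ("OTHER", 0.509054),
}


def disagreements(a, b):
    """Return the sentences on which the two models predict different labels."""
    return [text for text in a if text in b and a[text][0] != b[text][0]]


def label_flips(preds, sentence, negated):
    """True if the predicted label changes between a sentence and its negation."""
    return preds[sentence][0] != preds[negated][0]


pair = (
    "Was hier passiert, ist eine Islamisierung",
    "Was hier passiert, ist keine Islamisierung",
)
differing = disagreements(hf, farm)
hf_flips = label_flips(hf, *pair)
farm_flips = label_flips(farm, *pair)

for text in differing:
    print(f"{text!r}: HF={hf[text][0]}, FARM={farm[text][0]}")
print("HF label changes with negation:", hf_flips)
print("FARM label changes with negation:", farm_flips)
```

A label flip on one minimal pair is of course weak evidence; the FARM probabilities for both sentences are close to 0.5.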

Maybe the parameters used for training the Hugging Face model are different, because if I "play" with the FARM script and change evaluate_every from 100 to 90, I get:

[{'task': 'text_classification', 'task_name': 'text_classification', 'predictions': [{'start': None, 'end': None, 'context': 'Da haben die Goldstücke wieder Folklore gespielt', 'label': 'OTHER', 'probability': 0.71600443}, {'start': None, 'end': None, 'context': 'Da haben die Musiker wieder Folklore gespielt', 'label': 'OTHER', 'probability': 0.90432215}, {'start': None, 'end': None, 'context': 'Was hier passiert, ist eine Islamisierung', 'label': 'OFFENSE', 'probability': 0.5677871}, {'start': None, 'end': None, 'context': 'Was hier passiert, ist keine Islamisierung', 'label': 'OFFENSE', 'probability': 0.5210618}]}]

This run also ignores the negation (KEINE Islamisierung).
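To read such a dump more easily, the nested FARM output can be flattened into one entry per sentence. The sketch below uses the predictions from the run above (with the `start` and `end` fields dropped for brevity); the `flatten` helper is a hypothetical name, not a FARM function.

```python
# Output of the evaluate_every = 90 run, copied from above
# (the 'start' and 'end' fields are omitted for brevity).
result = [{
    "task": "text_classification",
    "task_name": "text_classification",
    "predictions": [
        {"context": "Da haben die Goldstücke wieder Folklore gespielt",
         "label": "OTHER", "probability": 0.71600443},
        {"context": "Da haben die Musiker wieder Folklore gespielt",
         "label": "OTHER", "probability": 0.90432215},
        {"context": "Was hier passiert, ist eine Islamisierung",
         "label": "OFFENSE", "probability": 0.5677871},
        {"context": "Was hier passiert, ist keine Islamisierung",
         "label": "OFFENSE", "probability": 0.5210618},
    ],
}]


def flatten(batches):
    """Map each input sentence to its (label, probability) over all batches."""
    return {
        p["context"]: (p["label"], p["probability"])
        for batch in batches
        for p in batch["predictions"]
    }


by_text = flatten(result)
eine = by_text["Was hier passiert, ist eine Islamisierung"]
keine = by_text["Was hier passiert, ist keine Islamisierung"]

# The label stays the same; only the probability drops slightly.
negation_changes_label = eine[0] != keine[0]
margin = round(eine[1] - keine[1], 4)

print("Label changes with negation:", negation_changes_label)
print("Probability difference:", margin)
```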

I ran some more experiments, also with more fine-grained labels and the 2019 dataset I found here: https://projects.fzai.h-da.de/iggsa/data-2019/

I used the Gold Master as test data. I also noticed that more epochs do not result in better quality. This could be a result of overfitting or catastrophic forgetting, as described by Ruder (see https://ruder.io/state-of-transfer-learning-in-nlp/). Ruder also says: "…, large pretrained models (e.g. BERT-Large) are prone to degenerate performance when fine-tuned on tasks with small training sets. In practice, the observed behavior is often “on-off”: the model either works very well or does not work at all as can be seen in the figure below. Understanding the conditions and causes of this behavior is an open research question." So maybe it is partly a matter of luck.

I used the FARM experiments and added a few more inference sentences taken from Twitter. Here are the results and the training scores; maybe you find them useful:

Experiment:
https://public-mlflow.deepset.ai/#/experiments/2/runs/4e6d49981dee4ea8b44732ad258f7cf3

Result:
[{'task': 'text_classification', 'task_name': 'text_classification', 'predictions': [{'start': None, 'end': None, 'context': 'Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot sei', 'label': 'OFFENSE', 'probability': 0.8166369}, {'start': None, 'end': None, 'context': 'Martin Müller spielt Handball in Berlin', 'label': 'OTHER', 'probability': 0.86606246}, {'start': None, 'end': None, 'context': 'Da haben die Goldstücke wieder Folklore gespielt', 'label': 'OTHER', 'probability': 0.81829953}, {'start': None, 'end': None, 'context': 'Sachlich bleiben? Am Arsch. Was für ein sacksaudummer unverschämter Drecksack. Mal wieder.', 'label': 'OFFENSE', 'probability': 0.9283451}]}, {'task': 'text_classification', 'task_name': 'text_classification', 'predictions': [{'start': None, 'end': None, 'context': 'Der politsystemschmarotzende Stillstands-und-Plutokratie-Wahrer, Klimaschutzbremser, Freiheitsraser usw. Merz schwafelt von mangelndem Risikobewusstsein. Sachlich bleiben? Am Arsch. Was für ein sacksaudummer unverschämter Drecksack. Mal wieder.', 'label': 'OFFENSE', 'probability': 0.93744975}, {'start': None, 'end': None, 'context': 'Der politsystemschmarotzende Stillstands-und-Plutokratie-Wahrer, Klimaschutzbremser, Freiheitsraser usw. 
Merz schwafelt von mangelndem Risikobewusstsein.', 'label': 'OTHER', 'probability': 0.6976832}, {'start': None, 'end': None, 'context': 'Was hier passiert, ist keine Islamisierung', 'label': 'OTHER', 'probability': 0.53867924}, {'start': None, 'end': None, 'context': 'Was hier passiert, ist eine Islamisierung', 'label': 'OFFENSE', 'probability': 0.5642283}]}, {'task': 'text_classification', 'task_name': 'text_classification', 'predictions': [{'start': None, 'end': None, 'context': 'Fotzen-Fritz ist ein menschenverachtender Penner.', 'label': 'OFFENSE', 'probability': 0.95295376}, {'start': None, 'end': None, 'context': 'merz hat zumindest seinen Beitrag für Straffreiheit bei Vergewaltigung in der Ehe geleistet.Das könnte man ja jetzt Assi finden, aber dafür hat er sich 2004 auch für die Abschaffung des Kündigungsschutz eingebracht ', 'label': 'OTHER', 'probability': 0.8628088}]}]

Experiment:
https://public-mlflow.deepset.ai/#/experiments/2/runs/5d31974067974d3cbd350e4cc8ba5372

Result:
[{'task': 'text_classification', 'task_name': 'text_classification', 'predictions': [{'start': None, 'end': None, 'context': 'Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot sei', 'label': 'INSULT', 'probability': 0.7283285}, {'start': None, 'end': None, 'context': 'Martin Müller spielt Handball in Berlin', 'label': 'OTHER', 'probability': 0.50663143}, {'start': None, 'end': None, 'context': 'Da haben die Goldstücke wieder Folklore gespielt', 'label': 'OTHER', 'probability': 0.42929974}, {'start': None, 'end': None, 'context': 'Sachlich bleiben? Am Arsch. Was für ein sacksaudummer unverschämter Drecksack. Mal wieder.', 'label': 'INSULT', 'probability': 0.72942}]}, {'task': 'text_classification', 'task_name': 'text_classification', 'predictions': [{'start': None, 'end': None, 'context': 'Der politsystemschmarotzende Stillstands-und-Plutokratie-Wahrer, Klimaschutzbremser, Freiheitsraser usw. Merz schwafelt von mangelndem Risikobewusstsein. Sachlich bleiben? Am Arsch. Was für ein sacksaudummer unverschämter Drecksack. Mal wieder.', 'label': 'INSULT', 'probability': 0.70200384}, {'start': None, 'end': None, 'context': 'Der politsystemschmarotzende Stillstands-und-Plutokratie-Wahrer, Klimaschutzbremser, Freiheitsraser usw. 
Merz schwafelt von mangelndem Risikobewusstsein.', 'label': 'INSULT', 'probability': 0.4689992}, {'start': None, 'end': None, 'context': 'Was hier passiert, ist keine Islamisierung', 'label': 'OTHER', 'probability': 0.50008565}, {'start': None, 'end': None, 'context': 'Was hier passiert, ist eine Islamisierung', 'label': 'OTHER', 'probability': 0.45455748}]}, {'task': 'text_classification', 'task_name': 'text_classification', 'predictions': [{'start': None, 'end': None, 'context': 'Islamisierung und Afrikanisierung läuft...', 'label': 'ABUSE', 'probability': 0.52468437}, {'start': None, 'end': None, 'context': 'Fotzen-Fritz ist ein menschenverachtender Penner.', 'label': 'INSULT', 'probability': 0.68054426}, {'start': None, 'end': None, 'context': 'merz hat zumindest seinen Beitrag für Straffreiheit bei Vergewaltigung in der Ehe geleistet.Das könnte man ja jetzt Assi finden, aber dafür hat er sich 2004 auch für die Abschaffung des Kündigungsschutz eingebracht ', 'label': 'OTHER', 'probability': 0.6218756}]}]

Germeval19…

Experiment:
https://public-mlflow.deepset.ai/#/experiments/2/runs/1285375a32864b9fb1c48741bc80616a

Result:
[{'task': 'text_classification', 'task_name': 'text_classification', 'predictions': [{'start': None, 'end': None, 'context': 'Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot sei', 'label': 'INSULT', 'probability': 0.9630955}, {'start': None, 'end': None, 'context': 'Martin Müller spielt Handball in Berlin', 'label': 'OTHER', 'probability': 0.99413645}, {'start': None, 'end': None, 'context': 'Da haben die Goldstücke wieder Folklore gespielt', 'label': 'OTHER', 'probability': 0.9920414}, {'start': None, 'end': None, 'context': 'Sachlich bleiben? Am Arsch. Was für ein sacksaudummer unverschämter Drecksack. Mal wieder.', 'label': 'INSULT', 'probability': 0.99872476}]}, {'task': 'text_classification', 'task_name': 'text_classification', 'predictions': [{'start': None, 'end': None, 'context': 'Der politsystemschmarotzende Stillstands-und-Plutokratie-Wahrer, Klimaschutzbremser, Freiheitsraser usw. Merz schwafelt von mangelndem Risikobewusstsein. Sachlich bleiben? Am Arsch. Was für ein sacksaudummer unverschämter Drecksack. Mal wieder.', 'label': 'INSULT', 'probability': 0.9995358}, {'start': None, 'end': None, 'context': 'Der politsystemschmarotzende Stillstands-und-Plutokratie-Wahrer, Klimaschutzbremser, Freiheitsraser usw. 
Merz schwafelt von mangelndem Risikobewusstsein.', 'label': 'INSULT', 'probability': 0.7903242}, {'start': None, 'end': None, 'context': 'Was hier passiert, ist keine Islamisierung', 'label': 'OTHER', 'probability': 0.9992895}, {'start': None, 'end': None, 'context': 'Was hier passiert, ist eine Islamisierung', 'label': 'OTHER', 'probability': 0.99898916}]}, {'task': 'text_classification', 'task_name': 'text_classification', 'predictions': [{'start': None, 'end': None, 'context': 'Islamisierung und Afrikanisierung läuft...', 'label': 'OTHER', 'probability': 0.99923444}, {'start': None, 'end': None, 'context': 'Fotzen-Fritz ist ein menschenverachtender Penner.', 'label': 'INSULT', 'probability': 0.9954383}, {'start': None, 'end': None, 'context': 'merz hat zumindest seinen Beitrag für Straffreiheit bei Vergewaltigung in der Ehe geleistet.Das könnte man ja jetzt Assi finden, aber dafür hat er sich 2004 auch für die Abschaffung des Kündigungsschutz eingebracht ', 'label': 'OTHER', 'probability': 0.999169}]}]

Experiment:
https://public-mlflow.deepset.ai/#/experiments/2/runs/c35d4ea33da74b33ab39c67823a2b6c8

Result:
[{'task': 'text_classification', 'task_name': 'text_classification', 'predictions': [{'start': None, 'end': None, 'context': 'Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot sei', 'label': 'INSULT', 'probability': 0.46150762}, {'start': None, 'end': None, 'context': 'Martin Müller spielt Handball in Berlin', 'label': 'OTHER', 'probability': 0.44524938}, {'start': None, 'end': None, 'context': 'Da haben die Goldstücke wieder Folklore gespielt', 'label': 'PROFANITY', 'probability': 0.50535744}, {'start': None, 'end': None, 'context': 'Sachlich bleiben? Am Arsch. Was für ein sacksaudummer unverschämter Drecksack. Mal wieder.', 'label': 'INSULT', 'probability': 0.5270026}]}, {'task': 'text_classification', 'task_name': 'text_classification', 'predictions': [{'start': None, 'end': None, 'context': 'Der politsystemschmarotzende Stillstands-und-Plutokratie-Wahrer, Klimaschutzbremser, Freiheitsraser usw. Merz schwafelt von mangelndem Risikobewusstsein. Sachlich bleiben? Am Arsch. Was für ein sacksaudummer unverschämter Drecksack. Mal wieder.', 'label': 'INSULT', 'probability': 0.48413393}, {'start': None, 'end': None, 'context': 'Der politsystemschmarotzende Stillstands-und-Plutokratie-Wahrer, Klimaschutzbremser, Freiheitsraser usw. 
Merz schwafelt von mangelndem Risikobewusstsein.', 'label': 'ABUSE', 'probability': 0.43944603}, {'start': None, 'end': None, 'context': 'Was hier passiert, ist keine Islamisierung', 'label': 'OTHER', 'probability': 0.436592}, {'start': None, 'end': None, 'context': 'Was hier passiert, ist eine Islamisierung', 'label': 'OTHER', 'probability': 0.40293643}]}, {'task': 'text_classification', 'task_name': 'text_classification', 'predictions': [{'start': None, 'end': None, 'context': 'Islamisierung und Afrikanisierung läuft...', 'label': 'OTHER', 'probability': 0.44942927}, {'start': None, 'end': None, 'context': 'Fotzen-Fritz ist ein menschenverachtender Penner.', 'label': 'INSULT', 'probability': 0.6045533}, {'start': None, 'end': None, 'context': 'merz hat zumindest seinen Beitrag für Straffreiheit bei Vergewaltigung in der Ehe geleistet.Das könnte man ja jetzt Assi finden, aber dafür hat er sich 2004 auch für die Abschaffung des Kündigungsschutz eingebracht ', 'label': 'OTHER', 'probability': 0.39297706}]}]

Experiment:
https://public-mlflow.deepset.ai/#/experiments/2/runs/a33c14cb18f049f9833426eb46baf69c

Result:
[{'task': 'text_classification', 'task_name': 'text_classification', 'predictions': [{'start': None, 'end': None, 'context': 'Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot sei', 'label': 'INSULT', 'probability': 0.8453335}, {'start': None, 'end': None, 'context': 'Martin Müller spielt Handball in Berlin', 'label': 'OTHER', 'probability': 0.8630693}, {'start': None, 'end': None, 'context': 'Da haben die Goldstücke wieder Folklore gespielt', 'label': 'OTHER', 'probability': 0.4445184}, {'start': None, 'end': None, 'context': 'Sachlich bleiben? Am Arsch. Was für ein sacksaudummer unverschämter Drecksack. Mal wieder.', 'label': 'INSULT', 'probability': 0.977749}]}, {'task': 'text_classification', 'task_name': 'text_classification', 'predictions': [{'start': None, 'end': None, 'context': 'Der politsystemschmarotzende Stillstands-und-Plutokratie-Wahrer, Klimaschutzbremser, Freiheitsraser usw. Merz schwafelt von mangelndem Risikobewusstsein. Sachlich bleiben? Am Arsch. Was für ein sacksaudummer unverschämter Drecksack. Mal wieder.', 'label': 'INSULT', 'probability': 0.9642144}, {'start': None, 'end': None, 'context': 'Der politsystemschmarotzende Stillstands-und-Plutokratie-Wahrer, Klimaschutzbremser, Freiheitsraser usw. 
Merz schwafelt von mangelndem Risikobewusstsein.', 'label': 'INSULT', 'probability': 0.8090083}, {'start': None, 'end': None, 'context': 'Was hier passiert, ist keine Islamisierung', 'label': 'OTHER', 'probability': 0.9304664}, {'start': None, 'end': None, 'context': 'Was hier passiert, ist eine Islamisierung', 'label': 'OTHER', 'probability': 0.8834291}]}, {'task': 'text_classification', 'task_name': 'text_classification', 'predictions': [{'start': None, 'end': None, 'context': 'Islamisierung und Afrikanisierung läuft...', 'label': 'OTHER', 'probability': 0.9483292}, {'start': None, 'end': None, 'context': 'Fotzen-Fritz ist ein menschenverachtender Penner.', 'label': 'INSULT', 'probability': 0.9519503}, {'start': None, 'end': None, 'context': 'merz hat zumindest seinen Beitrag für Straffreiheit bei Vergewaltigung in der Ehe geleistet.Das könnte man ja jetzt Assi finden, aber dafür hat er sich 2004 auch für die Abschaffung des Kündigungsschutz eingebracht ', 'label': 'OTHER', 'probability': 0.96095407}]}]

So as a result, as we can see, the 2019 one-epoch experiment also works better on our first question ("Goldstücke"). Maybe this is because the Hugging Face model is not based on the 2018 data, or because of the prediction heads FARM uses. So it would be great if you could share the parameters used for the Hugging Face model, so that I can try them in FARM. For the 2019 experiment I used the seed Julian Risch uses here https://github.com/julian-risch/KONVENS2019_and_LREC2020/tree/germeval2019, but of course with the current FARM repo.
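Because some runs use the coarse labels (OFFENSE, OTHER) and others the fine-grained ones (ABUSE, INSULT, PROFANITY, OTHER), it helps to map everything to the coarse scheme before comparing runs. The sketch below does this for two of the sentences, with labels and probabilities copied from the five results above. The run keys are the first eight characters of the MLflow run IDs, and `coarse_votes` is a hypothetical helper.

```python
from collections import Counter

# In the GermEval coarse task, OFFENSE covers ABUSE, INSULT and PROFANITY.
FINE_TO_COARSE = {
    "ABUSE": "OFFENSE",
    "INSULT": "OFFENSE",
    "PROFANITY": "OFFENSE",
    "OFFENSE": "OFFENSE",
    "OTHER": "OTHER",
}

GOLD = "Da haben die Goldstücke wieder Folklore gespielt"
ISLAM = "Was hier passiert, ist eine Islamisierung"

# (label, probability) per run, copied from the result dumps above.
runs = {
    "4e6d4998": {GOLD: ("OTHER", 0.81829953), ISLAM: ("OFFENSE", 0.5642283)},
    "5d319740": {GOLD: ("OTHER", 0.42929974), ISLAM: ("OTHER", 0.45455748)},
    "1285375a": {GOLD: ("OTHER", 0.9920414), ISLAM: ("OTHER", 0.99898916)},
    "c35d4ea3": {GOLD: ("PROFANITY", 0.50535744), ISLAM: ("OTHER", 0.40293643)},
    "a33c14cb": {GOLD: ("OTHER", 0.4445184), ISLAM: ("OTHER", 0.8834291)},
}


def coarse_votes(runs, sentence):
    """Count how many runs predict each coarse label for one sentence."""
    return Counter(
        FINE_TO_COARSE[preds[sentence][0]] for preds in runs.values()
    )


votes_gold = coarse_votes(runs, GOLD)
votes_islam = coarse_votes(runs, ISLAM)

print("Goldstücke:", dict(votes_gold))
print("Islamisierung:", dict(votes_islam))
```

Seen this way, only one of the five runs assigns an offensive label to each of the two sentences, and it is a different run for each sentence.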

Kind regards,

Dirk

Timo Moeller · Answer 1 · Tue Jul 27 2021 22:16:07 GMT+0800 (China Standard Time)

Hey @DHOFM thanks for the detailed description.

Maybe the parameters used for training the huggingface model are different, because if I „play“ with the FARM Script, changing evaluate_every = 90 (from 100) I get ...

So you are saying you cannot replicate the same performance when training models? I thought you meant that inference with our deepset/bert-base-german-cased-hatespeech-GermEval18Coarse model gives different results in FARM vs HF/Transformers?

Getting different models that predict different things when training (even with the same seed) is not uncommon. There are many free variables, and especially for small datasets, as you correctly cited from Ruder's paper, the resulting models vary a lot.
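For reference, a minimal sketch of what fixing the seed usually involves. The helper name `set_seeds` and the seed value 42 are assumptions for illustration, and the NumPy and PyTorch calls are only made if those libraries are installed. Even with all of this in place, some GPU operations are not deterministic, so two training runs can still end up with different models.

```python
import os
import random


def set_seeds(seed):
    """Seed every random number generator available in this environment."""
    random.seed(seed)
    # Only affects Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass


# The same seed reproduces the same sequence from Python's own generator.
set_seeds(42)
first = [random.random() for _ in range(3)]
set_seeds(42)
second = [random.random() for _ in range(3)]

print("Same random sequence:", first == second)
```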

stale · Answer 2 · Thu Nov 25 2021 01:42:00 GMT+0800 (China Standard Time)

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 21 days if no further activity occurs.