amir-zeldes / xrenner

eXternally configurable REference and Non Named Entity Recognizer


Example case including parser

lucienbaumgartner opened this issue · comments

Hi, I'm trying to get xrenner to work, but I'm running into problems with the tokenizer from the transformers package. Here is the code I'm trying to run:

import xrenner

data = """
1	The	the	DT	DT	_	4	det	_	_
2	New	New	NNP	NNP	_	3	nn	_	_
3	Zealand	Zealand	NNP	NNP	_	4	nn	_	_
4	government	government	NN	NN	_	5	nsubj	_	_
5	intends	intend	VBZ	VBZ	_	0	root	_	_
6	to	to	TO	TO	_	7	aux	_	_
7	hold	hold	VB	VB	_	5	xcomp	_	_
8	two	two	CD	CD	_	9	num	_	_
9	referendums	referendum	NNS	NNS	_	7	dobj	_	_
10	to	to	TO	TO	_	11	aux	_	_
11	reach	reach	VB	VB	_	7	vmod	_	_
12	a	a	DT	DT	_	13	det	_	_
13	verdict	verdict	NN	NN	_	11	dobj	_	_
14	on	on	IN	IN	_	13	prep	_	_
15	the	the	DT	DT	_	16	det	_	_
16	flag	flag	NN	NN	_	14	pobj	_	_
17	,	,	,	,	_	0	punct	_	_
18	at	at	IN	IN	_	7	prep	_	_
19	an	an	DT	DT	_	21	det	_	_
20	estimated	estimate	VBN	VBN	_	21	amod	_	_
21	cost	cost	NN	NN	_	18	pobj	_	_
22	of	of	IN	IN	_	21	prep	_	_
23	NZ	NZ	NNP	NNP	_	24	nn	_	_
24	$	$	$	$	_	22	pobj	_	_
25	26	@card@	CD	CD	_	26	number	_	_
26	million	million	CD	CD	_	24	num	_	_
27	,	,	,	,	_	0	punct	_	_
28	although	although	IN	IN	_	32	mark	_	_
29	a	a	DT	DT	_	31	det	_	_
30	recent	recent	JJ	JJ	_	31	amod	_	_
31	poll	poll	NN	NN	_	32	nsubj	_	_
32	found	find	VBD	VBD	_	5	advcl	_	_
33	only	only	RB	RB	_	35	advmod	_	_
34	a	a	DT	DT	_	35	det	_	_
35	quarter	quarter	NN	NN	_	38	nsubj	_	_
36	of	of	IN	IN	_	35	prep	_	_
37	citizens	citizen	NNS	NNS	_	36	pobj	_	_
38	favoured	favour	VBD	VBD	_	32	ccomp	_	_
39	changing	change	VBG	VBG	_	38	xcomp	_	_
40	the	the	DT	DT	_	41	det	_	_
41	flag	flag	NN	NN	_	39	dobj	_	_
42	.	.	.	.	_	0	punct	_	_
"""
print(data)

xrenner = xrenner.Xrenner()

sgml_result = xrenner.analyze(infile=data, out_format="sgml")
print(sgml_result)

This raises the following AttributeError:

Traceback (most recent call last):
  File "/Users/lucienbaumgartner/phd/projects/done/tc_methods_paper/src/animacy-classification/test.py", line 56, in <module>
    sgml_result = xrenner.analyze(infile=data, out_format="sgml")
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/xrenner/modules/xrenner_xrenner.py", line 163, in analyze
    seq_preds = lex.sequencer.predict_proba(s_texts)
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/xrenner/modules/xrenner_sequence.py", line 304, in predict_proba
    preds = self.tagger.predict(sentences)
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py", line 369, in predict
    feature = self.forward(batch)
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py", line 608, in forward
    self.embeddings.embed(sentences)
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/flair/embeddings/token.py", line 71, in embed
    embedding.embed(sentences)
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/flair/embeddings/base.py", line 60, in embed
    self._add_embeddings_internal(sentences)
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/flair/embeddings/legacy.py", line 1197, in _add_embeddings_internal
    for sentence in sentences
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/flair/embeddings/legacy.py", line 1197, in <listcomp>
    for sentence in sentences
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 357, in tokenize
    tokenized_text = split_on_tokens(no_split_token, text)
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 351, in split_on_tokens
    for token in tokenized_text
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 351, in <genexpr>
    for token in tokenized_text
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/transformers/tokenization_bert.py", line 219, in _tokenize
    for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/transformers/tokenization_bert.py", line 416, in tokenize
    elif self.strip_accents:
AttributeError: 'BasicTokenizer' object has no attribute 'strip_accents'

I suspect that this has something to do with the format of the data object. The documentation does not make clear which parser you use to transform/annotate plain text into the CoNLL format, which is why I'm using an already parsed text string in the right format. I tried the spacy_conllu parser as well as the conllu parser, but neither works for me. Would it be possible for you to provide an example from A to Z, including parsing plain text into the CoNLL format?

I'm using Python 3.7.11 with the following package versions:

(animacy3.7.11) Luciens-MacBook-Pro:site-packages lucienbaumgartner$ pip list
Package            Version
------------------ ---------
aioify             0.4.0
attrs              21.2.0
beautifulsoup4     4.9.3
blis               0.7.4
bpemb              0.3.3
bs4                0.0.1
catalogue          2.0.4
certifi            2021.5.30
charset-normalizer 2.0.3
click              7.1.2
cloudpickle        1.6.0
conll              0.0.0
conllu             4.4
cycler             0.10.0
cymem              2.0.5
decorator          4.4.2
Deprecated         1.2.12
en-core-web-sm     3.1.0
filelock           3.0.12
flair              0.6.1
Flask              2.0.1
ftfy               6.0.3
future             0.18.2
gdown              3.13.0
gensim             4.0.1
hyperopt           0.2.5
idna               3.2
importlib-metadata 3.10.1
iniconfig          1.1.1
iso639             0.1.4
itsdangerous       2.0.1
Janome             0.4.1
Jinja2             3.0.1
joblib             1.0.1
jsonschemanlplab   3.0.1.1
kiwisolver         1.3.1
konoha             4.6.5
langdetect         1.0.9
lxml               4.6.3
MarkupSafe         2.0.1
matplotlib         3.4.2
module-wrapper     0.3.1
mpld3              0.3
murmurhash         1.0.5
networkx           2.5.1
nltk               3.6.2
numpy              1.21.1
overrides          3.1.0
packaging          21.0
pathy              0.6.0
Pillow             8.3.1
pip                21.2.1
pluggy             0.13.1
preshed            3.0.5
protobuf           3.17.3
py                 1.10.0
pydantic           1.8.2
pyjsonnlp          0.2.33
pyparsing          2.4.7
pyrsistent         0.18.0
PySocks            1.7.1
pytest             6.2.4
python-dateutil    2.8.2
python-dotenv      0.19.0
python-Levenshtein 0.12.2
regex              2021.7.6
requests           2.26.0
sacremoses         0.0.45
scikit-learn       0.24.2
scipy              1.7.0
segtok             1.5.10
sentencepiece      0.1.96
setuptools         47.1.0
six                1.16.0
smart-open         5.1.0
soupsieve          2.2.1
spacy              3.1.1
spacy-conll        3.0.2
spacy-legacy       3.0.8
sqlitedict         1.7.0
srsly              2.4.1
stanza             1.2.2
stdlib-list        0.8.0
syntok             1.3.1
tabulate           0.8.9
thinc              8.0.8
threadpoolctl      2.2.0
tokenizers         0.8.1rc2
toml               0.10.2
torch              1.9.0
tqdm               4.61.2
transformers       3.3.0
typer              0.3.2
typing-extensions  3.10.0.0
urllib3            1.26.6
wasabi             0.8.2
wcwidth            0.2.5
Werkzeug           2.0.1
wheel              0.36.2
wrapt              1.12.1
xgboost            0.90
xmltodict          0.12.0
xrenner            2.2.0.0
xrennerjsonnlp     0.0.5
zipp               3.5.0

Thanks a lot in advance!

Hi, and thanks for reporting this bug - I don't think the parser is the cause; it looks like the error is triggered by an incompatibility between the installed transformers tokenizer version and the version the model was trained with. I assume you're using the pre-trained eng_flair_nner_distilbert.pt in models/_sequence_taggers?

I can confirm that that model works with:

flair                         0.6.1
torch                         1.6.0+cu101
transformers                  3.5.1
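
If you want to double-check what's installed in your environment, a minimal sketch like the following prints the three relevant versions for comparison:

# Minimal sketch: print the versions of the three libraries the pretrained
# sequencer is sensitive to, for comparison with the combination listed above.
import flair
import torch
import transformers

print("flair        ", flair.__version__)
print("torch        ", torch.__version__)
print("transformers ", transformers.__version__)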

So transformers itself could be the problem - can you try 3.5.1? You may also want to try out this newer model based on Electra rather than DistilBERT, which is a bit more accurate and trained on the latest GUM7:

https://corpling.uis.georgetown.edu/amir/download/eng_flair_nner_electra_gum7.pt

To use this, you would need to edit the English model's config.ini file (if the model is not yet unzipped, you will need to unzip eng.xrm to do that), and set:

# Optional path to serialized pre-trained sequence classifier for entity head classification
sequencer=eng_flair_nner_electra_gum7.pt
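
If you're not sure where the model files live, a rough sketch along these lines unpacks them (the models path, and whether xrenner picks up an unzipped models/eng/ directory in place of eng.xrm, may differ for your installation, so double-check):

# Hedged sketch: unpack eng.xrm so config.ini becomes editable.
# Assumption: xrenner's models live in <package>/models and an unzipped
# models/eng/ directory is used in place of eng.xrm.
import os
import zipfile
import xrenner

models_dir = os.path.join(os.path.dirname(xrenner.__file__), "models")
with zipfile.ZipFile(os.path.join(models_dir, "eng.xrm")) as zf:
    zf.extractall(os.path.join(models_dir, "eng"))

# Afterwards, set sequencer=eng_flair_nner_electra_gum7.pt in
# models/eng/config.ini and place the downloaded .pt file in
# models/_sequence_taggers/.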

Finally, as an accurate parser for input to the system, I would recommend a transformer-based parser over spaCy, such as Diaparser:

https://github.com/Unipisa/diaparser

Here is a highly accurate pretrained model for GUM7:

https://corpling.uis.georgetown.edu/amir/download/en_gum7.electra-base.diaparser.pt

Hope that helps!

Thanks a lot for the quick reply and your suggestions; they were very helpful! Yes, exactly, I'm using the pre-trained eng_flair_nner_distilbert.pt.
I upgraded transformers to 3.5.1, so that I have the same setup as you:

flair                         0.6.1
torch                         1.6.0
transformers                  3.5.1

I cannot install torch v1.6.0+cu101 on macOS, as far as I know, so I'm using torch 1.6.0. Unfortunately, the same error still occurs if I use the pre-trained eng_flair_nner_distilbert.pt. With the Electra model you suggested, however, the code runs fine. I tried both models (DistilBERT and Electra) with i) a string in CoNLL format, ii) the Diaparser you kindly suggested (with the pretrained model for GUM7), and iii) the spaCy parser. While it works with the spaCy output, the Diaparser output does not get annotated at all. I tried this:

import xrenner
from diaparser.parsers import Parser

txt = "Trees play a significant role in reducing erosion and moderating the climate. They remove carbon dioxide from the atmosphere and store large quantities of carbon in their tissues. Trees and forests provide a habitat for many species of animals and plants. Tropical rainforests are among the most biodiverse habitats in the world. Trees provide shade and shelter, timber for construction, fuel for cooking and heating, and fruit for food as well as having many other uses. In parts of the world, forests are shrinking as trees are cleared to increase the amount of land available for agriculture. Because of their longevity and usefulness, trees have always been revered, with sacred groves in various cultures, and they play a role in many of the world's mythologies."

parser = Parser.load('en_gum7.electra-base.diaparser.pt')
data = parser.predict(txt, text='en')

xrenner = xrenner.Xrenner()
result = xrenner.analyze(data, "html")
print(result)

Coercing the Diaparser output to a string also didn't change anything. Do you see what I'm doing wrong here?

If the Electra model works, I wouldn't bother with getting DistilBERT to run; the Electra one is about +4 F1 on entity type recognition.

For the parser, I should have been clearer: Diaparser is just a parser, not an NLP toolkit like Stanza etc. It only does dependency attachment and relation labeling on preprocessed data (tokenized and sentence-split), so you will also need to get POS tags and lemmas from somewhere else. However, it is substantially more accurate than, say, Stanza (coincidentally also about +4 LAS out of the box). To run it, you need to feed it a list of sentences, each a list of tokens (so a list of lists); see the Diaparser documentation for details. If you can tolerate somewhat lower accuracy, Stanza should work pretty well too, and it predicts everything from plain text. I've also seen Trankit around, which is much like Stanza but transformer-based, so that might be worth a try as well (I think it uses RoBERTa for everything?).
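
For reference, here is a minimal sketch of that list-of-lists workflow. It assumes the prediction object exposes its parsed sentences as CoNLL-U strings via dataset.sentences (check the Diaparser documentation for your version), and it leaves the lemma/POS columns unfilled, so in practice you would merge those in from a tagger/lemmatizer such as Stanza before handing the string to xrenner:

# Hedged sketch of the list-of-lists Diaparser workflow described above.
# Assumption: the returned dataset exposes CoNLL-U formatted sentences via
# dataset.sentences; verify against the Diaparser documentation.
import xrenner
from diaparser.parsers import Parser

# Pre-tokenized, sentence-split input: one list of tokens per sentence.
sentences = [
    ["Trees", "play", "a", "significant", "role", "in", "reducing",
     "erosion", "and", "moderating", "the", "climate", "."],
    ["They", "remove", "carbon", "dioxide", "from", "the", "atmosphere", "."],
]

parser = Parser.load("en_gum7.electra-base.diaparser.pt")
dataset = parser.predict(sentences)

# Join the parsed sentences into one CoNLL-U string. Diaparser only predicts
# heads and relation labels, so the lemma and POS columns are still "_" here
# and should be filled in from a tagger/lemmatizer before analysis.
conll_str = "\n".join(str(sent) for sent in dataset.sentences)

xr = xrenner.Xrenner()
print(xr.analyze(conll_str, "html"))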