nikitakit / self-attentive-parser

High-accuracy NLP parser with models for 11 languages.

Home Page: https://parser.kitaev.io/

Parsing tagged text

thorunna opened this issue

Hi,

I'm trying to parse input text that has already been POS tagged, using a model that includes a tagger. For this experiment, I'd like to disregard the tagger included in the parsing model and instead have the parser use the existing tags. Is this possible?

@thorunna Did you have any luck finding out if this is possible?

@jkallini No I didn't, but please let me know if you have any!

As of benepar 0.2.0a0, there is a new API integrated with NLTK that makes it easier to parse text with existing tags. If the tags field of benepar.InputSentence is not None, the provided tags will be passed through to the output (but if the tags field is None, benepar will do its own POS tagging).

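A minimal sketch of that NLTK-style API might look like the following (the words and tags are made up for illustration, and it assumes the benepar_en3 model has already been downloaded):

import benepar

# benepar.download("benepar_en3")  # one-time model download, if needed
parser = benepar.Parser("benepar_en3")

sentence = benepar.InputSentence(
    words=["The", "dog", "barked", "."],  # pre-tokenized input
    tags=["DT", "NN", "VBD", "."],  # existing POS tags, passed through to the output
)
tree = parser.parse(sentence)  # returns an nltk.Tree
print(tree.pformat())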

Hi Nikita,
I see that version 0.2.0a0 does not offer this feature through the spaCy integration, which is the recommended way to use benepar. Is there any benefit to the spaCy integration if I am parsing English corpus data that has already been tokenized and POS tagged by a human? I just want to make sure: will I get better results by using the existing tags with the NLTK integration, rather than starting from raw text and using spaCy?

With spaCy, you should be able to do the following to disable benepar's POS tagger and fall back on spaCy's instead.

import benepar, spacy

nlp = spacy.load("en_core_web_md")  # any English spaCy pipeline; en_core_web_md is just an example
if spacy.__version__.startswith('2'):
    nlp.add_pipe(benepar.BeneparComponent("benepar_en3", disable_tagger=True))
else:
    nlp.add_pipe("benepar", config={"model": "benepar_en3", "disable_tagger": True})

You can also inject your own POS tags into spaCy:

# spacy_sent is one sentence (a spaCy Span) from a processed Doc;
# my_tags is a parallel list of Penn Treebank tag strings
for i in range(len(spacy_sent)):
    spacy_sent[i].tag_ = my_tags[i]  # my_tags[i] is a string, e.g. NN
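
A rough end-to-end sketch of that idea under spaCy 3 (the words, tags, and the en_core_web_md pipeline name are illustrative assumptions): it builds a pre-tokenized, pre-tagged Doc and runs only the benepar component, so spaCy's own tagger never overwrites the supplied tags.

import benepar, spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_md")
nlp.add_pipe("benepar", config={"model": "benepar_en3", "disable_tagger": True})

words = ["The", "dog", "barked", "."]  # hypothetical pre-tokenized sentence
my_tags = ["DT", "NN", "VBD", "."]  # hypothetical human-provided POS tags

# build a Doc with the gold tokenization and tags, marked as a single sentence
doc = Doc(nlp.vocab, words=words, tags=my_tags,
          sent_starts=[True, False, False, False])

# run only the benepar component so the rest of the pipeline never touches the tags
doc = nlp.get_pipe("benepar")(doc)
print(list(doc.sents)[0]._.parse_string)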

But the only things the spaCy integration offers over NLTK are non-destructive, higher-quality tokenization and better sentence segmentation. If sentence segmentation, tokenization, and tagging have already been done by a human, I don't think spaCy offers anything extra (unless you prefer its API to NLTK's).