explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Home Page:https://spacy.io

Annotating BILOU tags from another system

viksit opened this issue

I have a domain specific NER system that generates BILOU tags for a given sentence. What would be the best way to integrate this information into spacy?

In #187, there's an example of how to train the system on new data. But I'm not entirely sure if there's a way to do something like:

doc = nlp(u"this is a lion")
custom_ents = get_custom_ents(doc)
# >>  ['O', 'O', 'O', 'U-ANIMAL']
# function called annotate to combine this information into spacy's tokens/spans
annotate(doc, custom_ents) # how do we write this?
print([(i.text, i.label_) for i in doc.ents])
# >> [(lion, 'ANIMAL')]

You should be able to do:

doc.ents = [(label, start, end) for (label, start, end) in ents]

Example --- label "best buy" as a retailer:

nlp.entity.add_label(u'RETAILER')
retailer = nlp.vocab.strings[u'RETAILER']
doc = nlp(u'best buy is a pretty bad store')
doc.ents = [(retailer, 0, 2)]
span = doc[0:2]
best_buy = list(doc.ents)[0]
assert span.start == best_buy.start == 0
assert span.end == best_buy.end == 2

The API here isn't very polished. I'm surprised that doc.ents = [] doesn't clear the existing entities: assignment only ever adds entities. This should really be changed.

Here's some more detailed usage description:

  • Label should be an integer encoding of the label. You should register it with the NER as well.
  • Start is an integer indicating the start of the slice, i.e. the index of the first token within the document. Watch out for changed indices from .merge() operations.
  • End is an integer indicating the end of the slice (exclusive), so the entity covers doc[start:end].

Finally, here's the relevant code:

https://github.com/spacy-io/spaCy/blob/master/spacy/tokens/doc.pyx#L178
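To connect this back to the original question, here is a minimal sketch of how per-token BILOU tags from an external system could be turned into the (label, start, end) triples described above. The function name bilou_to_spans is hypothetical and not part of spaCy; the sketch assumes tags of the form 'O', 'U-X', 'B-X', 'I-X', 'L-X', and it returns labels as strings rather than integer IDs.

```python
def bilou_to_spans(tags):
    """Convert per-token BILOU tags into (label, start, end) triples.

    Hypothetical helper, not a spaCy function. `end` is exclusive, so each
    triple corresponds to the token slice [start:end]. Labels stay as
    strings; converting them to integer IDs is left to the caller.
    """
    spans = []
    start = None  # token index where the currently open entity began
    label = None  # label of the currently open entity
    for i, tag in enumerate(tags):
        if tag == "O":
            # Outside any entity: drop whatever was open.
            start, label = None, None
            continue
        prefix, _, tag_label = tag.partition("-")
        if prefix == "U":
            # Single-token entity.
            spans.append((tag_label, i, i + 1))
            start, label = None, None
        elif prefix == "B":
            # First token of a multi-token entity.
            start, label = i, tag_label
        elif prefix in ("I", "L"):
            if start is None or tag_label != label:
                raise ValueError("tag %r at index %d has no matching B- tag" % (tag, i))
            if prefix == "L":
                # Last token of the entity: close the span.
                spans.append((label, start, i + 1))
                start, label = None, None
        else:
            raise ValueError("unrecognised tag %r at index %d" % (tag, i))
    return spans


example = bilou_to_spans(["O", "O", "O", "U-ANIMAL"])
multi = bilou_to_spans(["O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL", "O"])
```

With the spaCy version discussed in this thread, each string label would presumably still need to be mapped to its integer ID (for example through nlp.vocab.strings) before the triples are assigned to doc.ents.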

@syllog1sm awesome, thanks for the information.

@syllog1sm couple of follow ups.


animal = nlp.vocab.strings[u"ANIMAL"]
doc1 = nlp(u"this is a lion and that is a royal bengal tiger that Michael Collins loved on the Apollo 11")

print()
print(list(doc1.ents))
print([(i.text, i.label_) for i in doc1.ents])
print([i.ent_iob_ for i in doc1])


>> [Michael Collins, Apollo]
>> [(u'Michael Collins', u'PERSON'), (u'Apollo', u'ORG')]
>> ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'O', 'O', 'O', 'B', 'O']

old = [(i.label, i.start, i.end) for i in doc1.ents]

# derived from external ner
new = [(animal, 3, 4), (animal, 8, 11)]

doc1.ents = old + new
print()
print([(i.text, i.label_) for i in doc1.ents])
print([i.ent_iob_ for i in doc1])
print("entities: ", list(doc1.ents))

>> [(u'lion', u'ANIMAL'), (u'royal bengal tiger', u'ANIMAL'), (u'Michael Collins', u'PERSON'), (u'Apollo', u'ORG')]
>> ['', '', '', 'B', '', '', '', '', 'B', 'I', 'I', '', 'B', 'I', '', '', '', 'B', '']
>> entities:  [lion, royal bengal tiger, Michael Collins, Apollo]

lion = doc1[3:4]
rbt = doc1[8:11]
lion_ent, rbt_ent, mc, apollo = list(doc1.ents)

assert lion_ent.start == lion.start
assert rbt_ent.start == rbt.start

Questions:

  • It looks like spacy loses the 'O' tag after adding new entities. Is this on purpose?
  • I don't see L or U tags anywhere - why's that?

Thanks for the report — fixed.

I don't see L or U tags anywhere - why's that?

Currently the ent_iob field stores the IOB markers, even though the model is trained with BILUO tags. Maybe this should change — if you want to advocate for that, it's best if we start a new thread.
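If L and U markers are needed downstream, they can be recovered from the IOB markers, since both schemes describe the same spans. The following is a rough sketch only; iob_to_biluo here is a hypothetical helper written for illustration, and it assumes every entity starts with a 'B' marker. Anything other than 'B' or 'I' (including the empty strings seen earlier in this thread) is treated as 'O'.

```python
def iob_to_biluo(iob_tags):
    """Rewrite per-token IOB markers ('I', 'O', 'B') as BILUO markers.

    Hypothetical helper for illustration. Assumes each entity begins
    with 'B'; any marker other than 'B' or 'I' is mapped to 'O'.
    """
    biluo = []
    n = len(iob_tags)
    for i, tag in enumerate(iob_tags):
        if tag not in ("B", "I"):
            biluo.append("O")
            continue
        # The entity ends at this token unless the next marker is 'I'.
        ends_here = (i + 1 == n) or (iob_tags[i + 1] != "I")
        if tag == "B":
            biluo.append("U" if ends_here else "B")
        else:
            biluo.append("L" if ends_here else "I")
    return biluo


# IOB markers as printed for doc1 before the custom entities were added.
iob = ["O"] * 12 + ["B", "I", "O", "O", "O", "B", "O"]
converted = iob_to_biluo(iob)
```

In this example the two-token entity becomes B followed by L, and the single-token entity becomes U.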