Annotating BILOU tags from another system

Question

viksit opened this issue 8 years ago · comments

I have a domain-specific NER system that generates BILOU tags for a given sentence. What would be the best way to integrate this information into spaCy?

In #187, there's an example of how to train the system on new data. But I'm not entirely sure whether there's a way to do something like:

doc = nlp(u"this is a lion")
custom_ents = get_custom_ents(doc)
# >>  ['O', 'O', 'O', 'U-ANIMAL']
# function called annotate to combine this information into spacy's tokens/spans
annotate(doc, custom_ents) # how do we write this?
print([(i.text, i.label_) for i in doc.ents])
# >> [(lion, 'ANIMAL')]

Matthew Honnibal · Answer 1 · Tue Jul 26 2016 17:17:37 GMT+0800 (China Standard Time)

You should be able to do:

doc.ents = [(label, start, end) for (label, start, end) in ents]

Example: label "best buy" as a retailer.

nlp.entity.add_label(u'RETAILER')
retailer = nlp.vocab.strings[u'RETAILER']
doc = nlp(u'best buy is a pretty bad store')
doc.ents = [(retailer, 0, 2)]
span = doc[0:2]
best_buy = list(doc.ents)[0]
assert span.start == best_buy.start == 0
assert span.end == best_buy.end == 2

The API here isn't very polished. I'm surprised that doc.ents = [] doesn't clear entities; we only add entities here. This should really be changed.

Here's some more detailed usage description:

Label should be an integer encoding of the label. You should register it with the NER as well.
Start is an integer indicating the start of the slice, i.e. the index of the first token within the document. Watch out for changed indices from .merge() operations.
End is an integer indicating the end of the range (exclusive, as in doc[start:end]).
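
To connect this to the original question, here is a minimal sketch of the conversion step, assuming the external system emits one BILOU tag per spaCy token. The function name bilou_to_spans is hypothetical and not part of spaCy; it only produces (label, start, end) triples with string labels and an exclusive end.

```python
def bilou_to_spans(tags):
    """Convert one BILOU tag per token into (label, start, end) triples.

    `end` is exclusive, so each triple lines up with the slice doc[start:end].
    Hypothetical helper for illustration; not a spaCy function.
    """
    spans = []
    start = None  # token index where the currently open entity began
    label = None  # label of the currently open entity
    for i, tag in enumerate(tags):
        # Assumption: some systems write the digit 0 instead of the letter O.
        if tag in ("O", "0"):
            if start is not None:
                raise ValueError("entity opened at %d was never closed" % start)
            continue
        prefix, _, name = tag.partition("-")
        if prefix == "U" and start is None:
            # Single-token entity.
            spans.append((name, i, i + 1))
        elif prefix == "B" and start is None:
            # First token of a multi-token entity.
            start, label = i, name
        elif prefix == "I" and start is not None and name == label:
            # Inside the open entity; nothing to record yet.
            continue
        elif prefix == "L" and start is not None and name == label:
            # Last token of the open entity: close it.
            spans.append((name, start, i + 1))
            start, label = None, None
        else:
            raise ValueError("unexpected tag %r at position %d" % (tag, i))
    if start is not None:
        raise ValueError("entity opened at %d was never closed" % start)
    return spans


print(bilou_to_spans(["O", "O", "O", "U-ANIMAL"]))
```

With the API described in this answer, each string label would still need to be mapped to its integer ID (for example through nlp.vocab.strings) before the triples are assigned to doc.ents.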

Finally, here's the relevant code:

https://github.com/spacy-io/spaCy/blob/master/spacy/tokens/doc.pyx#L178

Viksit Gaur · Answer 2 · Wed Jul 27 2016 07:33:48 GMT+0800 (China Standard Time)

@syllog1sm awesome, thanks for the information.

Viksit Gaur · Answer 3 · Thu Aug 04 2016 06:25:28 GMT+0800 (China Standard Time)

@syllog1sm couple of follow ups.


animal = nlp.vocab.strings[u"ANIMAL"]
doc1 = nlp(u"this is a lion and that is a royal bengal tiger that Michael Collins loved on the Apollo 11")

print()
print(list(doc1.ents))
print([(i.text, i.label_) for i in doc1.ents])
print([i.ent_iob_ for i in doc1])


>> [Michael Collins, Apollo]
>> [(u'Michael Collins', u'PERSON'), (u'Apollo', u'ORG')]
>> ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'O', 'O', 'O', 'B', 'O']

old = [(i.label, i.start, i.end) for i in doc1.ents]

# derived from external ner
new = [(animal, 3, 4), (animal, 8, 11)]

doc1.ents = old + new
print()
print([(i.text, i.label_) for i in doc1.ents])
print([i.ent_iob_ for i in doc1])
print("entities: ", list(doc1.ents))

>> [(u'lion', u'ANIMAL'), (u'royal bengal tiger', u'ANIMAL'), (u'Michael Collins', u'PERSON'), (u'Apollo', u'ORG')]
>> ['', '', '', 'B', '', '', '', '', 'B', 'I', 'I', '', 'B', 'I', '', '', '', 'B', '']
>> entities:  [lion, royal bengal tiger, Michael Collins, Apollo]

lion = doc1[3:4]
rbt = doc1[8:11]
lion_ent, rbt_ent, mc, apollo = list(doc1.ents)

assert lion_ent.start == lion.start
assert rbt_ent.start == rbt.start

Questions:

It looks like spaCy loses the 'O' tag after adding new entities. Is this on purpose?
I don't see L or U tags anywhere - why's that?

Matthew Honnibal · Answer 4 · Sun Oct 23 2016 21:52:16 GMT+0800 (China Standard Time)

Thanks for the report — fixed.

I don't see L or U tags anywhere - why's that?

Currently the ent_iob field stores IOB markers, even though the model is trained with BILUO tags. Maybe this should change; if you want to advocate for that, it's best to start a new thread.
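
In the meantime, the L and U markers can be recovered from the IOB markers by looking one token ahead. The following is a rough sketch under the assumption that the input is a list of bare markers like the ent_iob_ output shown earlier in the thread; iob_to_biluo is a hypothetical helper, not a spaCy function.

```python
def iob_to_biluo(tags):
    """Convert bare IOB markers ('I', 'O', 'B') into BILUO markers.

    Hypothetical helper for illustration. Anything that is not 'B' or 'I'
    (including an empty string) is treated as outside an entity.
    """
    biluo = []
    for i, tag in enumerate(tags):
        # The entity continues only if the next token is marked 'I'.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I"
        if tag == "B":
            # A 'B' with no continuation is a single-token (Unit) entity.
            biluo.append("B" if continues else "U")
        elif tag == "I":
            # An 'I' with no continuation is the Last token of its entity.
            biluo.append("I" if continues else "L")
        else:
            biluo.append("O")
    return biluo


iob = ["O"] * 12 + ["B", "I", "O", "O", "O", "B", "O"]
print(iob_to_biluo(iob))
```

Applied to the first output in the follow-up above, "Michael Collins" becomes B, L and "Apollo" becomes U.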
