Annotating BILOU tags from another system
viksit opened this issue
I have a domain specific NER system that generates BILOU tags for a given sentence. What would be the best way to integrate this information into spacy?
In #187, there's an example of how to train the system on new data. But I'm not entirely sure if there's a way to do something like:
doc = nlp(u"this is a lion")
custom_ents = get_custom_ents(doc)
# >> ['O', 'O', 'O', 'U-ANIMAL']
# function called annotate to combine this information into spacy's tokens/spans
annotate(doc, custom_ents) # how do we write this?
print([(i.text, i.label_) for i in doc.ents])
# >> [(lion, 'ANIMAL')]
You should be able to do:
doc.ents = [(label, start, end) for (label, start, end) in ents]
Example --- label "best buy" as a retailer:
nlp.entity.add_label(u'RETAILER')
retailer = nlp.vocab.strings[u'RETAILER']
doc = nlp(u'best buy is a pretty bad store')
doc.ents = [(retailer, 0, 2)]
span = doc[0:2]
best_buy = list(doc.ents)[0]
assert span.start == best_buy.start == 0
assert span.end == best_buy.end == 2
The API here isn't so polished. I'm surprised that doc.ents = [] doesn't clear entities. We only add entities here. This should really be changed.
Here's some more detailed usage description:
- Label should be an integer encoding of the label. You should register it with the NER as well.
- Start is an integer indicating the start of the slice, i.e. the index of the first token within the document. Watch out for changed indices from .merge() operations.
- End is an integer indicating the end of the range (exclusive, as in doc[start:end]).
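If your external system emits one BILOU tag per token, the start and end fields described above can be derived with a small converter. The following is only a sketch: bilou_to_spans is a hypothetical helper, not part of spaCy, and it assumes tags of the form "U-ANIMAL" with "O" for tokens outside any entity. It returns string labels, so you would still need to map each label to its integer ID (for example via nlp.vocab.strings) before assigning to doc.ents.

```python
def bilou_to_spans(tags):
    """Convert per-token BILOU tags into (label, start, end) triples.

    Hypothetical helper for illustration. `end` is exclusive, so each
    triple lines up with the slice doc[start:end].
    """
    spans = []
    start = None  # token index where the currently open entity began
    label = None  # label of the currently open entity
    for i, tag in enumerate(tags):
        if tag == "O":
            if start is not None:
                raise ValueError("entity opened at %d was never closed" % start)
            continue
        prefix, _, tag_label = tag.partition("-")
        if prefix == "U":
            # Single-token entity.
            if start is not None:
                raise ValueError("U tag inside open entity at %d" % i)
            spans.append((tag_label, i, i + 1))
        elif prefix == "B":
            # First token of a multi-token entity.
            if start is not None:
                raise ValueError("B tag inside open entity at %d" % i)
            start, label = i, tag_label
        elif prefix in ("I", "L"):
            if start is None or tag_label != label:
                raise ValueError("unexpected tag %r at %d" % (tag, i))
            if prefix == "L":
                # Last token: close the entity.
                spans.append((label, start, i + 1))
                start, label = None, None
        else:
            raise ValueError("unknown tag %r at %d" % (tag, i))
    if start is not None:
        raise ValueError("entity opened at %d was never closed" % start)
    return spans


print(bilou_to_spans(["O", "O", "O", "U-ANIMAL"]))
# [('ANIMAL', 3, 4)]
```

The sketch raises on malformed sequences rather than guessing, since a silently wrong span would be hard to spot once it is attached to the document.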
Finally, here's the relevant code:
https://github.com/spacy-io/spaCy/blob/master/spacy/tokens/doc.pyx#L178
@syllog1sm awesome, thanks for the information.
@syllog1sm couple of follow ups.
animal = nlp.vocab.strings[u"ANIMAL"]
doc1 = nlp(u"this is a lion and that is a royal bengal tiger that Michael Collins loved on the Apollo 11")
print()
print(list(doc1.ents))
print([(i.text, i.label_) for i in doc1.ents])
print([i.ent_iob_ for i in doc1])
>> [Michael Collins, Apollo]
>> [(u'Michael Collins', u'PERSON'), (u'Apollo', u'ORG')]
>> ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'O', 'O', 'O', 'B', 'O']
old = [(i.label, i.start, i.end) for i in doc1.ents]
# derived from external ner
new = [(animal, 3, 4), (animal, 8, 11)]
doc1.ents = old + new
print()
print([(i.text, i.label_) for i in doc1.ents])
print([i.ent_iob_ for i in doc1])
print("entities: ", list(doc1.ents))
>> [(u'lion', u'ANIMAL'), (u'royal bengal tiger', u'ANIMAL'), (u'Michael Collins', u'PERSON'), (u'Apollo', u'ORG')]
>> ['', '', '', 'B', '', '', '', '', 'B', 'I', 'I', '', 'B', 'I', '', '', '', 'B', '']
>> entities: [lion, royal bengal tiger, Michael Collins, Apollo]
lion = doc1[3:4]
rbt = doc1[8:11]
lion_ent, rbt_ent, mc, apollo = list(doc1.ents)
assert lion_ent.start == lion.start
assert rbt_ent.start == rbt.start
Questions,
- It looks like spacy loses the 'O' tag after adding new entities. Is this on purpose?
- I don't see L or U tags anywhere - why's that?
Thanks for the report — fixed.
I don't see L or U tags anywhere - why's that?
Currently the ent_iob field stores the IOB markers, even though the model is trained with BILUO tags. Maybe this should change; if you want to advocate for that, it's best if we start a new thread.
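To make the difference concrete, here is a rough sketch of how BILUO tags collapse into IOB markers under the usual convention: L becomes I and U becomes B, which would explain why neither shows up in ent_iob_. The biluo_to_iob function is a hypothetical helper written for illustration, not spaCy's own code.

```python
# Assumed mapping from BILUO prefixes to IOB markers.
BILUO_TO_IOB = {"B": "B", "I": "I", "L": "I", "U": "B", "O": "O"}


def biluo_to_iob(tags):
    """Reduce BILUO tags such as 'U-ANIMAL' to bare IOB markers."""
    return [BILUO_TO_IOB[tag.split("-", 1)[0]] for tag in tags]


tags = ["O", "U-ANIMAL", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
print(biluo_to_iob(tags))
# ['O', 'B', 'O', 'B', 'I', 'I']
```

This matches the output earlier in the thread, where the single-token entity "lion" is marked B and "royal bengal tiger" is marked B, I, I.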