Is there any way to replace the current NER?

Question

coolcoder001 opened this issue 2 years ago · comments

Shounak commented 2 years ago

Hi,
Thanks a lot for the project. It is indeed wonderful.

However, I would like to replace the NER engine. I want to use Flair instead of spaCy.

Can I do that?

Laurel Orr · Answer 1 · Thu Apr 21 2022 16:25:16 GMT+0800 (China Standard Time)

Hi!

Yes, you can do this. I have a list of possible extractors here. If you implement your own extractor function and add it there, you should be able to select it via this argument here.

As long as you have the same inputs/outputs, it should be possible.
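To make the "list of extractors selected by an argument" idea concrete, here is a minimal, self-contained sketch of that pattern. It is not Bootleg's actual code: the names `EXTRACTORS`, `register_extractor`, `extract_mentions`, and the `(mention, char_start, char_end)` return shape are all assumptions made up for illustration, and the toy extractor stands in for a real NER model.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical shared signature: text in, list of (mention, char_start, char_end) out.
Span = Tuple[str, int, int]
Extractor = Callable[[str], List[Span]]

# Hypothetical registry mapping a method name to an extractor function.
EXTRACTORS: Dict[str, Extractor] = {}


def register_extractor(name: str) -> Callable[[Extractor], Extractor]:
    """Decorator that adds an extractor to the registry under `name`."""
    def decorator(fn: Extractor) -> Extractor:
        EXTRACTORS[name] = fn
        return fn
    return decorator


@register_extractor("capitalized")
def capitalized_word_extractor(text: str) -> List[Span]:
    """Toy stand-in for an NER model: every capitalized word is a mention."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\b[A-Z][a-z]+\b", text)]


def extract_mentions(text: str, extract_method: str = "capitalized") -> List[Span]:
    """Look up the extractor named by `extract_method` and run it on `text`."""
    try:
        extractor = EXTRACTORS[extract_method]
    except KeyError:
        raise ValueError(f"Unknown extract method: {extract_method!r}")
    return extractor(text)
```

Any function registered this way can be swapped in by name, provided it accepts and returns the same shapes as the others, which is the "same inputs/outputs" condition mentioned above.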

Shounak · Answer 2 · Thu Apr 21 2022 19:39:24 GMT+0800 (China Standard Time)

Hi,
Thanks a lot for the quick response. :)
My extractor function uses Flair. It takes a string as input and returns the extracted entities in a pandas DataFrame.

def entity_recognition(text):
    """Given a text document, run NER on it using flair and return a dataframe with the following columns
    text: actual raw text input
    entity: identified entity text
    entity_start: character start position of entity in raw text
    entity_end: character end position of entity in raw text
    """
    import pandas as pd
    from flair.data import Sentence
    from flair.models import SequenceTagger
    from tqdm import tqdm  # was used below without being imported

    # Only organisations, people, and geopolitical entities are kept.
    keep_labels = {'ORG', 'PERSON', 'GPE'}

    tagger_fast = SequenceTagger.load('ner-ontonotes-fast')
    sentence = Sentence(text)
    tagger_fast.predict(sentence, mini_batch_size=16)

    # Serialise the predictions once instead of on every field access.
    predicted = sentence.to_dict(tag_type='ner')['entities']

    entities = []
    for ent in tqdm(predicted):
        # As in the original, the first whitespace-separated token of the
        # label's string form is taken as the tag name.
        label = str(ent['labels'][0]).split()[0]
        # Exact set membership; the original `label in 'ORG'` was a substring
        # test, so a shorter tag such as 'OR' would also have matched.
        if label in keep_labels:
            entities.append([str(ent['text']), ent['start_pos'], ent['end_pos']])

    entities = pd.DataFrame(entities, columns=['entity', 'entity_start', 'entity_end'])
    entities['text'] = text
    return entities

Can you please help me with the changes I need to make to this function so that it can work with bootleg?

Thanks in advance.

Laurel Orr · Answer 3 · Fri Apr 22 2022 02:02:08 GMT+0800 (China Standard Time)

So I went ahead and added your function as an example in the branch here. If you use the annotator with the extract method set to custom, it should trigger your extractor. I haven't tested it, but it should get you started.
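One adjustment that may come up when plugging in an extractor that reports character offsets, like the Flair function earlier in this thread, is converting those offsets to word-level spans. Whether the custom extractor hook needs this is an assumption here, not something confirmed in this thread, so check the branch before relying on it. The helper name `char_span_to_word_span` is hypothetical, and the sketch assumes plain whitespace tokenization with end-exclusive spans.

```python
import re
from typing import Tuple


def char_span_to_word_span(text: str, char_start: int, char_end: int) -> Tuple[int, int]:
    """Map a [char_start, char_end) character span to a [word_start, word_end)
    span over the whitespace-separated tokens of `text`.

    Assumption: the consumer counts words by splitting on whitespace.
    """
    word_start = None
    word_end = None
    for idx, match in enumerate(re.finditer(r"\S+", text)):
        # A token belongs to the span if the two character ranges overlap.
        if match.start() < char_end and match.end() > char_start:
            if word_start is None:
                word_start = idx
            word_end = idx + 1
    if word_start is None:
        raise ValueError("character span does not overlap any token")
    return word_start, word_end


# Usage: offsets as they would appear in the entity_start / entity_end columns.
sentence = "Barack Obama visited Paris"
person_span = char_span_to_word_span(sentence, 0, 12)   # covers "Barack Obama"
place_span = char_span_to_word_span(sentence, 21, 26)   # covers "Paris"
```

If the target interface works with character offsets directly, this step is unnecessary and the DataFrame columns can be passed through as they are.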

Shounak · Answer 4 · Fri Apr 22 2022 20:52:33 GMT+0800 (China Standard Time)

Hi @lorr1, thanks a lot for your help. You are so nice and awesome :)

I am able to run this code using the Flair NER engine.

However, if I need to make more changes, can I push them directly to the branch you created, or do I need to raise a PR?

Laurel Orr · Answer 5 · Sat Apr 23 2022 05:03:59 GMT+0800 (China Standard Time)

How about you raise PRs? I'll pretty much approve everything, but I'd like to keep track of what you're finding difficult/useful to implement.

Thanks!