sagorbrur / bnlp

BNLP is a natural language processing toolkit for the Bengali language.

Home Page: https://pypi.org/project/bnlp-toolkit/

Need more details on how the CRF-based NER model is trained

ArupDas15 opened this issue

Please give further details on how to train the NER model on custom data. The dataset (https://github.com/MISabic/NER-Bangla-Dataset) on which the model was trained has both IOB and BIOES tags. From the example given (screenshot attached for reference), I cannot tell which tagging style was used to train the model.

image

Also, from the screenshot, it appears that you passed the same example (the same sentence twice) for both training and testing. This is not clear to me either.

Please give more details about the architecture of the CRF model used for training. I tried to work these details out from your paper (https://arxiv.org/pdf/2102.00405.pdf) but could not. Please shed some light on these aspects, since I cannot follow the internals and the model looks like a black box to me. Hoping for a quick and positive response.

Hello @ArupDas15,
Thanks for raising this issue.
Below I break down your questions and answer each one.

  • Which tagging format can I use for training NER?
    Answer: The module does not depend on any particular tagging scheme, so you can use IOB, BILOU, or BIOES. Just keep in mind that your test data must use the same tagging scheme as your training data. I used the BIOES format from https://github.com/MISabic/NER-Bangla-Dataset
  • Why did you pass the same example for training and testing?
    Answer: It's just an example. First, preprocess your data into the format shown in the example. Then split it into train and test sets and pass them to the training method.
  • What is the training model? Where can I find details about that model?
    Answer: I used sklearn-crfsuite for the NER task. Its API documentation describes the model's arguments. Also, if you want to understand CRFs, please read these: paper1, paper2

I hope this answers your questions.
Regards
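
The preprocessing and splitting steps described above could be sketched roughly as follows. This is only an illustration, not BNLP's own code: it assumes each sentence is a list of (token, tag) pairs, and the helper names `iob_to_bioes`, `convert_dataset`, and `split_sentences` are hypothetical. It converts IOB (IOB2-style) tags to BIOES so that training and test data share one scheme, then splits the sentences.

```python
import random


def iob_to_bioes(tags):
    """Convert one sentence's IOB2 tags (B-X, I-X, O) to BIOES tags."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next token is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            bioes.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            bioes.append(("I-" if continues else "E-") + label)
    return bioes


def convert_dataset(sentences):
    """Apply iob_to_bioes to a list of [(token, tag), ...] sentences."""
    converted = []
    for sentence in sentences:
        tokens = [token for token, _ in sentence]
        tags = iob_to_bioes([tag for _, tag in sentence])
        converted.append(list(zip(tokens, tags)))
    return converted


def split_sentences(sentences, test_ratio=0.2, seed=0):
    """Shuffle sentences reproducibly and return (train, test)."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Toy data (made up for illustration), tagged in IOB.
iob_data = [
    [("Sujit", "B-PER"), ("Roy", "I-PER"), ("Nandi", "I-PER"), ("spoke", "O")],
    [("Dhaka", "B-LOC"), ("is", "O"), ("large", "O")],
]
bioes_data = convert_dataset(iob_data)
```

The resulting train and test lists can then be passed to whatever training method you use, as long as both were converted with the same function.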

Hi @sagorbrur,
Thank you very much for your prompt reply. Is it possible to use your pre-trained model and train further? i.e. can I feed in new instances and update the trained model (bn_ber.pkl)? Or do I need to train from scratch...?

Hi @ArupDas15,
I don't think so. You need to train from scratch.
You can merge your new data with the NER-Bangla-Dataset data, using the same tagging format, and then train a new model. Remember, your new data must use the same tagging format as that dataset.
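
A minimal sketch of such a merge, with a guard against mixing tagging formats, might look like this. It assumes each dataset is a list of sentences made of (token, tag) pairs; `detect_scheme` and `merge_datasets` are hypothetical helpers, not part of BNLP, and the scheme detection is a simple heuristic based on tag prefixes.

```python
def detect_scheme(sentences):
    """Guess the tagging scheme from the tag prefixes that occur.

    Returns "BIOES", "BILOU", "IOB", or None if there are no entity tags.
    """
    prefixes = {tag.split("-", 1)[0] for sent in sentences for _, tag in sent}
    if prefixes & {"E", "S"}:
        return "BIOES"
    if prefixes & {"L", "U"}:
        return "BILOU"
    if prefixes & {"B", "I"}:
        return "IOB"
    return None


def merge_datasets(base, extra):
    """Concatenate two datasets, refusing to mix tagging schemes."""
    base_scheme, extra_scheme = detect_scheme(base), detect_scheme(extra)
    if base_scheme and extra_scheme and base_scheme != extra_scheme:
        raise ValueError(
            f"tagging schemes differ: {base_scheme} vs {extra_scheme}"
        )
    return base + extra


# Toy data (made up for illustration).
base = [[("Sujit", "B-PER"), ("Roy", "E-PER"), ("spoke", "O")]]
new_same = [[("Dhaka", "S-LOC"), ("is", "O"), ("large", "O")]]
merged = merge_datasets(base, new_same)  # both BIOES, so this succeeds
```

If the new data is in a different scheme, convert it first and then merge; the combined list is what you would train on from scratch.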

Hi @sagorbrur,
I tried to reproduce your results just to be sure that I am doing things correctly. I trained on the training data and tested on the test data from https://github.com/MISabic/NER-Bangla-Dataset. Based on your results (https://arxiv.org/pdf/2102.00405.pdf), I was expecting an F1 score of 66.88, but I am getting an F1 score of 90.35, which is also the value I obtained for accuracy.

I am attaching a screenshot for your kind reference:
image

I must be doing something wrong here. Can you please help me out?

Hello @ArupDas15,
There is nothing wrong with your training.
I think it's a problem with the sklearn metrics.
You can predict with the trained model and check the F1 score using seqeval.
Here's my F1 score using seqeval:

Selection_004
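
One plausible explanation for the gap, which the thread itself does not confirm, is that the 90.35 figure was computed per token (where the many `O` tags inflate the score), while seqeval scores whole entity spans. The sketch below illustrates the difference in plain Python. It is a simplified strict span matcher for BIOES tags, not seqeval's implementation, and the function names are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (label, start, end) spans in one BIOES-tagged sentence.

    Malformed sequences (for example B-X with no closing E-X) are dropped.
    """
    entities = set()
    open_span = None  # (label, start index) of an entity still being read
    for i, tag in enumerate(tags):
        if tag == "O":
            open_span = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":
            entities.add((label, i, i))
            open_span = None
        elif prefix == "B":
            open_span = (label, i)
        elif open_span is not None and open_span[0] == label:
            if prefix == "E":
                entities.add((label, open_span[1], i))
                open_span = None
            # prefix == "I": the span simply continues
        else:
            open_span = None
    return entities


def token_accuracy(gold_sents, pred_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    pairs = [(g, p) for gs, ps in zip(gold_sents, pred_sents)
             for g, p in zip(gs, ps)]
    return sum(g == p for g, p in pairs) / len(pairs)


def entity_f1(gold_sents, pred_sents):
    """Micro-averaged F1 over exactly matching entity spans."""
    gold, pred = set(), set()
    for n, (gs, ps) in enumerate(zip(gold_sents, pred_sents)):
        gold |= {(n,) + e for e in extract_entities(gs)}
        pred |= {(n,) + e for e in extract_entities(ps)}
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example (made up): one wrong tag out of ten tokens.
gold = [["O", "O", "O", "S-PER", "B-PER", "I-PER", "E-PER", "O", "O", "O"]]
pred = [["O", "O", "O", "S-PER", "B-PER", "I-PER", "O", "O", "O", "O"]]

acc = token_accuracy(gold, pred)  # 0.9: nine of ten tags match
f1 = entity_f1(gold, pred)        # about 0.667: one of two gold entities found
```

A single wrong tag costs 10% of token accuracy here but loses a whole entity, which is why entity-level scores are typically much lower than token-level ones on data dominated by `O` tags.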