janarishsaju / custom_ner_bert

BERT Finetuned with Custom Data for NER (Named Entity Recognition)

Janarish Saju C

AI/ML Engineer

Named Entity Recognition

10th December 2022


Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text.


  1. EDA (Exploratory Data Analysis)
  2. How many solutions can you think of and why are you choosing your version of the solution?
  3. Error Analysis


Several Python libraries can be used to implement NER.

  1. BERT: https://huggingface.co/
  2. spaCy: https://spacy.io/usage/linguistic-features
  3. NLTK: https://www.nltk.org/book/ch07.html
  4. Stanford CoreNLP: https://stanfordnlp.github.io/CoreNLP/
  5. Polyglot: http://polyglot.readthedocs.io/en/latest/NamedEntityRecognition.html
  6. Apache OpenNLP: https://opennlp.apache.org/

Among these, I chose the first two, BERT and spaCy, which are my favorites.

Other popular options exist as well, such as OpenAI, NLTK, and Stanford CoreNLP.

Advantages & Disadvantages

BERT
  • Advantages
    • Time efficient
    • Pretrained on large datasets
    • Knowledge transfer (transfer learning)
  • Disadvantages
    • Computationally expensive
    • Not as good for domain-specific NER

spaCy
  • Advantages
    • Fast, because its low-level core is written in Cython
    • Good for domain-specific NER
  • Disadvantages
    • Requires more training data
    • More complex data structures

NLTK
  • Advantages
    • Good for base-level analysis
  • Disadvantages
    • Requires implementations from scratch

Stanford CoreNLP
  • Not assessed, since I have not used these tools in any of my earlier projects



(*All the necessary steps carried out in the shared code are listed here)

1. Text Statistics

  • Average Word Length
  • Histogram and Bar Charts
  • Most Influential Words
  • Most Influential Entities
  • Most Dominant Entity Labels
  • N-gram Exploration

(*The analysis above shows that the corpus is dominated by social networks such as Twitter, Facebook, and YouTube. Please see the Colab Notebook)
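The text statistics above can be sketched with the standard library alone. This is only an illustration: the toy corpus below is made up, and the notebook's actual data loading and plotting are not reproduced here.

```python
from collections import Counter

# Hypothetical toy corpus: each sentence is a list of (token, label) pairs.
sentences = [
    [("I", "O"), ("visited", "O"), ("New", "B-Location"), ("York", "I-Location")],
    [("London", "B-Location"), ("is", "O"), ("big", "O")],
]

tokens = [tok for sent in sentences for tok, _ in sent]
labels = [lab for sent in sentences for _, lab in sent]

# Average word length across all tokens.
avg_word_len = sum(len(t) for t in tokens) / len(tokens)

# Most dominant entity labels (the "O" tag is ignored).
label_counts = Counter(lab for lab in labels if lab != "O")

# N-gram exploration, here bigrams: adjacent token pairs within a sentence.
bigram_counts = Counter(
    (a, b)
    for sent in sentences
    for (a, _), (b, _) in zip(sent, sent[1:])
)

print(round(avg_word_len, 2))
print(label_counts.most_common(3))
print(bigram_counts.most_common(2))
```

The same counters can feed the histograms and bar charts, for example by plotting `label_counts.most_common()`.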

2. Outlier Analysis

  • Found some outliers and unusual patterns where entities appear together with special characters. (See the Colab Notebook)
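One possible way to surface such cases is a regular-expression check over tagged tokens. The rule below (anything that is not a word character or whitespace counts as "special") and the sample sentence are assumptions for illustration, not the check used in the notebook.

```python
import re

# Assumed definition of "special": any character that is not a word
# character (letter, digit, underscore) and not whitespace.
SPECIAL = re.compile(r"[^\w\s]")

def find_suspicious_entities(sentence):
    """Return (token, label) pairs tagged as entities that contain special characters."""
    return [
        (tok, lab)
        for tok, lab in sentence
        if lab != "O" and SPECIAL.search(tok)
    ]

# Hypothetical sample sentence.
sample = [
    ("@London", "B-Location"),
    ("is", "O"),
    ("great", "O"),
    ("!", "O"),
    ("#Paris", "B-Location"),
]
flagged = find_suspicious_entities(sample)
print(flagged)
```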

3. Assumptions & Thoughts

  • As discussed in the Outlier Analysis, the following need attention:
    • Mismatches in name tagging
    • Special characters overlapping with entities

(See the Colab Notebook)

  • Data Augmentation
    • An effective approach when we have fewer training samples
  • Data Annotation
    • A better data annotation pipeline and tooling help avoid faulty datasets caused by human error
  • Ensembled Algorithms
    • Using ensemble algorithms effectively can achieve better performance.

(*Discussed in last section)


(*All the necessary steps carried out in the shared code are listed here)


1. Data Preprocessing Steps

  • Data read/import
  • Handle data encoding issue
  • Data conversion as per model requirements
  • Data partition
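A rough sketch of the read, convert, and partition steps follows. The file layout is an assumption (one `token<TAB>label` pair per line, blank lines between sentences), and the function names are hypothetical; the notebook may read the data differently.

```python
import io
import random

def read_tagged_file(handle):
    """Parse token<TAB>label lines; blank lines separate sentences."""
    sentences, current = [], []
    for raw in handle:
        line = raw.rstrip("\n")
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.rsplit("\t", 1)
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences

def split_train_valid(sentences, valid_fraction=0.2, seed=42):
    """Shuffle a copy of the sentences and hold out a fraction for validation."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]

# In-memory stand-in for a file. On disk, encoding issues can be handled with
# open(path, encoding="utf-8", errors="replace").
raw_text = "I\tO\nlike\tO\nParis\tB-Location\n\n" * 5
data = read_tagged_file(io.StringIO(raw_text))
train, valid = split_train_valid(data)
print(len(train), len(valid))
```

Splitting at the sentence level, rather than the token level, keeps each sentence intact in either the training or the validation part.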

2. Feature Engineering Steps

  • Unique input and output label features
  • Encode the labels to Numeric representation
  • Tokenize and embed the datasets
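The label encoding and the alignment of word-level labels to sub-word tokens can be sketched as below. The `word_ids` list is written by hand to mimic what a fast Hugging Face tokenizer reports for each sub-token; the function names are hypothetical, and using `-100` as the masked value is an assumption based on the default ignore index of PyTorch's cross-entropy loss.

```python
def build_label_maps(sentences):
    """Map each unique label to an integer id and back."""
    unique = sorted({lab for sent in sentences for _, lab in sent})
    label2id = {lab: i for i, lab in enumerate(unique)}
    id2label = {i: lab for lab, i in label2id.items()}
    return label2id, id2label

def align_labels(word_ids, word_labels, label2id, ignore_index=-100):
    """Give the first sub-token of each word its label id and mask the rest."""
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None or wid == previous:
            # Special tokens and continuation sub-tokens are masked out.
            aligned.append(ignore_index)
        else:
            aligned.append(label2id[word_labels[wid]])
        previous = wid
    return aligned

# Hypothetical example sentence.
sentences = [[("Johannesburg", "B-Location"), ("rocks", "O")]]
label2id, id2label = build_label_maps(sentences)

# Hand-written word ids for "[CLS] johan ##nes ##burg rocks [SEP]".
word_ids = [None, 0, 0, 0, 1, None]
word_labels = [lab for _, lab in sentences[0]]
aligned = align_labels(word_ids, word_labels, label2id)
print(label2id)
print(aligned)
```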

3. Model Initialization

  • Initialize the BERT model
  • Define the Task Name
  • Define the Tokenizer method

4. Hyperparameter Tuning

  • The following parameters were used
    • evaluation_strategy = "epoch",
    • learning_rate=1e-4,
    • per_device_train_batch_size=16,
    • per_device_eval_batch_size=16,
    • num_train_epochs=6,
    • weight_decay=1e-5,
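For reference, the same values collected as keyword arguments. The parameter names match Hugging Face `TrainingArguments`, so passing them there is a plausible reading of the notebook, but that is an assumption; the output directory in the comment is hypothetical.

```python
# Hyperparameters from the list above, as a plain dict of keyword arguments.
training_kwargs = {
    "evaluation_strategy": "epoch",      # evaluate at the end of every epoch
    "learning_rate": 1e-4,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "num_train_epochs": 6,
    "weight_decay": 1e-5,
}

# Assumed usage (not executed here):
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="ner_out", **training_kwargs)
# Recent transformers releases call the first key "eval_strategy" instead.
for name, value in training_kwargs.items():
    print(f"{name} = {value!r}")
```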

5. Train the Model

  • Train the model with the following inputs
    • train_dataset,
    • eval_dataset,
    • tokenizer,
    • compute_metrics
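A `compute_metrics` callback receives the model's logits and the gold label ids and returns a dict of scores. The version below is a minimal sketch that reports token-level accuracy only and skips masked positions; the notebook's own function may compute different metrics, and the `-100` mask value is an assumption.

```python
def compute_metrics(eval_pred, ignore_index=-100):
    """Token-level accuracy over positions whose label is not masked."""
    logits, labels = eval_pred
    correct = total = 0
    for sent_logits, sent_labels in zip(logits, labels):
        for token_logits, gold in zip(sent_logits, sent_labels):
            if gold == ignore_index:
                continue
            scores = list(token_logits)
            predicted = scores.index(max(scores))  # argmax over label scores
            correct += int(predicted == gold)
            total += 1
    return {"accuracy": correct / total if total else 0.0}

# Hypothetical batch: one sentence, three tokens, two labels.
logits = [[[0.1, 2.0], [1.5, 0.2], [0.3, 0.4]]]
labels = [[-100, 0, 0]]
metrics = compute_metrics((logits, labels))
print(metrics)
```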

6. Evaluate the Model

  • Evaluation is done on the 20 percent of the training data held out for validation.

7. Error Analysis

  • Accuracy on Validation Dataset
  • Confusion Matrix / Cross Table
  • Precision, Recall, F-Measure
  • K-fold cross-validation can be applied for deeper analysis.

(*The results clearly show that the labels I-Location, B-Location, and O have the most mismatches. We should analyze these labels in more depth. Please see the Colab Notebook)
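The confusion counts and per-label precision, recall, and F-measure can be computed at token level as sketched here. The gold and predicted sequences are invented for illustration, and entity-level scoring (as done by some evaluation libraries) would give different numbers.

```python
from collections import Counter

def confusion_counts(gold, pred):
    """Count (gold_label, predicted_label) pairs over aligned token sequences."""
    return Counter(zip(gold, pred))

def precision_recall_f1(gold, pred, label):
    """Token-level precision, recall, and F1 for a single label."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical gold and predicted labels for five tokens.
gold = ["B-Location", "I-Location", "O", "O", "B-Location"]
pred = ["B-Location", "O", "O", "B-Location", "B-Location"]

matrix = confusion_counts(gold, pred)
p, r, f1 = precision_recall_f1(gold, pred, "B-Location")
print(matrix)
print(p, r, f1)
```

Off-diagonal entries such as `matrix[("I-Location", "O")]` point directly at the label pairs that are confused most often.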

8. Prediction Module

  • Read the test data from disk
  • Handle Encoding and Alignment issues
  • Data Conversion
  • Feed the Converted Test Data to the fine-tuned model and Get Predictions
  • Get Label Predictions using ArgMax function
  • Get Probabilistic Prediction Scores using SoftMax function
  • Store all results in a DataFrame
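The ArgMax and SoftMax steps can be illustrated on a single token's logits with plain Python. The logits and the `id2label` mapping below are made up; in the notebook these would come from the fine-tuned model and the label encoding step.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_token(logits, id2label):
    """Return the most likely label and its probability for one token."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return id2label[best], probs[best]

# Hypothetical label mapping and logits.
id2label = {0: "B-Location", 1: "I-Location", 2: "O"}
label, score = predict_token([2.0, 0.5, 0.1], id2label)
print(label, round(score, 3))
```

Since softmax preserves the ordering of the logits, the argmax can be taken on either the logits or the probabilities with the same result.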

9. Export the Results

  • Export the test results to a text file with fields separated by "\t"
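A tab-separated export can be written with the standard `csv` module, as in this sketch. The column names and rows are hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io

def write_tsv(handle, rows):
    """Write a header and the prediction rows as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["token", "label", "score"])  # hypothetical column names
    writer.writerows(rows)

# Hypothetical prediction rows: (token, predicted label, probability).
rows = [("London", "B-Location", 0.98), ("is", "O", 0.99)]

# For a real file: open(path, "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_tsv(buffer, rows)
print(buffer.getvalue())
```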


  • BERT has an advantage over other Machine Learning and Deep Learning models because it is a transformer model pretrained on huge datasets.
  • This saves us a lot of training time.
  • Its disadvantage is that the heavier BERT models are computationally expensive.


More Ideas for Building Stronger NER Models

1. Replace Pretrained embeddings with Contextual Embeddings such as BERT or ELMo


2. Combine Embeddings with Character-Level CNNs or RNNs for handling unseen words


3. Combine Linguistic Features with your Embeddings


4. Add Self-Attention Mechanisms to your RNN



Online Sources:

  1. https://github.com/dmoonat/Named-Entity-Recognition/blob/main/Fine_tune_NER.ipynb
  2. https://medium.com/@andrewmarmon/fine-tuned-named-entity-recognition-with-hugging-face-bert-d51d4cb3d7b5
  3. https://pub.towardsai.net/top-5-approaches-to-named-entity-recognition-ner-in-2022-38afdf022bf1
  4. https://neptune.ai/blog/exploratory-data-analysis-natural-language-processing-tools

Google Colab:



Exploratory Data Analysis:




