Step 1: Description
Step 2: Exploratory Data Analysis and Preprocessing
Step 3: Load the Pretrained Model
Step 4: Prepare Tensor Dataset for Training, Validation and Testing
Step 5: Setting up BERT Pretrained Model
Step 6: Setting Up Optimizer, Loss, and Metrics
Step 7: Train the Extended NN Model
Step 8: Evaluate the New Model with Testing Dataset
Step 9: Fine-tuning hyperparameters and NN Model to improve performance
Step 10: Make Predictions
This project analyzes Twitter comments and classifies them into sentiment categories. The data comes from the Twitter and Reddit Sentimental Analysis Dataset, released under the 'CC BY-NC-SA 4.0 DEED' license, which means we are free to share and adapt this dataset.
We are going to build the model on top of the pretrained model 'bert-base-uncased'.
According to the dataset description, the category column is encoded as follows:
- 0 indicates a Neutral Tweet/Comment
- 1 indicates a Positive Tweet/Comment
- -1 indicates a Negative Tweet/Comment
However, since the loss function expects non-negative class indices, we need to remap the category values to {0: 'Neutral', 1: 'Positive', 2: 'Negative'} (i.e. -1 becomes 2) before training.
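The remapping above can be sketched with pandas. The column names (`clean_text`, `category`) and the tiny in-memory frame are stand-ins for the real CSV:

```python
import pandas as pd

# Tiny stand-in for the real dataset (hypothetical column names).
df = pd.DataFrame({"clean_text": ["ok", "great", "awful"],
                   "category": [0, 1, -1]})

# SparseCategoricalCrossentropy expects class indices 0..num_classes-1,
# so remap -1 (Negative) to 2 and keep 0/1 unchanged.
label_map = {0: 0, 1: 1, -1: 2}
df["category"] = df["category"].map(label_map)

# Keep an inverse mapping for reading predictions later.
id2label = {0: "Neutral", 1: "Positive", 2: "Negative"}
```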
Explore the dataset by:
- checking for null values and deciding whether to impute or drop them
- adding a text_length feature and checking the distribution of text lengths
- checking the distribution of the target label
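The three EDA checks above can be sketched as follows; the toy DataFrame and column names are assumptions standing in for the real data:

```python
import pandas as pd

# Toy stand-in for the real CSV (hypothetical column names).
df = pd.DataFrame({
    "clean_text": ["good movie", None, "bad service", "fine"],
    "category": [1, 0, -1, 0],
})

# Null check: the text column is essential, so drop rather than impute.
print(df.isnull().sum())
df = df.dropna(subset=["clean_text"]).reset_index(drop=True)

# Add a text_length feature and inspect its distribution.
df["text_length"] = df["clean_text"].str.len()
print(df["text_length"].describe())

# Check class balance of the target label.
print(df["category"].value_counts(normalize=True))
```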
Pre-processing:
- Clean the text content by removing hyperlinks, HTML tags, stopwords, etc.
- Load "bert-base-uncased" as the base model
- Load its tokenizer to tokenize the dataset
- Split dataset into training, validation, and testing datasets
- Tokenize text feature in each dataset
- Load datasets into tensor datasets by using tf.data.Dataset.from_tensor_slices
- Create batches for each dataset, and shuffle training dataset
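The dataset pipeline above can be sketched with `tf.data` (note the plural: the actual API is `from_tensor_slices`). The hand-made token IDs below are stand-ins for what the Hugging Face tokenizer would produce, e.g. `tokenizer(texts, padding="max_length", truncation=True, max_length=128)`:

```python
import tensorflow as tf

# Hand-made stand-ins for tokenizer output so the pipeline is visible.
encodings = {
    "input_ids":      [[101, 2204, 102, 0], [101, 2919, 102, 0], [101, 2986, 102, 0]],
    "attention_mask": [[1, 1, 1, 0],        [1, 1, 1, 0],        [1, 1, 1, 0]],
}
labels = [1, 2, 0]

# Pair the tokenized features with labels in one tensor dataset.
train_ds = tf.data.Dataset.from_tensor_slices((dict(encodings), labels))

# Shuffle only the training split, then batch (validation/test are only batched).
train_ds = train_ds.shuffle(buffer_size=len(labels)).batch(2)

for batch_features, batch_labels in train_ds.take(1):
    print(batch_features["input_ids"].shape, batch_labels.shape)
```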
- Set up the pretrained model using TFBertForSequenceClassification
- Get the BertConfig from the pretrained model
- Freeze all layers and remove the last two layers (the dropout and output layers) from the pretrained model
- Add LSTM, BatchNormalization, and Dropout layer blocks on top of the frozen pretrained model
- Add an output Dense layer with 3 units (one per category value) and 'softmax' activation
- Compile the new model with the Adam optimizer, SparseCategoricalCrossentropy loss, and accuracy as the metric
- Train the new model on the training dataset, including the callbacks EarlyStoppingAtMinLoss and CustomCheckpoint
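The training call can be sketched with the built-in Keras `EarlyStopping` and `ModelCheckpoint` callbacks as stand-ins for the project's custom `EarlyStoppingAtMinLoss` and `CustomCheckpoint`; the tiny dense model and random data below are only there to make the snippet self-contained:

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in model and data; in the notebook, fit() runs on custom_model
# with the batched train/validation tf.data datasets.
x = np.random.rand(64, 8).astype("float32")
y = np.random.randint(0, 3, size=(64,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=["accuracy"])

callbacks = [
    # Stop when validation loss stops improving, keeping the best weights.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                     restore_best_weights=True),
    # Save only the best weights seen so far.
    tf.keras.callbacks.ModelCheckpoint("best_model.weights.h5",
                                       monitor="val_loss",
                                       save_best_only=True,
                                       save_weights_only=True),
]

history = model.fit(x, y, validation_split=0.2, epochs=5,
                    batch_size=16, callbacks=callbacks, verbose=0)
```

`history.history` holds the per-epoch loss/accuracy curves used for the learning-curve plots in the next step.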
- Evaluate the new model by checking the result returned by custom_model.evaluate(test_ds)
- Plot Learning Curve of loss and accuracy during training
- Compare the predictions and actual values
- Evaluate with a confusion matrix
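The prediction-vs-actual comparison and confusion matrix can be sketched with scikit-learn. The probability array below is a hand-made stand-in for `custom_model.predict(test_ds)`, and `y_true` for the test split's labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Hand-made stand-ins for model output and test labels.
y_true = np.array([0, 1, 2, 1, 0, 2])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.5, 0.3],   # true Negative predicted Positive
                  [0.1, 0.7, 0.2],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.1, 0.8]])

# Softmax output -> predicted class index.
y_pred = probs.argmax(axis=1)

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
print(classification_report(y_true, y_pred,
                            target_names=["Neutral", "Positive", "Negative"]))
```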
- Update hyperparameters, such as the learning rate and batch_size, to improve the model's performance
- Add/Remove NN layers to improve the performance
- Repeat until the results improve
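The tuning loop above can be sketched as a small grid search; the value ranges are hypothetical, and the tiny dense model stands in for rebuilding and retraining `custom_model` on each trial:

```python
import numpy as np
import tensorflow as tf

# Stand-in data; each trial would really retrain on the BERT train/val splits.
x = np.random.rand(64, 8).astype("float32")
y = np.random.randint(0, 3, size=(64,))

results = {}
for lr in [1e-3, 1e-4]:              # hypothetical learning-rate grid
    for batch_size in [16, 32]:       # hypothetical batch-size grid
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
            tf.keras.layers.Dense(3, activation="softmax"),
        ])
        model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        hist = model.fit(x, y, validation_split=0.25, epochs=2,
                         batch_size=batch_size, verbose=0)
        results[(lr, batch_size)] = min(hist.history["val_loss"])

best = min(results, key=results.get)
print("best (lr, batch_size):", best)
```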
- Tokenize a random sentence so that we can use it in the custom model
- Check the result provided by the new model
- Repeat several times to check the model's performance
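The prediction step can be sketched as follows. The probability row below is a hand-made stand-in for the real call, which would be roughly `enc = tokenizer(sentence, return_tensors="tf", padding="max_length", truncation=True, max_length=128)` followed by `probs = custom_model.predict([enc["input_ids"], enc["attention_mask"]])`:

```python
import numpy as np

# Inverse of the label remapping done in preprocessing.
id2label = {0: "Neutral", 1: "Positive", 2: "Negative"}

# Hand-made stand-in for the softmax output of custom_model.predict(...).
probs = np.array([[0.05, 0.90, 0.05]])

# Softmax output -> class index -> human-readable label.
pred_id = int(probs.argmax(axis=1)[0])
print(id2label[pred_id], f"(confidence {probs[0, pred_id]:.2f})")
```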