Step 1: Description
Step 2: Exploratory Data Analysis and Preprocessing
Step 3: Load the Pretrained Model
Step 4: Prepare Tensor Dataset for Training, Validation and Testing
Step 5: Setting up BERT Pretrained Model
Step 6: Setting Up Optimizer, Loss, and Metrics
Step 7: Train the Extended NN Model
Step 8: Evaluate the New Model with Testing Dataset
Step 9: Fine-tuning hyperparameters and NN Model to improve performance
Step 10: Make Predictions
This project analyzes Twitter comments and classifies them into sentiment categories. The data comes from the Twitter and Reddit Sentimental Analysis Dataset, released under the 'CC BY-NC-SA 4.0 DEED' license, which means we are free to share and adapt this dataset.
We are going to build the model on top of the pretrained model 'bert-base-uncased'.
According to the dataset description, the category column is encoded as follows:
- 0 indicates a Neutral Tweet/Comment
- 1 indicates a Positive Tweet/Comment
- -1 indicates a Negative Tweet/Comment
However, since the loss function expects non-negative class indices, we need to remap the category values to {0: 'Neutral', 1: 'Positive', 2: 'Negative'} (i.e. -1 becomes 2) before training.
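The remapping above can be sketched with pandas. The column names (`clean_text`, `category`) and the tiny in-memory frame are stand-ins for the real CSV:

```python
import pandas as pd

# Tiny stand-in for the real dataset (hypothetical column names).
df = pd.DataFrame({"clean_text": ["ok", "great", "awful"],
                   "category": [0, 1, -1]})

# SparseCategoricalCrossentropy expects class indices 0..num_classes-1,
# so remap -1 (Negative) to 2 and keep 0/1 unchanged.
label_map = {0: 0, 1: 1, -1: 2}
df["category"] = df["category"].map(label_map)

# Keep an inverse mapping for reading predictions later.
id2label = {0: "Neutral", 1: "Positive", 2: "Negative"}
```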
Explore the dataset by:
- checking for null values and deciding whether to impute or drop them
- adding a text_length feature and checking the distribution of text lengths
- checking the distribution of the target label
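The three EDA checks above can be sketched as follows; the toy DataFrame and column names are assumptions standing in for the real data:

```python
import pandas as pd

# Toy stand-in for the real CSV (hypothetical column names).
df = pd.DataFrame({
    "clean_text": ["good movie", None, "bad service", "fine"],
    "category": [1, 0, -1, 0],
})

# Null check: the text column is essential, so drop rather than impute.
print(df.isnull().sum())
df = df.dropna(subset=["clean_text"]).reset_index(drop=True)

# Add a text_length feature and inspect its distribution.
df["text_length"] = df["clean_text"].str.len()
print(df["text_length"].describe())

# Check class balance of the target label.
print(df["category"].value_counts(normalize=True))
```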
Pre-processing:
- Clean the text content by removing hyperlinks, HTML tags, stopwords, etc.
- Load "bert-base-uncased" as the base model
- Load its tokenizer to tokenize the dataset
- Split dataset into training, validation, and testing datasets
- Tokenize text feature in each dataset
- Load datasets into tensor datasets by using tf.data.Dataset.from_tensor_slices
- Create batches for each dataset, and shuffle training dataset
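The dataset pipeline above can be sketched with `tf.data` (note the plural: the actual API is `from_tensor_slices`). The hand-made token IDs below are stand-ins for what the Hugging Face tokenizer would produce, e.g. `tokenizer(texts, padding="max_length", truncation=True, max_length=128)`:

```python
import tensorflow as tf

# Hand-made stand-ins for tokenizer output so the pipeline is visible.
encodings = {
    "input_ids":      [[101, 2204, 102, 0], [101, 2919, 102, 0], [101, 2986, 102, 0]],
    "attention_mask": [[1, 1, 1, 0],        [1, 1, 1, 0],        [1, 1, 1, 0]],
}
labels = [1, 2, 0]

# Pair the tokenized features with labels in one tensor dataset.
train_ds = tf.data.Dataset.from_tensor_slices((dict(encodings), labels))

# Shuffle only the training split, then batch (validation/test are only batched).
train_ds = train_ds.shuffle(buffer_size=len(labels)).batch(2)

for batch_features, batch_labels in train_ds.take(1):
    print(batch_features["input_ids"].shape, batch_labels.shape)
```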
- Set up the pretrained model using TFBertForSequenceClassification
- Get the BertConfig from the pretrained model
- Freeze all layers and remove the last two layers (the dropout and output layers) from the pretrained model
- Add LSTM, BatchNormalization, and Dropout layer blocks on top of the frozen pretrained model
- Add an output Dense layer with 3 units (one per category value) and 'softmax' activation
- Compile the new model with the Adam optimizer, SparseCategoricalCrossentropy loss, and accuracy as the metric
- Train the new model on the training dataset, including the callbacks EarlyStoppingAtMinLoss and CustomCheckpoint
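The training call can be sketched with the built-in Keras `EarlyStopping` and `ModelCheckpoint` callbacks as stand-ins for the project's custom `EarlyStoppingAtMinLoss` and `CustomCheckpoint`; the tiny dense model and random data below are only there to make the snippet self-contained:

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in model and data; in the notebook, fit() runs on custom_model
# with the batched train/validation tf.data datasets.
x = np.random.rand(64, 8).astype("float32")
y = np.random.randint(0, 3, size=(64,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=["accuracy"])

callbacks = [
    # Stop when validation loss stops improving, keeping the best weights.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                     restore_best_weights=True),
    # Save only the best weights seen so far.
    tf.keras.callbacks.ModelCheckpoint("best_model.weights.h5",
                                       monitor="val_loss",
                                       save_best_only=True,
                                       save_weights_only=True),
]

history = model.fit(x, y, validation_split=0.2, epochs=5,
                    batch_size=16, callbacks=callbacks, verbose=0)
```

`history.history` holds the per-epoch loss/accuracy curves used for the learning-curve plots in the next step.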
- Evaluate the new model by checking the result returned by custom_model.evaluate(test_ds)
- Plot Learning Curve of loss and accuracy during training
- Compare the predictions and actual values
- Evaluate with a confusion matrix
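The prediction-vs-actual comparison and confusion matrix can be sketched with scikit-learn. The probability array below is a hand-made stand-in for `custom_model.predict(test_ds)`, and `y_true` for the test split's labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Hand-made stand-ins for model output and test labels.
y_true = np.array([0, 1, 2, 1, 0, 2])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.5, 0.3],   # true Negative predicted Positive
                  [0.1, 0.7, 0.2],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.1, 0.8]])

# Softmax output -> predicted class index.
y_pred = probs.argmax(axis=1)

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
print(classification_report(y_true, y_pred,
                            target_names=["Neutral", "Positive", "Negative"]))
```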
- Update hyperparameters, such as the learning rate and batch_size, to improve the model's performance
- Add/Remove NN layers to improve the performance
- Repeat until the results improve
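The tuning loop above can be sketched as a small grid search; the value ranges are hypothetical, and the tiny dense model stands in for rebuilding and retraining `custom_model` on each trial:

```python
import numpy as np
import tensorflow as tf

# Stand-in data; each trial would really retrain on the BERT train/val splits.
x = np.random.rand(64, 8).astype("float32")
y = np.random.randint(0, 3, size=(64,))

results = {}
for lr in [1e-3, 1e-4]:              # hypothetical learning-rate grid
    for batch_size in [16, 32]:       # hypothetical batch-size grid
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
            tf.keras.layers.Dense(3, activation="softmax"),
        ])
        model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        hist = model.fit(x, y, validation_split=0.25, epochs=2,
                         batch_size=batch_size, verbose=0)
        results[(lr, batch_size)] = min(hist.history["val_loss"])

best = min(results, key=results.get)
print("best (lr, batch_size):", best)
```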
- Tokenize a random sentence so that we can use it in the custom model
- Check the result provided by the new model
- Repeat several times to check the model's performance
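The prediction step can be sketched as follows. The probability row below is a hand-made stand-in for the real call, which would be roughly `enc = tokenizer(sentence, return_tensors="tf", padding="max_length", truncation=True, max_length=128)` followed by `probs = custom_model.predict([enc["input_ids"], enc["attention_mask"]])`:

```python
import numpy as np

# Inverse of the label remapping done in preprocessing.
id2label = {0: "Neutral", 1: "Positive", 2: "Negative"}

# Hand-made stand-in for the softmax output of custom_model.predict(...).
probs = np.array([[0.05, 0.90, 0.05]])

# Softmax output -> class index -> human-readable label.
pred_id = int(probs.argmax(axis=1)[0])
print(id2label[pred_id], f"(confidence {probs[0, pred_id]:.2f})")
```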