kaustubhbhavsar/intent-classification

bert bidirectional-lstm gensim glove-embeddings keras latent-dirichlet-allocation latent-semantic-analysis lstm multiclass-classification multilayer-perceptron-network ner nltk pyldavis python tensorflow tfhub topic-modeling word2vec text-vectorization-layer fastapi

Intentium

Multiclass Classification | Topic Modelling | Model Serving

What Is It?

HARRY: Ron, please play 'Tere Naina'...
RON: Ummm.. Okay.. Playing 'Tere Naina'...

Question is - how did Ron completed the task here? Roughly, he first found the intent of the task, i.e., what needs to be done? Here, intent was 'PlayMusic'. So, Ron thus started a music app and played the above mentioned song.

What if we could automatically find out the intent of the user and responding likewise at earliest thus keeping the user sticking around the app? A win-win situation for both the user and the app creator.

Intent classification is the task of classifying the text into one of the many intents (multiclass-classification).

Aim of the project is to perform multiclass intent classification (and also subtask: topic modelling).

(back to top)

Summary

The data file contains approximately 15,000 data points. These data points are divided into three datasets: training, testing, and validation, and saved in three separate CSV files. Data analysis is performed on the training dataset. Data consists of two columns: text and intent. "Intent" is the target column, which contains 7 classes with an approximately equal distribution. It is also found that the text ranges from 10 to 125 characters approximately, and the range of words in the text ranges from 2 to 25. NER tag that was observed most across the documents: cardinal. There are also numerous "Date", "Person", and "GPE" - NER tags present across document. Further, we also found out the NER tags that dominated per class. Topic modelling is performed using LDA, and results are plotted using pyLDAvis.

Utilizing MLP, LSTM, and BERT (along with various embedding algorithms), intent classification is carried out and comparasion analysis is completed. Summarizing results:

MLP with Integer Encoding

Model 1.1 :

INPUT => EMBEDDING (int encoding) => FLATTEN => FC => RELU => DO => FC => SOFTMAX

Model 1.2 :

  INPUT => EMBEDDING (int encoding) => GAP1D => FC => RELU => DO => FC => SOFTMAX

Model 1.1's flatten layer leads to more trainable parameters compared to model 1.2 that uses GlobalAveragePooling1D. Better performance is achieved by model 1.2.

MLP with Glove Embedding

Model 2.1 :

INPUT => EMBEDDING (Glove Embedding) => GMP1D => FC => RELU => DO => FC => RELU => DO => FC => SOFTMAX

Model 2.2 :

INPUT => EMBEDDING (Glove Embedding) => GAP1D => FC => RELU => DO => FC => RELU => DO => FC => SOFTMAX

Model 2.2 (uses GlobalAveragePool1D) is performing better than model 2.1 (uses GlobalMaxPool1D) in fewer epochs and in less wall time.

LSTM

Model 3.1 (Vanilla LSTM) :

INPUT => EMBEDDING (int encoding) => LSTM => FC => RELU => DO => FC => SOFTMAX

Model 3.2 (BidirectionalLSTM) :

INPUT => EMBEDDING (int encoding) => BiLSTM => FC => RELU => DO => FC => SOFTMAX

Model 3.3 (Stacked LSTM) :

INPUT => EMBEDDING (int encoding) => LSTM => LSTM => FC => RELU => DO => FC => SOFTMAX

Model 3.3 (stacked LSTM) with fewer parameters can achieve approximately similar or even better performance compared to model 3.1 (vanilla LSTM), but requires more training epochs to achieve the same. Model 3.2 (BidirectionalLSTM) achieved better performance compared to other two models.

Small BERT

The model quickly overfits the data. Most likely because the data isn't difficult or sophisticated for a BERT-like model to learn.

Following table summarizes the results:

Model No.	Trainable Params	Epochs	Train Acc	Train Loss	Valid Acc	Valid Loss	Test Acc	Test Loss	Wall Time
1.1	83,975	10	0.9690	0.1017	0.9675	0.1363	0.9671	0.1276	9.2s
1.2	76,039	15	0.9885	0.0437	0.9712	0.1214	0.9710	0.1055	12.9s
2.1	8,775	50	0.9604	0.1161	0.9420	0.1941	0.9406	0.2205	36.8s
2.2	8,775	30	0.9661	0.1063	0.9564	0.1215	0.9562	0.1206	22.1s
3.1	76,695	20	0.9677	0.0901	0.9658	0.3097	0.9617	0.3599	5m31s
3.2	77,799	20	0.9654	0.1004	0.9672	0.1602	0.9632	0.2153	7m29s
3.3	76,159	25	0.9549	0.1240	0.9447	0.3509	0.9499	0.3423	11m29s
4	-	5	0.9885	0.0437	0.9712	0.1214	0.9757	0.1063	2m33s

Overall, model 4, or the BERT model, provides the best performance. However, it has quite broad requirements. With fewer parameters, Model 1.2 produces performance that is nearly identical.

(back to top)

Directory Structure

├── Data Files                            # Data files              
    └── ...         
├── Models                                # Saved models              
    └── ...         
├── Other Files                           # Miscellaneous files
    └── ...
├── 1_data_analysis                       # Data Analysis file
├── 2_topic_modelling                     # Topic Modelling file
├── 3_mlp_intEncoding                     # MLP Classifier (with int encoding) file
├── 4_mlp_gloveEmbedding                  # MLP Classifier (with glove embedding) file
├── 5_lstm                                # LSTM Classifier (vanilla, stacked, bidirectional) file
├── 6_smallBert                           # smallBert Classifier file

(back to top)

Languge and Libraries

Language: Python
Libraries: Tensorflow, Keras, Tensorflow-Hub, Scikit-Learn, Gensim, NLTK, Spacy, PyLDAviz, Re, WordCloud, Matplotlib, Seaborn, Numpy, Pandas.

(back to top)

Final Notes

Notebooks can be run directly on google colab (make sure to upload required .py files in working directory if required).

The codebase has been meticulously documented, incorporating comprehensive docstrings and comments. Please review these annotations, as they provide valuable insights into the functionality and operation of the code.

(back to top)

About

Multiclass Intent Classification using MLP, LSTM, and BERT (subtask: Topic Modelling).

bert bidirectional-lstm gensim glove-embeddings keras latent-dirichlet-allocation latent-semantic-analysis lstm multiclass-classification multilayer-perceptron-network ner nltk pyldavis python tensorflow tfhub topic-modeling word2vec text-vectorization-layer fastapi

Apache License 2.0

Languages

Language:Jupyter Notebook 97.6%Language:PureBasic 2.3%Language:Python 0.1%