ankit-kothari / TF2_Classification


Google Colaboratory

Intended Audience

  • Basic knowledge of TensorFlow 2.0 and Keras
  • Familiarity with RNNs and LSTMs
  • Familiarity with word embeddings

Dataset Cleaning and Extraction

TF2.0 NLP: Part 1 Data Cleaning, Extraction and Topic Modeling

  • The initial dataset has shape (2999999, 3): reviews with columns ['rating', 'title', 'review']
  • Extracted all reviews that mention Amazon, reducing the dataset to (112106, 3)
  • Performed text cleaning.
  • Performed topic modeling on the dataset using NMF and assigned a topic to every review (a minimal sketch follows this list).
  • Filtered the data to the following categories: ['books', 'video-quality', 'refund-and-return', 'movies', 'music', 'games']
  • The classification task below models the data to predict one of these categories.
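
A minimal sketch of the topic-modeling step, assuming scikit-learn's NMF over TF-IDF features; the file name, column name, and topic count here are illustrative, not taken from the repo.

import pandas as pd
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv('amazon_reviews.csv')  # hypothetical file name
tfidf = TfidfVectorizer(max_df=0.95, min_df=5, stop_words='english')
X = tfidf.fit_transform(df['review'])

nmf = NMF(n_components=6, random_state=42)  # illustrative topic count
W = nmf.fit_transform(X)                    # document-topic weight matrix
df['topic'] = W.argmax(axis=1)              # assign each review its dominant topic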

Word Embeddings

Word embeddings encode vector relationships between the words in the corpus. There are a number of options:

  • GloVe, Word2Vec
  • Download the pretrained GloVe vector embeddings
  • Create a word2vector dictionary from the corpus in the dataset
  • Create an embedding matrix (we can restrict the max vocab size); see the sketch after this list
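
A sketch of those steps, assuming the standard glove.6B.100d.txt file; EMBEDDING_DIM and MAX_VOCAB_SIZE are assumed hyperparameter names, and word2index comes from the Tokenizer step in the next section.

import numpy as np

EMBEDDING_DIM = 100      # must match the GloVe file used
MAX_VOCAB_SIZE = 20000   # assumed cap on vocabulary size

# Load the pretrained GloVe vectors into a word -> vector dictionary.
word2vec = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word2vec[values[0]] = np.asarray(values[1:], dtype='float32')

# Fill one embedding-matrix row per word in the tokenizer's word2index.
num_words = min(MAX_VOCAB_SIZE, len(word2index) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word2index.items():
    if i < num_words:
        vector = word2vec.get(word)
        if vector is not None:
            embedding_matrix[i] = vector  # words missing from GloVe stay all-zero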

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/e1f11760-c69d-49bd-811c-f1add9436d8e/A3E12406-60DE-49D9-BCB4-77080F0A8724.jpeg

Tokenizer and Padding

  • Creating text-to-sequence encodings with the TF Keras Tokenizer.
  • Creating a word2index dictionary.
  • Padding to make every sequence a constant length (see the sketch below).
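
A minimal sketch of these steps, assuming train_texts holds the cleaned review strings and MAX_SEQUENCE_LENGTH is the chosen padded length:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_SEQUENCE_LENGTH = 100  # assumed value

tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)
tokenizer.fit_on_texts(train_texts)        # train_texts: list of cleaned reviews
sequences = tokenizer.texts_to_sequences(train_texts)

word2index = tokenizer.word_index          # word -> integer index
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH,
                     padding='pre')        # 'pre' vs 'post' is compared below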

Model Architecture

Terminology

# N = number of samples
# T = sequence length
# D = number of input features (embedding dimension)
# M = number of hidden units
# K = number of output units
# DU = Dense Units

TF2.0 NLP: Part 2 Multi Class Text Classification BiLSTM

Google Colaboratory

In this architecture, the sequence is read once from t = 1 to t = T (the sequence length) and then again from t = T back to t = 1. This proves really helpful in capturing long-term dependencies.
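
The embedding_layer used in the models below is not defined in this snippet; a plausible definition, assuming a frozen Keras Embedding initialized with the GloVe matrix built earlier:

from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    num_words,
    EMBEDDING_DIM,
    weights=[embedding_matrix],       # GloVe matrix from the embeddings section
    input_length=MAX_SEQUENCE_LENGTH,
    trainable=False,                  # keep the pretrained vectors fixed
)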

import tensorflow as tf
from tensorflow.keras.layers import (Input, Bidirectional, LSTM, Dropout,
                                     GlobalMaxPool1D, Dense, BatchNormalization)
from tensorflow.keras.models import Model

input_ = Input(shape=(MAX_SEQUENCE_LENGTH,))                                              # (N, T)
embeddings = embedding_layer(input_)                                                      # (N, T, D)
lstm_1 = Bidirectional(LSTM(128, return_sequences=True, return_state=False))(embeddings)  # (N, T, 2M)
dropout = Dropout(0.3)(lstm_1)
lstm_2 = Bidirectional(LSTM(256, return_sequences=True, return_state=False, dropout=0.3)) # (N, T, 2M); 2M because of the BiLSTM
lstm_layer = lstm_2(dropout)
gmpl = GlobalMaxPool1D(name='gmpl')(lstm_layer)                                           # (N, 2M)
dense = Dense(64, kernel_initializer=tf.keras.initializers.glorot_normal(seed=None),
    kernel_regularizer=tf.keras.regularizers.l1(0.01),
    activity_regularizer=tf.keras.regularizers.l2(0.01), activation='relu')(gmpl)         # (N, DU)
batch_norm = BatchNormalization()(dense)                                                  # (N, DU)
dense_1 = Dense(6, activation='softmax')                                                  # (N, K)
output = dense_1(batch_norm)
model = Model(input_, output)
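
A hedged compile/fit sketch for this model; the optimizer, batch size, and validation split are assumptions, though the 10 epochs match the results section below.

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',  # integer topic labels 0..5
    metrics=['accuracy'],
)
history = model.fit(data, labels,            # labels: topic index per review
                    validation_split=0.2, epochs=10, batch_size=128)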

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/d46d1482-81d7-445f-a8ce-b660014e92e4/Screen_Shot_2020-07-08_at_3.06.24_AM.png

BiLSTM with pre padding Train and Val Loss

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/8160df04-6280-40ec-847f-cf03bbf4346c/Screen_Shot_2020-07-08_at_3.06.55_AM.png

BiLSTM with pre padding Train and Val accuracy

Best Output: accuracy: 0.9552 ; val_accuracy: 0.9474

TF2.0 NLP: Part 3 Multi Class Text Classification LSTM

Google Colaboratory

input_ = Input(shape=(MAX_SEQUENCE_LENGTH,))                                           # (N, T)
embeddings = embedding_layer(input_)                                                   # (N, T, D)
lstm_1 = LSTM(32, return_sequences=True, return_state=False, dropout=0.2)(embeddings)  # (N, T, M)
lstm_2 = LSTM(64, return_sequences=True, return_state=False, dropout=0.2)              # (N, T, M)
lstm_layer = lstm_2(lstm_1)
gmpl = GlobalMaxPool1D(name='gmpl')(lstm_layer)                                        # (N, M)
dense = Dense(6)                                                                       # (N, K); raw logits, no softmax
output = dense(gmpl)
model = Model(input_, output)
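
Note that the final Dense(6) above has no softmax, so its outputs are raw logits; if it is kept that way, the loss must be built with from_logits=True. A sketch:

model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)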

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/de2f2946-9984-4f7d-8424-c19be51a3a7f/Screen_Shot_2020-07-08_at_2.35.40_AM.png

LSTM with pre padding Train and Val Loss

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/4787c4e5-81af-4080-9cd4-6bd8a9bae8df/Screen_Shot_2020-07-08_at_2.35.26_AM.png

LSTM with pre padding Train and Val accuracy

Best Output: accuracy: 0.9216 ; val_accuracy: 0.9295

TF2.0 NLP: Part 4 Multi Class Text Classification BiLSTM with post padding

Google Colaboratory

input_ = Input(shape=(MAX_SEQUENCE_LENGTH,))                                              # (N, T)
embeddings = embedding_layer(input_)                                                      # (N, T, D)
lstm_1 = Bidirectional(LSTM(128, return_sequences=True, return_state=False))(embeddings)  # (N, T, 2M)
dropout = Dropout(0.3)(lstm_1)
lstm_2 = Bidirectional(LSTM(256, return_sequences=True, return_state=False, dropout=0.3)) # (N, T, 2M); 2M because of the BiLSTM
lstm_layer = lstm_2(dropout)
gmpl = GlobalMaxPool1D(name='gmpl')(lstm_layer)                                           # (N, 2M)
dense = Dense(64, kernel_initializer=tf.keras.initializers.glorot_normal(seed=None),
    kernel_regularizer=tf.keras.regularizers.l1(0.01),
    activity_regularizer=tf.keras.regularizers.l2(0.01), activation='relu')(gmpl)         # (N, DU)
batch_norm = BatchNormalization()(dense)                                                  # (N, DU)
dense_1 = Dense(6, activation='softmax')                                                  # (N, K)
output = dense_1(batch_norm)
model = Model(input_, output)
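
This architecture is identical to Part 2; only the padding side changes (padding='post' in pad_sequences). Validation accuracy drops from 0.9474 to 0.9090, which suggests pre-padding suits these recurrent models better: with the zeros in front, the informative tokens sit closest to the final time steps.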

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/21a9c0a4-b0dd-46ec-9851-8dab2e95ee80/Screen_Shot_2020-07-08_at_1.31.21_AM.png

BiLSTM with post padding Train and Val Loss

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/e9a68d76-9164-4235-80cc-76a80998d401/Screen_Shot_2020-07-08_at_1.32.15_AM.png

BiLSTM with post padding Train and Val accuracy

Best Output: accuracy: 0.9099 ; val_accuracy: 0.9090

TF2.0 NLP: Part 5 Multi Class Text Classification CNN-1D

Google Colaboratory

from tensorflow.keras.layers import SpatialDropout1D, Conv1D, concatenate

input_ = Input(shape=(MAX_SEQUENCE_LENGTH,))                             # (N, T)
embeddings = embedding_layer(input_)                                     # (N, T, D)
drop_embed_layer = SpatialDropout1D(0.2, name='drop_embed')(embeddings)

# Three parallel Conv1D branches with kernel sizes 20, 10, and 5.
conv1 = Conv1D(256, 20, strides=1, activation='relu')(drop_embed_layer)
maxp_1 = GlobalMaxPool1D(name='maxp_1')(conv1)

conv2 = Conv1D(256, 10, activation='relu')(drop_embed_layer)
maxp_2 = GlobalMaxPool1D(name='maxp_2')(conv2)

conv3 = Conv1D(256, 5, activation='relu')(drop_embed_layer)
maxp_3 = GlobalMaxPool1D(name='maxp_3')(conv3)

concat = concatenate([maxp_1, maxp_2, maxp_3])                           # (N, 3*256)

dense = Dense(64, kernel_initializer=tf.keras.initializers.glorot_normal(seed=None),
    kernel_regularizer=tf.keras.regularizers.l1(0.01),
    activity_regularizer=tf.keras.regularizers.l2(0.05), activation='relu')(concat)
dropout_1 = Dropout(0.2)(dense)

output = Dense(6, activation='softmax')(dropout_1)                       # (N, K)
model = Model(input_, output)
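
Design note: the three parallel convolution branches act like n-gram detectors at different scales (roughly 20-, 10-, and 5-token patterns), and global max pooling keeps only the strongest activation of each filter regardless of where it fired, so the classifier sees position-independent evidence from all three scales.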

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/b7833787-a327-4ad4-8b95-b10205295d3e/Screen_Shot_2020-07-08_at_3.46.30_AM.png

Multi-Layer CNN-1D with pre-padding Train and Val Loss

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/273988b6-7ce0-4df0-8d91-ad0b3fabc66e/Screen_Shot_2020-07-08_at_3.46.20_AM.png

Multi-Layer CNN-1D with pre-padding Train and Val accuracy

Output: accuracy: 0.9078 ; val_accuracy: 0.9144

Model Outputs (epochs: 10)

Model outputs by architecture:

  Architecture             accuracy    val_accuracy
  BiLSTM (pre padding)     0.9552      0.9474
  LSTM                     0.9216      0.9295
  BiLSTM (post padding)    0.9099      0.9090
  CNN-1D (pre padding)     0.9078      0.9144

About

License: MIT License

