- Working knowledge of TensorFlow 2.0 and Keras basics
- Knowledge of LSTMs and RNNs
- Knowledge of word embeddings
NLP Part 1: Data Cleaning, Extraction, and Topic Modeling
- The initial dataset has shape (2999999, 3): 2,999,999 reviews with columns ['rating', 'title', 'review']
- Extracted all reviews that mention Amazon, reducing the dataset to (112106, 3)
- Performed text cleaning.
- Performed topic modeling on the dataset using NMF (non-negative matrix factorization) and assigned a topic to every review (a minimal sketch follows this list).
- Filtered the data to the following categories: ['books', 'video-quality', 'refund-and-return', 'movies', 'music', 'games']
- The classification task below models the data to predict one of these six categories.
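A minimal sketch of the NMF step, using scikit-learn. The file path, the TF-IDF settings, and the choice of 6 components are illustrative assumptions, not the original configuration:

```python
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical input file; assumed columns: rating, title, review
df = pd.read_csv('reviews.csv')

# TF-IDF features over the cleaned review text
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
X = tfidf.fit_transform(df['review'])

# Factorize into topics; 6 components is an assumption matching the final categories
nmf = NMF(n_components=6, random_state=42)
W = nmf.fit_transform(X)          # document-topic weight matrix
df['topic'] = W.argmax(axis=1)    # assign each review its dominant topic
```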
A word embedding creates a vector relationship between the words in the corpus. There are a number of options:
- GloVe, Word2Vec
- Download the pretrained GloVe vector embeddings.
- Create a word-to-vector dictionary covering the words in the corpus.
- Create an embedding matrix (we can restrict the max vocab size).
- Convert text to sequences using the TF Keras tokenizer.
- Create a word2index dictionary.
- Pad the sequences so every sample has a constant length.
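A minimal sketch of the embedding pipeline above, assuming 100-dimensional GloVe vectors (glove.6B.100d.txt) and illustrative values for MAX_VOCAB_SIZE and MAX_SEQUENCE_LENGTH:

```python
import numpy as np
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_VOCAB_SIZE = 20000       # assumed vocabulary cap
MAX_SEQUENCE_LENGTH = 100    # assumed padded length T
EMBEDDING_DIM = 100          # matches glove.6B.100d.txt

# Pretrained GloVe vectors -> word-to-vector dictionary
word2vec = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word2vec[values[0]] = np.asarray(values[1:], dtype='float32')

# Text -> integer sequences with the TF Keras tokenizer
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)
tokenizer.fit_on_texts(df['review'])                    # df from the topic-modeling step
sequences = tokenizer.texts_to_sequences(df['review'])
word2index = tokenizer.word_index                       # word2index dictionary

# Pre-padding gives every sample a constant length T
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='pre')

# Embedding matrix: row i holds the GloVe vector for the word with index i
num_words = min(MAX_VOCAB_SIZE, len(word2index) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word2index.items():
    if i < num_words and word in word2vec:
        embedding_matrix[i] = word2vec[word]

# Frozen embedding layer reused by all the models below
embedding_layer = Embedding(num_words, EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
```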
- N = number of samples
- T = sequence length
- D = number of input features (embedding dimension)
- M = number of hidden units
- K = number of output units
- DU = dense units
In this architecture we read the sequence once from t = 1 to t = T (the sequence length) and then again from t = T back to t = 1. This proves helpful in capturing long-term dependencies.
# Imports shared by all the models below
import tensorflow as tf
from tensorflow.keras.layers import (Input, LSTM, Bidirectional, Dense, Dropout,
                                     BatchNormalization, GlobalMaxPool1D,
                                     SpatialDropout1D, Conv1D, concatenate)
from tensorflow.keras.models import Model

input_ = Input(shape=(MAX_SEQUENCE_LENGTH,))  # N x T
embeddings = embedding_layer(input_)  # N x T x D
lstm_1 = Bidirectional(LSTM(128, return_sequences=True, return_state=False))(embeddings)  # N x T x 2M
dropout = Dropout(0.3)(lstm_1)
lstm_2 = Bidirectional(LSTM(256, return_sequences=True, return_state=False, dropout=0.3))  # N x T x 2M (2M because of the BiLSTM)
lstm_layer = lstm_2(dropout)
gmpl = GlobalMaxPool1D(name='gmpl')(lstm_layer)  # N x 2M
dense = Dense(64, kernel_initializer=tf.keras.initializers.glorot_normal(seed=None),
              kernel_regularizer=tf.keras.regularizers.l1(0.01),
              activity_regularizer=tf.keras.regularizers.l2(0.01), activation='relu')(gmpl)  # N x DU
batch_norm = BatchNormalization()(dense)  # N x DU
dense_1 = Dense(6, activation='softmax')
output = dense_1(batch_norm)  # N x K
model = Model(input_, output)
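A sketch of how this model might be compiled and trained; the optimizer, batch size, epoch count, and the `labels` name (standing in for the integer topic ids) are assumptions, not the original training setup:

```python
# Hypothetical training call; `labels` are the integer topic ids from the NMF step
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # softmax output, integer labels
              metrics=['accuracy'])
history = model.fit(data, labels,
                    validation_split=0.2, batch_size=128, epochs=10)
```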
BiLSTM with pre-padding: train and validation loss
BiLSTM with pre-padding: train and validation accuracy
Best Output: accuracy: 0.9552; val_accuracy: 0.9474
input_ = Input(shape=(MAX_SEQUENCE_LENGTH,))  # N x T
embeddings = embedding_layer(input_)  # N x T x D
lstm_1 = LSTM(32, return_sequences=True, return_state=False, dropout=0.2)(embeddings)  # N x T x M
lstm_2 = LSTM(64, return_sequences=True, return_state=False, dropout=0.2)  # N x T x M
lstm_layer = lstm_2(lstm_1)
gmpl = GlobalMaxPool1D(name='gmpl')(lstm_layer)  # N x M
dense = Dense(6)  # N x K; no activation, so the outputs are logits
output = dense(gmpl)
model = Model(input_, output)
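Note that the final Dense(6) layer has no softmax, so this model emits raw logits and the loss must account for that. A minimal sketch:

```python
# Dense(6) above has no activation, so enable from_logits in the loss
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
```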
LSTM with pre-padding: train and validation loss
LSTM with pre-padding: train and validation accuracy
Best Output: accuracy: 0.9216; val_accuracy: 0.9295
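The next variant changes only the padding side. A minimal sketch of the switch, reusing the tokenizer output from earlier:

```python
# Pre-padding puts the zeros before the tokens; post-padding puts them after
data_pre = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='pre')
data_post = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
```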
input_ = Input(shape=(MAX_SEQUENCE_LENGTH,))  # N x T
embeddings = embedding_layer(input_)  # N x T x D
lstm_1 = Bidirectional(LSTM(128, return_sequences=True, return_state=False))(embeddings)  # N x T x 2M
dropout = Dropout(0.3)(lstm_1)
lstm_2 = Bidirectional(LSTM(256, return_sequences=True, return_state=False, dropout=0.3))  # N x T x 2M (2M because of the BiLSTM)
lstm_layer = lstm_2(dropout)
gmpl = GlobalMaxPool1D(name='gmpl')(lstm_layer)  # N x 2M
dense = Dense(64, kernel_initializer=tf.keras.initializers.glorot_normal(seed=None),
              kernel_regularizer=tf.keras.regularizers.l1(0.01),
              activity_regularizer=tf.keras.regularizers.l2(0.01), activation='relu')(gmpl)  # N x DU
batch_norm = BatchNormalization()(dense)  # N x DU
dense_1 = Dense(6, activation='softmax')
output = dense_1(batch_norm)  # N x K
model = Model(input_, output)
BiLSTM with post-padding: train and validation loss
BiLSTM with post-padding: train and validation accuracy
Best Output: accuracy: 0.9099; val_accuracy: 0.9090
input_ = Input(shape=(MAX_SEQUENCE_LENGTH,))  # N x T
embeddings = embedding_layer(input_)  # N x T x D
drop_embed_layer = SpatialDropout1D(0.2, name='drop_embed')(embeddings)
conv1 = Conv1D(256, 20, strides=1, activation='relu')(drop_embed_layer)
maxp_1 = GlobalMaxPool1D(name='maxp_1')(conv1)
conv2 = Conv1D(256, 10, activation='relu')(drop_embed_layer)
maxp_2 = GlobalMaxPool1D(name='maxp_2')(conv2)
conv3 = Conv1D(256, 5, activation='relu')(drop_embed_layer)
maxp_3 = GlobalMaxPool1D(name='maxp_3')(conv3)
concat = concatenate([maxp_1, maxp_2, maxp_3])  # N x 768 (3 branches x 256 filters)
dense = Dense(64, kernel_initializer=tf.keras.initializers.glorot_normal(seed=None),
              kernel_regularizer=tf.keras.regularizers.l1(0.01),
              activity_regularizer=tf.keras.regularizers.l2(0.05), activation='relu')(concat)
dropout = Dropout(0.2)(dense)
output = Dense(6, activation='softmax')(dropout)  # N x K; K = 6 categories
model = Model(input_, output)
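The three parallel Conv1D branches act as n-gram detectors over 20-, 10-, and 5-token windows; global max pooling keeps the strongest activation of each of the 256 filters per branch, and concatenating the branches yields a 3 × 256 = 768-dimensional feature vector for the dense classifier.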
Multi-Layer CNN-1D with pre-padding: train and validation loss
Multi-Layer CNN-1D with pre-padding: train and validation accuracy
Output: accuracy: 0.9078; val_accuracy: 0.9144