inspirehep / magpie

Deep neural network framework for multi-label text classification

Senseless predictions on the 20_newsgroups dataset

dfesenko opened this issue · comments

I have an issue trying to perform text classification on the 20_newsgroups dataset loaded from sklearn. Only 6 of the newsgroups were selected for this case, so I have only 6 labels.
I get very low accuracy on the test set. Then I noticed that Magpie predicts the same label for every input; only the confidence scores differ. When I vary the number of epochs and the vector dimension, the model starts to predict 2-3 different labels, but performance is still very low (around 15% accuracy). What could be wrong here? A model that predicts the same output for any input is useless.

I have the texts in a variable X and the labels in a variable y. Then I create a folder data_six, where I save each text and its label as separate .txt and .lab files with this code:

counter = 1
for i in range(len(X)):
    if y[i] in codes_to_leave:
        name_text = "data_six/" + str(counter) + ".txt"
        name_label = "data_six/" + str(counter) + ".lab"
        with open(name_text, 'w') as f1:
            f1.write(X[i])
        with open(name_label, 'w') as f2:
            f2.write(str(y[i]))
        counter += 1
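(For context: codes_to_leave above, and the label_dict used in the evaluation function below, are not defined in the snippets. Assuming label_dict maps each category name to the integer code that gets written into the .lab files, a purely illustrative setup might look like this — the actual codes in my run may differ:)

```python
# Illustrative only: the real codes_to_leave / label_dict are not shown
# in this issue. Here each category name is mapped to an arbitrary
# integer code, and codes_to_leave is the set of those codes.
labels = ['comp.sys.mac.hardware', 'misc.forsale', 'rec.sport.hockey',
          'sci.med', 'soc.religion.christian', 'talk.politics.mideast']

# label name -> integer code, as predict_and_evaluate expects
label_dict = {name: code for code, name in enumerate(labels)}
codes_to_leave = set(label_dict.values())

print(label_dict['rec.sport.hockey'])  # -> 2
```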

Then I train word2vec embeddings and the model:

magpie.train_word2vec('data_six', vec_dim=300)
magpie.fit_scaler('data_six')
labels = ['comp.sys.mac.hardware', 'misc.forsale', 'rec.sport.hockey', 
          'sci.med', 'soc.religion.christian', 'talk.politics.mideast']
magpie.train('data_six', labels, test_ratio=0.2, epochs=10)

These are the outputs from the training process:

Train on 4691 samples, validate on 1173 samples
Epoch 1/10
4691/4691 [==============================] - 59s 13ms/step - loss: 0.0382 - top_k_categorical_accuracy: 0.8397 - val_loss: 3.2556e-06 - val_top_k_categorical_accuracy: 0.7101
Epoch 2/10
4691/4691 [==============================] - 58s 12ms/step - loss: 8.8344e-06 - top_k_categorical_accuracy: 0.7883 - val_loss: 3.1788e-06 - val_top_k_categorical_accuracy: 0.7153
Epoch 3/10
4691/4691 [==============================] - 58s 12ms/step - loss: 8.3641e-06 - top_k_categorical_accuracy: 0.7870 - val_loss: 3.1237e-06 - val_top_k_categorical_accuracy: 0.7306
Epoch 4/10
4691/4691 [==============================] - 58s 12ms/step - loss: 8.2958e-06 - top_k_categorical_accuracy: 0.7990 - val_loss: 3.0603e-06 - val_top_k_categorical_accuracy: 0.7442
Epoch 5/10
4691/4691 [==============================] - 58s 12ms/step - loss: 8.1809e-06 - top_k_categorical_accuracy: 0.8017 - val_loss: 2.9892e-06 - val_top_k_categorical_accuracy: 0.7621
Epoch 6/10
4691/4691 [==============================] - 59s 13ms/step - loss: 7.8731e-06 - top_k_categorical_accuracy: 0.8128 - val_loss: 2.9141e-06 - val_top_k_categorical_accuracy: 0.7724
Epoch 7/10
4691/4691 [==============================] - 58s 12ms/step - loss: 7.5711e-06 - top_k_categorical_accuracy: 0.8075 - val_loss: 2.8374e-06 - val_top_k_categorical_accuracy: 0.7877
Epoch 8/10
4691/4691 [==============================] - 59s 12ms/step - loss: 7.7605e-06 - top_k_categorical_accuracy: 0.7996 - val_loss: 2.7545e-06 - val_top_k_categorical_accuracy: 0.7988
Epoch 9/10
4691/4691 [==============================] - 58s 12ms/step - loss: 7.2885e-06 - top_k_categorical_accuracy: 0.8220 - val_loss: 2.6719e-06 - val_top_k_categorical_accuracy: 0.8107
Epoch 10/10
4691/4691 [==============================] - 62s 13ms/step - loss: 6.9731e-06 - top_k_categorical_accuracy: 0.8148 - val_loss: 2.5892e-06 - val_top_k_categorical_accuracy: 0.8252

Then I use the following function for making predictions and measuring accuracy:

import os

def predict_and_evaluate(data_folder):
    filenames = os.listdir(data_folder)
    count_true = 0
    count_true_in_3 = 0
    count_all = 0
    for filename in filenames:
        if filename.endswith('.txt'):
            count_all += 1
            # Read the sample from the folder we list, not a hardcoded path
            prediction_list = magpie.predict_from_file(os.path.join(data_folder, filename))
            # Sort by confidence so the top-1 and top-3 picks are well defined
            prediction_list = sorted(prediction_list, key=lambda x: x[1], reverse=True)
            prediction_name = prediction_list[0][0]
            prediction_code = label_dict[prediction_name]
            print(prediction_code)
            top3_codes = [label_dict[name] for name, _ in prediction_list[:3]]
            with open(os.path.join(data_folder, filename[:-3] + 'lab'), 'r') as f:
                y_true = int(f.read())
            if y_true == prediction_code:
                count_true += 1
            if y_true in top3_codes:
                count_true_in_3 += 1
    accuracy = float(count_true) / float(count_all)
    accuracy_top_3 = float(count_true_in_3) / float(count_all)
    return accuracy, accuracy_top_3

As a result, I get "misc.forsale" or "rec.sport.hockey" for all inputs (i.e. these categories get the highest probability for any input). When I change the number of epochs and/or the vector dimension, other categories such as soc.religion.christian may be predicted instead, but the pattern is the same: one prediction for every input.
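A quick way to confirm the collapse is to tally the top predictions over the test files. A minimal sketch (the predictions list here is hypothetical, standing in for the label names collected by the loop above):

```python
from collections import Counter

# Hypothetical top-1 predictions collected during evaluation
predictions = ['misc.forsale', 'misc.forsale', 'rec.sport.hockey', 'misc.forsale']

# On 6 roughly balanced classes, a healthy model spreads its predictions;
# a degenerate one puts almost everything on one or two labels
tally = Counter(predictions)
print(tally.most_common())  # -> [('misc.forsale', 3), ('rec.sport.hockey', 1)]
```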

Can somebody please tell me what may be the reason for this weird behavior?