marcotcr / anchor

Code for "High-Precision Model-Agnostic Explanations" paper


Is it possible to use AnchorText with Tokenizer instead of CountVectorizer?

Enantiodromis opened this issue

Good afternoon. Thank you for such a great package!

Is it possible to implement the AnchorText explainer with a model that takes in Tokenizer.texts_to_sequences data?

My current implementation:

# Build a reverse dictionary mapping token ids back to words
# (word_index is the fitted Keras Tokenizer's word_index)
reverse_word_map = dict(map(reversed, word_index.items()))

# Takes a tokenized sentence (a list of token ids) and returns the words
def sequence_to_text(list_of_indices):
    # Look up each token id in the reverse dictionary
    words = [reverse_word_map.get(index) for index in list_of_indices]
    return words

my_texts = np.array(list(map(sequence_to_text, X_test_encoded)))
test_text = ' '.join(my_texts[4])

def wrapped_predict(strings):
    print(strings)
    cnn_rep = tokenizer.texts_to_sequences(strings)
    text_data = pad_sequences(cnn_rep, maxlen=30)
    print(text_data)
    prediction = model.predict(text_data)
    print(prediction)
    return prediction

nlp = spacy.load('en_core_web_sm')
explainer = AnchorText(nlp, ['negative', 'positive'], use_unk_distribution=True)
exp = explainer.explain_instance(test_text, wrapped_predict, threshold=0.95)

And the current output is:

['war clan versus clan touch zenith final showdown bridge body count countless demons magic swords priests versus buddhist monks beautiful visions provided maestro rest good japanese flick rainy summer night']
[[  181  6818  3962  6818  1039 19084   332  4277  2956   519  1415  3404
   2136  1193  8736  8834  3962 14769  8249   197  5440  1925 15445   245
      5   766   356  6073  1320   195]]
[[0.50682825]]
['UNK UNK UNK clan touch UNK final showdown bridge UNK UNK countless UNK UNK UNK priests UNK UNK monks beautiful UNK provided UNK rest UNK japanese UNK rainy UNK UNK']
[[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0 6818 1039  332 4277 2956 3404 8834 8249  197 1925  245
   766 6073]]
[[0.50716233]]

Error being thrown:

ValueError: all the input arrays must have the same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)

It appears to be working, sort of. I am not really sure how to work around this error, if that is even possible; any help would be greatly appreciated.

Your wrapper is fine. The problem is that we expect wrapped_predict to return a 1D array of integer predictions, and yours is returning a 2D (n, 1) array with the probability of class 1 (I'm guessing). Just make sure wrapped_predict(['a', 'b', 'c']) returns something that looks like np.array([1, 0, 1]).

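For concreteness, here is a minimal, self-contained sketch of the conversion this implies (the probability values are made up for illustration):

import numpy as np

# A sigmoid-output model like the one above returns an (n, 1) array
# of class-1 probabilities, one row per input string
probs = np.array([[0.507], [0.231], [0.912]])

# anchor expects a 1D integer array of predicted class labels
labels = (probs[:, 0] > 0.5).astype(int)
print(labels)  # -> [1 0 1]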

Thanks for the reply! Changing the wrapper implementation per your suggestion worked great.

The wrapper I am using now (posting for anyone who might encounter similar trouble):

    def wrapped_predict(strings):
        cnn_rep = tokenizer.texts_to_sequences(strings)
        text_data = pad_sequences(cnn_rep, maxlen=30)
        prediction = model.predict(text_data)
        # Threshold the (n, 1) probabilities and flatten to the
        # 1D integer array that anchor expects, one label per string
        predicted_class = np.where(prediction > 0.5, 1, 0).flatten()
        return predicted_class
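
With that wrapper in place, the explain_instance call from above should run. For completeness, the returned explanation can then be inspected the same way as in the repo's README (names() gives the anchor words, precision() the estimated precision):

    exp = explainer.explain_instance(test_text, wrapped_predict, threshold=0.95)
    print('Anchor: %s' % (' AND '.join(exp.names())))
    print('Precision: %.2f' % exp.precision())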