Is it possible to use AnchorText with Tokenizer instead of CountVectorizer?
Enantiodromis opened this issue · comments
Good afternoon. Thank you for such a great package!
Is it possible to use the AnchorText explainer with a model that takes Tokenizer.texts_to_sequences data as input?
My current implementation:
```python
# Create a reverse dictionary mapping token indices back to words
reverse_word_map = dict(map(reversed, word_index.items()))

# Function takes a tokenized sentence and returns the words
def sequence_to_text(list_of_indices):
    # Look up each index in the reverse dictionary
    words = [reverse_word_map.get(index) for index in list_of_indices]
    return words

my_texts = np.array(list(map(sequence_to_text, X_test_encoded)))
test_text = ' '.join(my_texts[4])
```
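As a sanity check, the reverse mapping can be exercised on a toy `word_index` (the dictionary below is made up purely to illustrate the lookup; a real one comes from a fitted Keras `Tokenizer`):

```python
# Hypothetical word_index, shaped like what a Keras Tokenizer produces
word_index = {'good': 1, 'japanese': 2, 'flick': 3}

# Reverse dictionary: token index -> word
reverse_word_map = dict(map(reversed, word_index.items()))

def sequence_to_text(list_of_indices):
    # Indices not in the vocabulary map to None via dict.get
    return [reverse_word_map.get(index) for index in list_of_indices]

print(sequence_to_text([1, 2, 3]))  # ['good', 'japanese', 'flick']
```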
```python
def wrapped_predict(strings):
    print(strings)
    cnn_rep = tokenizer.texts_to_sequences(strings)
    text_data = pad_sequences(cnn_rep, maxlen=30)
    print(text_data)
    prediction = model.predict(text_data)
    print(prediction)
    return prediction

nlp = spacy.load('en_core_web_sm')
explainer = AnchorText(nlp, ['negative', 'positive'], use_unk_distribution=True)
exp = explainer.explain_instance(test_text, wrapped_predict, threshold=0.95)
```
And the current output is:
```
['war clan versus clan touch zenith final showdown bridge body count countless demons magic swords priests versus buddhist monks beautiful visions provided maestro rest good japanese flick rainy summer night']
[[  181  6818  3962  6818  1039 19084   332  4277  2956   519  1415  3404
   2136  1193  8736  8834  3962 14769  8249   197  5440  1925 15445   245
      5   766   356  6073  1320   195]]
[[0.50682825]]
['UNK UNK UNK clan touch UNK final showdown bridge UNK UNK countless UNK UNK UNK priests UNK UNK monks beautiful UNK provided UNK rest UNK japanese UNK rainy UNK UNK']
[[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
      0    0    0 6818 1039  332 4277 2956 3404 8834 8249  197 1925  245
    766 6073]]
[[0.50716233]]
```
Error being thrown:
```
ValueError: all the input arrays must have the same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)
```
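For reference, this kind of `ValueError` is raised by NumPy whenever arrays of different ranks are joined; a minimal reproduction of just the NumPy behavior (not the explainer internals), using made-up arrays:

```python
import numpy as np

a = np.array([[0.5]])  # 2D, shape (1, 1) -- like the (n, 1) model output
b = np.array([1.0])    # 1D, shape (1,)

# Concatenating a 2D array with a 1D array triggers the same error message
try:
    np.concatenate([a, b])
except ValueError as err:
    print(err)
```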
It appears to be working... sort of. I am not really sure how, or if it is possible, to work around this error; any help would be greatly appreciated.
Your wrapper is fine. The problem is that we expect `wrapped_predict` to return a 1D array of integer predictions, and yours is returning a 2D `(n, 1)` array with the probability of class 1 (I'm guessing). Just make sure `wrapped_predict(['a', 'b', 'c'])` returns something that looks like `np.array([1, 0, 1])`.
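In other words, the `(n, 1)` probability output needs to be thresholded and flattened; a minimal sketch of the shape conversion, using a made-up probability array:

```python
import numpy as np

# Made-up (n, 1) sigmoid output, as model.predict would return for 3 strings
probs = np.array([[0.7], [0.2], [0.9]])

# Threshold at 0.5 and flatten to the 1D integer array the explainer expects
labels = np.where(probs > 0.5, 1, 0).ravel()
print(labels)        # [1 0 1]
print(labels.shape)  # (3,)
```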
Thanks for the reply! Changing the wrapper implementation per your suggestion worked great.
The wrapper I am using now (posting for anyone who might encounter similar trouble):
```python
def wrapped_predict(strings):
    cnn_rep = tokenizer.texts_to_sequences(strings)
    text_data = pad_sequences(cnn_rep, maxlen=30)
    prediction = model.predict(text_data)
    # Threshold the (n, 1) probabilities and flatten to 1D integer labels
    predicted_class = np.where(prediction > 0.5, 1, 0).flatten()
    return predicted_class
```