Error when using save_to_file() - IndexError: [E040] Attempt to access token at x, max length x.
Enantiodromis opened this issue
Hey!
I am encountering an error that I cannot seem to resolve. It does not occur on every run of an explanation, but it does more often than not, and it is always raised from the save_to_file function.
CODE SNIPPET

```python
import numpy as np
import spacy
from keras.preprocessing.sequence import pad_sequences
from anchor.anchor_text import AnchorText

####################
# ANCHOR EXPLAINER #
####################
def anchor_explainer(X_test_encoded, model, word_index, tokenizer):
    # Creating a reverse dictionary
    reverse_word_map = dict(map(reversed, word_index.items()))

    # Takes a tokenized sentence and returns the corresponding words
    def sequence_to_text(list_of_indices):
        # Looking up words in the dictionary
        words = [reverse_word_map.get(index) for index in list_of_indices]
        return words

    my_texts = np.array(list(map(sequence_to_text, X_test_encoded)))

    def wrapped_predict(strings):
        cnn_rep = tokenizer.texts_to_sequences(strings)
        text_data = pad_sequences(cnn_rep, maxlen=30)
        prediction = model.predict(text_data)
        predicted_class = np.where(prediction > 0.5, 1, 0)[0]
        return predicted_class

    test_text = ' '.join(my_texts[6])
    nlp = spacy.load('en_core_web_sm')
    explainer = AnchorText(nlp, ['negative', 'positive'], use_unk_distribution=True)
    exp = explainer.explain_instance(test_text, wrapped_predict, threshold=0.95)
    exp.save_to_file("text_explanations/anchors_text_explanations/lime_test_data3.html")
```
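One detail worth noting in the snippet above: Keras' `pad_sequences(..., maxlen=30)` not only pads short sequences but also truncates longer ones, by default from the front. A pure-Python sketch of that behaviour (`pad_like` is a hypothetical stand-in, not the Keras function itself):

```python
# Hypothetical illustration of Keras' default pad_sequences behaviour
# (padding='pre', truncating='pre'); pad_like is not a real Keras API.
def pad_like(seq, maxlen, value=0):
    if len(seq) >= maxlen:
        return seq[-maxlen:]  # truncate: keep only the LAST maxlen items
    return [value] * (maxlen - len(seq)) + seq  # pad: prepend zeros

print(pad_like([1, 2, 3], 5))          # [0, 0, 1, 2, 3]
print(pad_like(list(range(40)), 30))   # tokens 10..39; the first 10 are dropped
```

So any input longer than 30 tokens silently loses tokens inside `wrapped_predict`, which matters for the diagnosis further down this thread.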
ERROR MESSAGE

```
Traceback (most recent call last):
  File "c:/Users/.../Documents/Visual Studio Code Workspace/xai_classification_mixed_data/code/text_classification/anchor_text_explanation.py", line 66, in <module>
    anchor_explainer(X_test, model, word_index, tokenizer)
  File "c:/Users/.../Documents/Visual Studio Code Workspace/xai_classification_mixed_data/code/text_classification/anchor_text_explanation.py", line 42, in anchor_explainer
    exp.save_to_file("text_explanations/anchors_text_explanations/lime_test_data3.html", )
  File "C:\Users\...\Anaconda3\envs\shap_text\lib\site-packages\anchor\anchor_explanation.py", line 108, in save_to_file
    out = self.as_html(**kwargs)
  File "C:\Users\...\Anaconda3\envs\shap_text\lib\site-packages\anchor\anchor_explanation.py", line 100, in as_html
    return self.as_html_fn(self.exp_map, **kwargs)
  File "C:\Users\...\Anaconda3\envs\shap_text\lib\site-packages\anchor\anchor_text.py", line 219, in as_html
    example_obj.append(process_examples(examples, i))
  File "C:\Users\...\Anaconda3\envs\shap_text\lib\site-packages\anchor\anchor_text.py", line 212, in process_examples
    raw_indexes = [(processed[i].text, processed[i].idx, exp['prediction']) for i in idxs]
  File "C:\Users\...\Anaconda3\envs\shap_text\lib\site-packages\anchor\anchor_text.py", line 212, in <listcomp>
    raw_indexes = [(processed[i].text, processed[i].idx, exp['prediction']) for i in idxs]
  File "spacy\tokens\doc.pyx", line 463, in spacy.tokens.doc.Doc.__getitem__
  File "spacy\tokens\token.pxd", line 23, in spacy.tokens.token.Token.cinit
IndexError: [E040] Attempt to access token at 26, max length 26.
```
I am not sure whether the error lies with my implementation or elsewhere... perhaps a tokenizer issue?
Any insight as always would be greatly appreciated!
Thanks!
I got the same error as @Enantiodromis and tried to fix it as shown in the commit above.

The error is raised in the following line:

```python
raw_indexes = [(processed[i].text, processed[i].idx, exp['prediction']) for i in idxs]
```

in cases where the input string is truncated (see the commit details), which produces a string with fewer tokens than the original input. When the anchor index points at the last token (or one of the last tokens) of the input, that token is lost in the truncated copy, raising the IndexError above ("Attempt to access token at 26, max length 26").
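A minimal, self-contained sketch of that failure mode, using a plain token list in place of a spaCy Doc (the sentence, the variable names, and the filtering guard at the end are all illustrative assumptions, not the library's actual patch):

```python
# Hypothetical reproduction: an anchor index computed on the full text
# is applied to a truncated copy with fewer tokens.
tokens_full = "the movie was long but the ending was great indeed".split()
idxs = [len(tokens_full) - 1]        # anchor on the last token (index 9)
tokens_truncated = tokens_full[:-1]  # truncation drops that token (9 left)

# Indexing the truncated list with the stale index fails, mirroring
# spaCy's "[E040] Attempt to access token at 9, max length 9":
try:
    tokens_truncated[idxs[0]]
except IndexError:
    pass  # this is exactly what save_to_file runs into

# One possible defensive guard (an assumption, not anchor's actual fix):
# skip indices that fall outside the truncated text.
raw_indexes = [(tokens_truncated[i], i) for i in idxs if i < len(tokens_truncated)]
print(raw_indexes)  # -> [] : the out-of-range anchor token is dropped
```

Whether dropping the out-of-range token is acceptable, versus avoiding the truncation in the first place, depends on the fix chosen in the commit.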