stanfordnlp / stanfordnlp

[Deprecated] This library has been renamed to "Stanza". Latest development at: https://github.com/stanfordnlp/stanza


RuntimeError: "index_select_out_cuda_impl" not implemented for 'Float'

AnyaMit opened this issue · comments

commented

Describe the bug
Getting a RuntimeError: "index_select_out_cuda_impl" not implemented for 'Float'

To Reproduce
## Dataset

Download the zip file

path_to_zip = tf.keras.utils.get_file("smsspamcollection.zip",origin="https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip",extract=True)

Unzip the file into a folder

!unzip $path_to_zip -d data

# read the extracted file (the zip contains a single file named SMSSpamCollection)
lines = open('data/SMSSpamCollection').read().strip().split('\n')

spam_dataset = []
for line in lines:
    label, text = line.split('\t')
    if label.strip() == 'spam':
        spam_dataset.append((1, text.strip()))
    else:
        spam_dataset.append((0, text.strip()))
print(spam_dataset)

#process the df
import pandas as pd

df = pd.DataFrame(spam_dataset, columns=['Spam','Message'])

import re

def message_length(x):
    return len(x)

def num_capitals(x):
    _, count = re.subn(r'[A-Z]', '', x)  # only counts ASCII capitals (English)
    return count

def num_punctuation(x):
    _, count = re.subn(r'\W', '', x)
    return count
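As the comment notes, the `[A-Z]` pattern only counts English capitals. A Unicode-aware variant (a sketch, not part of the original issue) could use `str.isupper()` instead of a regex:

```python
def num_capitals_unicode(x):
    # counts uppercase letters in any script, not only A-Z
    return sum(1 for ch in x if ch.isupper())
```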

df['Capitals'] = df['Message'].apply(num_capitals)
df['Punctuation'] = df['Message'].apply(num_punctuation)
df['Length'] = df['Message'].apply(message_length)
df.describe()

Print out of the df

Spam Capitals Punctuation Length
count 5574.000000 5574.000000 5574.000000 5574.000000
mean 0.134015 5.706315 18.942591 80.443488
std 0.340699 11.720229 14.825994 59.841746
min 0.000000 0.000000 0.000000 2.000000
25% 0.000000 1.000000 8.000000 36.000000
50% 0.000000 2.000000 15.000000 61.000000
75% 0.000000 4.000000 27.000000 122.000000
max 1.000000 129.000000 253.000000 910.000000

Now we want to add a new column with tokenized words - we use snlp for this

!pip install stanfordnlp
import stanfordnlp as snlp
snlp.download('en')
en = snlp.Pipeline(lang='en', processors='tokenize')

# `sentence` is any sample message string (not defined in the original snippet)
sentence = spam_dataset[0][1]
tokenized = en(sentence)
len(tokenized.sentences)

for snt in tokenized.sentences:
    for word in snt.tokens:
        print(word.text)
    print("")

en = snlp.Pipeline(lang='en')
print(en)

## Function which does not work with float
def word_counts(x, pipeline=en):
    doc = pipeline(x)
    count = sum([len(sentence.tokens) for sentence in doc.sentences])
    return count

train['Words'] = train['Message'].apply(word_counts)
test['Words'] = test['Message'].apply(word_counts)
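`train` and `test` are not defined in the snippet above; presumably they come from a split of `df`. A minimal sketch of such a split (the exact ratio and method used in the book are assumptions here):

```python
import pandas as pd

# toy frame standing in for the spam DataFrame built above
df = pd.DataFrame({'Spam': [0] * 8 + [1] * 2,
                   'Message': ['msg %d' % i for i in range(10)]})

# simple random 80/20 split; the ratio is an assumption, the issue does not show it
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)
```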

ISSUE - Error print out

/usr/local/lib/python3.7/dist-packages/stanfordnlp/models/depparse/model.py:157: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at /pytorch/aten/src/ATen/native/cuda/LegacyDefinitions.cpp:28.)
unlabeled_scores.masked_fill_(diag, -float('inf'))

RuntimeError Traceback (most recent call last)
in ()
4 # unlabeled_scores.masked_fill_(diag, -float('inf'))
5
----> 6 train['Words'] = train['Message'].apply(word_counts)
7 test['Words'] = test['Message'].apply(word_counts)
8

7 frames
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

/usr/local/lib/python3.7/dist-packages/stanfordnlp/models/common/seq2seq_model.py in update_state(states, idx, positions, beam_size)
191 br, d = e.size()
192 s = e.contiguous().view(beam_size, br // beam_size, d)[:,idx]
--> 193 s.data.copy_(s.data.index_select(0, positions))
194
195 # (3) main loop

RuntimeError: "index_select_out_cuda_impl" not implemented for 'Float'
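The error suggests `index_select` is being handed a float-valued index tensor (`positions` in `seq2seq_model.py`), while PyTorch requires an integer index. A minimal reproduction of the dtype requirement, with casting to `long` as a possible local workaround (switching to the maintained stanza library is the recommended fix, since stanfordnlp is deprecated):

```python
import torch

x = torch.randn(4, 3)
positions = torch.tensor([0.0, 2.0])   # float positions, as in the failing call

try:
    x.index_select(0, positions)       # fails: index must be an integer dtype
except RuntimeError as e:
    print(e)

s = x.index_select(0, positions.long())  # casting the index to long works
print(s.shape)
```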

Expected behavior
The line train['Words'] = train['Message'].apply(word_counts) should add a column named 'Words' containing the result of applying word_counts to each message.

Spam Capitals Punctuation Length Words

Environment (please complete the following information):

  • OS: [Windows]
  • Python version: [Python 3.6.9 - using Google Colab]
  • StanfordNLP version: [0.2.0]

Additional context
Using the examples from the book Advanced Natural Language Processing with TensorFlow 2 by Ashish Bansal