Error in chapter 8 code

Question

opened this issue 8 years ago · comments

def tokenizer(text):
    return text.split()

from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

from nltk.corpus import stopwords
stop = stopwords.words('english')

X_train = df.loc[:25000,'review'].values
y_train = df.loc[:25000,'sentiment'].values
X_test = df.loc[25000:,'review'].values
y_test = df.loc[25000:,'sentiment'].values
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
if Version(sklearn_version) < '0.18':
    from sklearn.grid_search import GridSearchCV
else:
    from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)
gs_lr_tfidf.fit(X_train,y_train)

Hi,
I get an error saying "can't get attribute tokenizer_porter".
What do you think the problem is?

Sebastian Raschka · Answer 1 · Mon Oct 03 2016 23:13:54 GMT+0800 (China Standard Time)

Hi there,
I just tried to run this code on the movie data after prepending

import pandas as pd
df = pd.read_csv('./movie_data.csv')

and it works fine for me (sklearn 0.17 and sklearn 0.18).
So, is this the exact code you are running? Could you also share the exact error message, e.g., to see on which line it occurs?

dataxerik · Answer 2 · Tue Oct 18 2016 00:40:06 GMT+0800 (China Standard Time)

I'm running into the same issue, I believe. I tried running the code below, but I'm getting "can't get attribute" errors for both 'tokenizer_porter' and 'tokenizer'.

Here's my version of python and sklearn respectively,

'3.5.2 |Anaconda 4.1.1 (32-bit)| (default, Jul 5 2016, 11:45:57) [MSC v.1900 32 bit (Intel)]'
'0.17.1'

Here's a screenshot of one of the stack traces, followed by the code I ran:

import pandas as pd
df = pd.read_csv('./movie_data.csv')

def tokenizer(text):
    return text.split()

from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

from nltk.corpus import stopwords
stop = stopwords.words('english')

X_train = df.loc[:25000,'review'].values
y_train = df.loc[:25000,'sentiment'].values
X_test = df.loc[25000:,'review'].values
y_test = df.loc[25000:,'sentiment'].values
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
if Version(sklearn_version) < '0.18':
    from sklearn.grid_search import GridSearchCV
else:
    from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)
gs_lr_tfidf.fit(X_train,y_train)

Sebastian Raschka · Answer 3 · Tue Oct 18 2016 00:53:46 GMT+0800 (China Standard Time)

Thanks, I will take a look at it!

Sebastian Raschka · Answer 4 · Tue Oct 18 2016 01:01:57 GMT+0800 (China Standard Time)

I just ran this example in scikit-learn 0.17.1 and don't have any issues (see screenshot). I am wondering if this has something to do with Windows & multiprocessing (heard it's a bit trickier on that platform). Unfortunately, I don't own a copy of Windows and only have Mac and Linux machines to test it on.

1. Could you try this again with n_jobs=1? I would also pick a smaller dataset so that it's faster, maybe df = pd.read_csv('./movie_data.csv', nrows=1000)
2. Could you try running this via sklearn 0.18? Maybe the bug was already fixed. If not, we should file it (or a simpler version) as a bug on scikit-learn's issue tracker.
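
A note on why Windows and multiprocessing could produce exactly this message: pickle sends a function to a worker as a reference (module name plus function name), and the worker has to look that name up again. On Windows, workers are started as fresh processes, so a function defined interactively may not exist there. The snippet below is a minimal sketch of that lookup failure using only the standard library. The module name demo_session is a hypothetical stand-in for the interactive session, and deleting the attribute only simulates a fresh worker; it is not how scikit-learn itself dispatches jobs.

```python
import pickle
import sys
import types


def tokenizer(text):
    return text.split()


# Hypothetical stand-in for the interactive session's module
# (the name "demo_session" is an assumption for illustration).
session = types.ModuleType("demo_session")
session.tokenizer = tokenizer
tokenizer.__module__ = "demo_session"
sys.modules["demo_session"] = session

# Pickle stores a reference to the function, not its code.
payload = pickle.dumps(tokenizer)
stores_reference_only = b"demo_session" in payload and b"split" not in payload

# Simulate a freshly started worker in which the defining code never ran.
del session.tokenizer
try:
    pickle.loads(payload)
    failure = None
except AttributeError as exc:
    failure = exc
finally:
    session.tokenizer = tokenizer  # restore the function for later use

print(type(failure).__name__, failure)
```

If this is indeed the cause, it would also explain why n_jobs=1 avoids the error: nothing has to be pickled and looked up in another process.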

dataxerik · Answer 5 · Tue Oct 18 2016 05:13:44 GMT+0800 (China Standard Time)

I've tried option 1 with 1000 rows and with the complete dataset, and that seems to work (see screenshots below). I'll try running via sklearn 0.18 later today and see if the issue is reproducible.

Sebastian Raschka · Answer 6 · Tue Oct 18 2016 06:11:44 GMT+0800 (China Standard Time)

Glad to hear that it works at least! If I understand correctly, changing n_jobs=-1 to n_jobs=1 made it work? Could you maybe try it on a different dataset/simplified setup? E.g., try both n_jobs=-1 and n_jobs=1 in the following (works fine with both on my machine):

from sklearn.datasets import fetch_20newsgroups
from nltk.stem.porter import PorterStemmer
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
if Version(sklearn_version) < '0.18':
    from sklearn.grid_search import GridSearchCV
else:
    from sklearn.model_selection import GridSearchCV

porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

twenty_train = fetch_20newsgroups(subset='train',
    categories=categories, shuffle=True, random_state=42)

X, y = twenty_train.data[:300], twenty_train.target[:300]

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])

param_grid = {'vect__ngram_range': [(1, 1)],
              'vect__tokenizer': [tokenizer_porter],
              'clf__C': [1.0, 100.0]}


gs_lr_tfidf = GridSearchCV(lr_tfidf, 
                           param_grid,
                           scoring='accuracy',
                           cv=3,
                           verbose=1,
                           n_jobs=1)

gs_lr_tfidf.fit(X, y)

Sebastian Raschka · Answer 7 · Mon Dec 05 2016 12:13:34 GMT+0800 (China Standard Time)

I am closing this now in the hope that you got it to work okay! Otherwise, please let me know, and I'd be more than happy to reopen this issue to discuss the problem further.

osmanmeer · Answer 8 · Fri Mar 24 2017 09:54:04 GMT+0800 (China Standard Time)

I'm also having the exact same issue and, to your point, I'm using Windows 10... I'll see what I can do about installing Linux. Thanks for pointing that out!

Sebastian Raschka · Answer 9 · Fri Mar 24 2017 10:42:03 GMT+0800 (China Standard Time)

Sorry to hear that there still seems to be an issue with recent Windows versions. Does it work okay if you set n_jobs=1?
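
If you would rather keep n_jobs=-1 on Windows, a commonly suggested workaround for this kind of pickling error is to move the tokenizer functions out of the notebook or script and into a small module file that worker processes can import by name. The sketch below illustrates only the idea with the standard library: the file name text_tokenizers.py is a hypothetical choice, the file is written to a temporary directory just to keep the example self-contained, and dropping the module from sys.modules merely simulates a fresh worker. I have not tested this on Windows.

```python
import importlib
import pathlib
import pickle
import sys
import tempfile

# Write the tokenizer into its own module file (hypothetical name).
module_dir = tempfile.mkdtemp()
pathlib.Path(module_dir, "text_tokenizers.py").write_text(
    "def tokenizer(text):\n"
    "    return text.split()\n"
)
sys.path.insert(0, module_dir)
importlib.invalidate_caches()

text_tokenizers = importlib.import_module("text_tokenizers")

# The pickled payload refers to text_tokenizers.tokenizer by name.
payload = pickle.dumps(text_tokenizers.tokenizer)

# Simulate a fresh worker: forget the module, then unpickle.
# pickle re-imports text_tokenizers from disk to resolve the name.
del sys.modules["text_tokenizers"]
restored = pickle.loads(payload)

tokens = restored("this movie was great")
print(tokens)
```

In the grid search code, this would mean importing tokenizer and tokenizer_porter from such a module instead of defining them inline, and passing the imported functions in param_grid as before.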

osmanmeer · Answer 10 · Mon Mar 27 2017 03:47:31 GMT+0800 (China Standard Time)

Thanks for the response. I just installed Ubuntu in VirtualBox and it worked fine after that!

I have been running into issues with Windows for a while now while learning AI, so I'm glad I finally installed Linux. One issue with using VirtualBox, though, is that graphics are very choppy, so running problems like cart-pole did not look good.

Sebastian Raschka · Answer 11 · Mon Apr 10 2017 00:07:35 GMT+0800 (China Standard Time)

I am glad to hear that everything's working fine on Linux now. I have never experimented with this since my machines run only Linux or macOS, but I think you can have both Linux and Windows installed on the same machine in a dual-boot setup.