rasbt / python-machine-learning-book

The "Python Machine Learning (1st edition)" book code repository and info resource

Error in chapter 8 code


def tokenizer(text):
    return text.split()

from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

from nltk.corpus import stopwords
stop = stopwords.words('english')

X_train = df.loc[:25000,'review'].values
y_train = df.loc[:25000,'sentiment'].values
X_test = df.loc[25000:,'review'].values
y_test = df.loc[25000:,'sentiment'].values
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
if Version(sklearn_version) < '0.18':
    from sklearn.grid_search import GridSearchCV
else:
    from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)
gs_lr_tfidf.fit(X_train,y_train)

Hi,
I get an error saying "can't get attribute tokenizer_porter".
What do you think the problem is?

Hi there,
I just tried to run this code on the movie data

after prepending a

import pandas as pd
df = pd.read_csv('./movie_data.csv')

and it works fine for me (sklearn 0.17 and sklearn 0.18).
So, is this the exact code you are running? Could you also share the exact error message, e.g., to see on which line it occurs?

I believe I'm running into the same issue. I tried running the code below, but I'm getting "can't get attribute" errors for both 'tokenizer_porter' and 'tokenizer'.

Here are my versions of Python and sklearn, respectively:

'3.5.2 |Anaconda 4.1.1 (32-bit)| (default, Jul 5 2016, 11:45:57) [MSC v.1900 32 bit (Intel)]'
'0.17.1'

Here's a screenshot of one of the stack traces:

[screenshot: stack trace]

import pandas as pd
df = pd.read_csv('./movie_data.csv')

def tokenizer(text):
    return text.split()

from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

from nltk.corpus import stopwords
stop = stopwords.words('english')

X_train = df.loc[:25000,'review'].values
y_train = df.loc[:25000,'sentiment'].values
X_test = df.loc[25000:,'review'].values
y_test = df.loc[25000:,'sentiment'].values
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
if Version(sklearn_version) < '0.18':
    from sklearn.grid_search import GridSearchCV
else:
    from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)
gs_lr_tfidf.fit(X_train,y_train)

Thanks, I will take a look at it!

I just ran this example in scikit-learn 0.17.1 and didn't have any issues (see screenshot). I am wondering if this has something to do with Windows and multiprocessing (I've heard it's a bit trickier on that platform). Unfortunately, I don't own a copy of Windows and only have Mac and Linux machines to test it on.

[screenshot, 2016-10-17 1:01:02 PM]

  1. Could you try this again with n_jobs=1? I would also pick a smaller dataset so that it's faster, maybe df = pd.read_csv('./movie_data.csv', nrows=1000)

  2. Could you try running this via sklearn 0.18? Maybe the bug was already fixed. If not, we should file it (or a simpler version) as a bug on scikit-learn's issue tracker.
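
For context on why Windows and multiprocessing could matter here: one plausible explanation (not confirmed in this thread) is that Python pickles a function by reference, i.e., by module name and function name, rather than by its code. A worker process that starts from scratch then has to look the name up again and fails if the definition is not there. The minimal sketch below only simulates that lookup inside a single process; the module name `fake_session` is a made-up stand-in for the interactive session, and no scikit-learn or joblib code is involved.

```python
import pickle
import sys
import types

# Hypothetical stand-in for the interactive session's module.
session = types.ModuleType("fake_session")
sys.modules["fake_session"] = session


def tokenizer(text):
    return text.split()


# Make the function look as if it had been defined in the stand-in module.
tokenizer.__module__ = "fake_session"
tokenizer.__qualname__ = "tokenizer"
session.tokenizer = tokenizer

# The pickle stores the module and function names, not the function body.
payload = pickle.dumps(tokenizer)
print(b"fake_session" in payload, b"tokenizer" in payload)

# While the definition is present, the lookup succeeds.
restored = pickle.loads(payload)
print(restored("a b c"))

# Simulate a freshly started worker whose module lacks the definition.
del session.tokenizer
unpickle_failed = False
try:
    pickle.loads(payload)
except AttributeError as exc:
    unpickle_failed = True
    print("lookup failed:", exc)
```

If this is indeed what happens, it would be consistent with n_jobs=1 working, since no function has to be sent to another process in that case.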

I've tried option 1 with 1000 rows and with the complete dataset, and that seems to work (see screenshots below). I'll try running via sklearn 0.18 later today and see if the issue is reproducible.

[two screenshots]

Glad to hear that it works at least! If I understand correctly, changing n_jobs=-1 to n_jobs=1 made it work? Could you maybe try it on a different dataset or a simplified setup? E.g., try both n_jobs=-1 and n_jobs=1 in the following (it works fine with both on my machine):

from sklearn.datasets import fetch_20newsgroups
from nltk.stem.porter import PorterStemmer
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
if Version(sklearn_version) < '0.18':
    from sklearn.grid_search import GridSearchCV
else:
    from sklearn.model_selection import GridSearchCV

porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

twenty_train = fetch_20newsgroups(subset='train',
    categories=categories, shuffle=True, random_state=42)

X, y = twenty_train.data[:300], twenty_train.target[:300]

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])

param_grid = {'vect__ngram_range': [(1, 1)],
              'vect__tokenizer': [tokenizer_porter],
              'clf__C': [1.0, 100.0]}

gs_lr_tfidf = GridSearchCV(lr_tfidf,
                           param_grid,
                           scoring='accuracy',
                           cv=3,
                           verbose=1,
                           n_jobs=1)

gs_lr_tfidf.fit(X, y)

I am closing this now in the hope that you got it to work. Otherwise, please let me know, and I'd be more than happy to reopen this issue to discuss the problem further.

I'm having the exact same issue, and to your point, I'm using Windows 10. I'll see what I can do about installing Linux. Thanks for pointing that out!

Sorry to hear that there still seems to be an issue with recent Windows versions. Does it work okay if you set n_jobs=1?

Thanks for the response. I just installed Ubuntu in VirtualBox and it worked fine after that!

I have been running into issues with Windows for a while now while learning AI, so I'm glad I finally installed Linux. One issue with using VirtualBox is that graphics are very choppy, so running problems like cart-pole did not look good.

I am glad to hear that everything's working fine on Linux now. I have never experimented with this since my machines run only Linux or macOS, but I think you can have both Linux and Windows installed on the same machine via dual boot.
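
For anyone who wants to stay on Windows and keep n_jobs=-1, one workaround that might help (an assumption, not tested in this thread) is to move the tokenizer functions into a small importable module, so that a newly started worker process can import them by name. The sketch below only illustrates the idea: the module name `ch08_tokenizers` is hypothetical, the file is written to a temporary directory so the example is self-contained, and a fresh worker is simulated by removing the module from `sys.modules`.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

MODULE_NAME = "ch08_tokenizers"  # hypothetical module name

# In practice this would be a regular .py file next to the notebook or script.
module_source = '''
def tokenizer(text):
    return text.split()
'''

tmp_dir = tempfile.mkdtemp()
Path(tmp_dir, MODULE_NAME + ".py").write_text(module_source)
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

# Import the tokenizer from the module instead of defining it interactively.
tokenizers = importlib.import_module(MODULE_NAME)
payload = pickle.dumps(tokenizers.tokenizer)

# Simulate a fresh worker process that has not imported the module yet.
del sys.modules[MODULE_NAME]

# Unpickling re-imports the module by name and finds the function again.
restored = pickle.loads(payload)
tokens = restored("this movie was great")
print(tokens)
```

With such a module in place, the grid would reference the imported functions, e.g. 'vect__tokenizer': [tokenizers.tokenizer], instead of functions defined in the interactive session.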