amueller / introduction_to_ml_with_python

Notebooks and code for the book "Introduction to Machine Learning with Python"

Tokenizer attribute .tokens_from_list deprecated

fishcakebaker opened this issue

commented

The tokeniser attribute .tokens_from_list has been deprecated in SpaCy.

This is used in Chapter 7, Section 7.8 "Advanced Tokenisation, Stemming and Lemmatization" in block In[39].

I'm using spaCy version 3.0.6, which I'm guessing is several versions newer than the one used in the book; I just can't find the version mentioned anywhere in my copy.

Any suggestions for working around this function? I'm a bit of a newbie, and my online searches have led down rabbit holes so far.
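
commented

For context, in spaCy v2 the tokenizer's tokens_from_list simply turned an already-tokenized list of strings into a Doc. In spaCy 3 the rough equivalent is building the Doc yourself (a minimal sketch, assuming en_nlp is a loaded pipeline):

from spacy.tokens import Doc

tokens = ["this", "text", "is", "already", "tokenized"]
doc = Doc(en_nlp.vocab, words=tokens)  # roughly what tokens_from_list used to return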

Instead of using old_tokenizer.tokens_from_list, you can substitute any custom tokenizer that does the input -> Doc conversion with the right vocab and assign it to nlp.tokenizer:

from typing import List, Union

from spacy.tokens import Doc
from spacy.vocab import Vocab

class _PretokenizedTokenizer:
    """Custom tokenizer to be used in spaCy when the text is already pretokenized."""

    def __init__(self, vocab: Vocab):
        """Initialize the tokenizer with an existing vocabulary.

        :param vocab: an existing vocabulary, e.g. nlp.vocab
        """
        self.vocab = vocab

    def __call__(self, inp: Union[List[str], str]) -> Doc:
        """Call the tokenizer on input inp.

        :param inp: either a string to be split on whitespace, or a list of tokens
        :return: the created Doc object
        """
        if isinstance(inp, str):
            words = inp.split()
            # every token is followed by a space except (possibly) the last one
            spaces = [True] * len(words)
            if words and not inp[-1].isspace():
                spaces[-1] = False
            return Doc(self.vocab, words=words, spaces=spaces)
        elif isinstance(inp, list):
            return Doc(self.vocab, words=inp)
        else:
            raise ValueError("Unexpected input format. Expected string to be split "
                             "on whitespace, or list of tokens.")

The tokenizer accepts either a plain string (which it splits on whitespace) or an already-tokenized list of strings.
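
A rough sketch of how this could be wired into the chapter's setup; en_core_web_sm, the variable names, and the regexp (the CountVectorizer token pattern) are my assumptions, not taken verbatim from the book:

import re
import spacy
from sklearn.feature_extraction.text import CountVectorizer

en_nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
en_nlp.tokenizer = _PretokenizedTokenizer(en_nlp.vocab)

regexp = re.compile(r'(?u)\b\w\w+\b')

def custom_tokenizer(document):
    # tokenize with the CountVectorizer regexp, then feed the tokens back to
    # spaCy as a whitespace-joined string so nlp() receives a plain str and
    # the remaining pipeline components (tagger, lemmatizer) can run
    doc_spacy = en_nlp(" ".join(regexp.findall(document)))
    return [token.lemma_ for token in doc_spacy]

lemma_vect = CountVectorizer(tokenizer=custom_tokenizer)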

commented

Probably similar to @Tanvi09Garg's approach, here is what works for me:

import re
import spacy
from spacy.tokens import Doc

# regexp used in CountVectorizer
# (?u) sets unicode flag, i.e. patterns are unicode
# \\b word boundary: the end of a word is indicated by whitespace or a non-alphanumeric character
# \\w alphanumeric: [0-9a-zA-Z_]

class RegexTokenizer:
    """Spacy custom tokenizer
        Reference https://spacy.io/usage/linguistic-features#custom-tokenizer
    """
    def __init__(self, vocab, regex_pattern='(?u)\\b\\w\\w+\\b'):
        self.vocab = vocab
        self.regexp = re.compile(regex_pattern)

    def __call__(self, text):
        words = self.regexp.findall(text)
        spaces = [True] * len(words)
        if spaces:  # guard against documents with no regexp matches
            spaces[-1] = False  # no space after the last word

        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
nlp.tokenizer = RegexTokenizer(nlp.vocab)

def custom_tokenizer(document):
    doc_spacy = nlp(document)
    return [token.lemma_ for token in doc_spacy]

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(tokenizer=custom_tokenizer)
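
For reference, fitting it then looks the same as in the book (text_train here is assumed to be the list of raw training documents, e.g. the IMDb reviews from the chapter):

X_lemma = vect.fit_transform(text_train)
print("X_lemma.shape:", X_lemma.shape)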

It runs a bit slowly; any suggestions to speed this up?
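
One idea I haven't fully benchmarked (a sketch, again assuming text_train holds the raw documents): lemmatize everything in a single pass with nlp.pipe, which batches documents inside spaCy, and hand the pre-lemmatized token lists to CountVectorizer instead of calling nlp() once per document:

from sklearn.feature_extraction.text import CountVectorizer

# one batched pass over all documents instead of one nlp() call per document
lemmatized = [[token.lemma_ for token in doc]
              for doc in nlp.pipe(text_train, batch_size=50)]

# the documents are already lists of tokens, so pass them through unchanged
vect_fast = CountVectorizer(analyzer=lambda tokens: tokens)
X_fast = vect_fast.fit_transform(lemmatized)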