amueller / introduction_to_ml_with_python

Notebooks and code for the book "Introduction to Machine Learning with Python"

Tokenizer attribute .tokens_from_list deprecated

fishcakebaker opened this issue

commented

The tokeniser attribute .tokens_from_list has been deprecated in SpaCy.

This is used in Chapter 7, Section 7.8 "Advanced Tokenisation, Stemming and Lemmatization" in block In[39].

I'm using spaCy version 3.0.6, which I'm guessing is several versions newer than the one used in the book; I just can't find the version mentioned anywhere in my copy.

Any suggestions for working around this function? I'm a bit of a newbie, and my online searches have led down rabbit holes so far.
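
commented

For context, in spaCy v2 the tokenizer's tokens_from_list simply turned an already-tokenized list of strings into a Doc. In spaCy 3 the rough equivalent is building the Doc yourself (a minimal sketch, assuming en_nlp is a loaded pipeline):

from spacy.tokens import Doc

tokens = ["this", "text", "is", "already", "tokenized"]
doc = Doc(en_nlp.vocab, words=tokens)  # roughly what tokens_from_list used to return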

Instead of using old_tokenizer.tokens_from_list, you can substitute any custom tokenizer that does the input -> Doc conversion with the right vocab and assign it to nlp.tokenizer:

from typing import List, Union

from spacy.tokens import Doc
from spacy.vocab import Vocab

class _PretokenizedTokenizer:
    """Custom tokenizer to be used in spaCy when the text is already pretokenized."""

    def __init__(self, vocab: Vocab):
        """Initialize the tokenizer with an existing vocabulary.

        :param vocab: an existing vocabulary, e.g. nlp.vocab
        """
        self.vocab = vocab

    def __call__(self, inp: Union[List[str], str]) -> Doc:
        """Call the tokenizer on input inp.

        :param inp: either a string to be split on whitespace, or a list of tokens
        :return: the created Doc object
        """
        if isinstance(inp, str):
            words = inp.split()
            # every token is followed by a space except (possibly) the last one
            spaces = [True] * len(words)
            if words and not inp[-1].isspace():
                spaces[-1] = False
            return Doc(self.vocab, words=words, spaces=spaces)
        elif isinstance(inp, list):
            return Doc(self.vocab, words=inp)
        else:
            raise ValueError("Unexpected input format. Expected string to be split "
                             "on whitespace, or list of tokens.")

The tokenizer accepts either a plain string (which it splits on whitespace) or an already-tokenized list of strings.
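
A rough sketch of how this could be wired into the chapter's setup; en_core_web_sm, the variable names, and the regexp (the CountVectorizer token pattern) are my assumptions, not taken verbatim from the book:

import re
import spacy
from sklearn.feature_extraction.text import CountVectorizer

en_nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
en_nlp.tokenizer = _PretokenizedTokenizer(en_nlp.vocab)

regexp = re.compile(r'(?u)\b\w\w+\b')

def custom_tokenizer(document):
    # tokenize with the CountVectorizer regexp, then feed the tokens back to
    # spaCy as a whitespace-joined string so nlp() receives a plain str and
    # the remaining pipeline components (tagger, lemmatizer) can run
    doc_spacy = en_nlp(" ".join(regexp.findall(document)))
    return [token.lemma_ for token in doc_spacy]

lemma_vect = CountVectorizer(tokenizer=custom_tokenizer)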

commented

Probably similar to @Tanvi09Garg's approach, here is what works for me:

import re
import spacy
from spacy.tokens import Doc

# regexp used in CountVectorizer
# (?u) sets unicode flag, i.e. patterns are unicode
# \\b word boundary: the end of a word is indicated by whitespace or a non-alphanumeric character
# \\w alphanumeric: [0-9a-zA-Z_]

class RegexTokenizer:
    """Spacy custom tokenizer
        Reference https://spacy.io/usage/linguistic-features#custom-tokenizer
    """
    def __init__(self, vocab, regex_pattern='(?u)\\b\\w\\w+\\b'):
        self.vocab = vocab
        self.regexp = re.compile(regex_pattern)

    def __call__(self, text):
        words = self.regexp.findall(text)
        spaces = [True] * len(words)
        if spaces:  # guard against documents with no regexp matches
            spaces[-1] = False  # no space after the last word

        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
nlp.tokenizer = RegexTokenizer(nlp.vocab)

def custom_tokenizer(document):
    doc_spacy = nlp(document)
    return [token.lemma_ for token in doc_spacy]

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(tokenizer=custom_tokenizer)
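
For reference, fitting it then looks the same as in the book (text_train here is assumed to be the list of raw training documents, e.g. the IMDb reviews from the chapter):

X_lemma = vect.fit_transform(text_train)
print("X_lemma.shape:", X_lemma.shape)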

It runs a bit slowly; any suggestions to speed this up?
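
One idea I haven't fully benchmarked (a sketch, again assuming text_train holds the raw documents): lemmatize everything in a single pass with nlp.pipe, which batches documents inside spaCy, and hand the pre-lemmatized token lists to CountVectorizer instead of calling nlp() once per document:

from sklearn.feature_extraction.text import CountVectorizer

# one batched pass over all documents instead of one nlp() call per document
lemmatized = [[token.lemma_ for token in doc]
              for doc in nlp.pipe(text_train, batch_size=50)]

# the documents are already lists of tokens, so pass them through unchanged
vect_fast = CountVectorizer(analyzer=lambda tokens: tokens)
X_fast = vect_fast.fit_transform(lemmatized)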