adrianeboyd / custom-cython-tokenizer


A minimal custom Cython spaCy tokenizer

This package demonstrates how to create a registered custom tokenizer that extends the spaCy Tokenizer in Cython for use with spaCy v3.
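The package subclasses spacy.tokenizer.Tokenizer and exposes the result through spaCy's tokenizers registry (and, when installed, through the spacy_tokenizers entry point). As a rough sketch of the registration pattern, written in plain Python rather than Cython and with an illustrative class name that is not necessarily the one used in this package:

import spacy
from spacy.language import Language
from spacy.tokenizer import Tokenizer


class CustomTokenizer(Tokenizer):
    # Extend or override Tokenizer methods here; this package does the
    # equivalent in Cython.
    pass


@spacy.registry.tokenizers("custom_tokenizer.v1")
def create_custom_tokenizer():
    def create_tokenizer(nlp: Language) -> CustomTokenizer:
        return CustomTokenizer(nlp.vocab)

    return create_tokenizer

Registering the factory via the spacy_tokenizers entry point (as this package does) makes it available to spaCy without an explicit import.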

Install

pip install -U pip
pip install .

Or from the repo URL:

pip install -U pip
pip install https://github.com/adrianeboyd/custom-cython-tokenizer/archive/master.zip

Usage

Once this package is installed, the custom tokenizer is registered under the entry point spacy_tokenizers, so you can specify it in your config like this:

[nlp]
tokenizer = {"@tokenizers":"custom_tokenizer.v1"}

Or start from a blank model in Python:

import spacy

nlp = spacy.blank("en", config={"nlp": {"tokenizer": {"@tokenizers": "custom_tokenizer.v1"}}})
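The pipeline's tokenizer is now the registered custom one, so calling nlp on a text uses it directly. For example:

doc = nlp("This is a sentence.")
print([token.text for token in doc])
print(type(nlp.tokenizer))  # the custom tokenizer class provided by this package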

Packaging a pipeline

If your packaged spaCy pipeline requires this package, specify it in meta.json under requirements before calling spacy package:

  "requirements":[
    "custom-tokenizer>=0.0.2,<0.1.0"
  ]
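The generated package then lists this dependency in its install requirements, so the custom tokenizer is available wherever the pipeline package is installed. A typical invocation looks like the following, where the input and output paths are placeholders:

python -m spacy package ./my_pipeline ./packages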

About

License: MIT
