fullstorydev / uax29

A tokenizer based on Unicode text segmentation (UAX 29), for Go

This package tokenizes words, sentences and graphemes, based on Unicode text segmentation (UAX 29), for Unicode version 13.0.0.

This is a fork of github.com/clipperhouse/uax29/words. Modifications have been made to the words package:

  • A max token length can be passed in. Tokens will be split upon hitting this limit.
  • Separators will be marked, so they can be omitted from the token stream if desired.
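
The two modifications above can be illustrated with a small, self-contained sketch. It does not use this package's API, which is not shown here: the `Token` type, the `IsSeparator` field, and the `tokenize` function are hypothetical names, and the letter/digit grouping is a simplified stand-in for the real UAX 29 word-boundary rules. It only shows the idea of splitting a token when it reaches a maximum length and marking separators so a caller can skip them.

```go
package main

import (
	"fmt"
	"unicode"
)

// Token is a hypothetical type for this sketch; it is not the package's API.
type Token struct {
	Text        string
	IsSeparator bool // true for anything that is not part of a word run
}

// tokenize is a simplified stand-in for UAX 29 word segmentation: it groups
// runs of letters and digits into word tokens and emits every other rune as
// its own separator token. A word run that reaches maxLen runes is split at
// that point. A maxLen of 0 or less disables splitting.
func tokenize(s string, maxLen int) []Token {
	var tokens []Token
	var word []rune

	// flush emits the pending word run, if any, as a non-separator token.
	flush := func() {
		if len(word) > 0 {
			tokens = append(tokens, Token{Text: string(word)})
			word = word[:0]
		}
	}

	for _, r := range s {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			word = append(word, r)
			if maxLen > 0 && len(word) == maxLen {
				flush() // limit reached: split the token here
			}
			continue
		}
		flush()
		tokens = append(tokens, Token{Text: string(r), IsSeparator: true})
	}
	flush()
	return tokens
}

func main() {
	// Print only the non-separator tokens, with a max token length of 5.
	for _, t := range tokenize("Hello, wonderful world", 5) {
		if t.IsSeparator {
			continue // separators are marked, so they are easy to omit
		}
		fmt.Printf("%q\n", t.Text)
	}
}
```

With a limit of 5, the word "wonderful" is emitted as two tokens, and the comma and spaces are dropped by the caller rather than by the tokenizer. Keeping separators in the stream and marking them leaves that choice to the consumer.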

License: MIT License


Languages

Language: Go 100.0%