cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation

segments


The segments package provides Unicode Standard tokenization routines and orthography profile segmentation, implementing the linear algorithm described in the orthography profile specification of The Unicode Cookbook (Moran and Cysouw 2018).
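
The package is distributed on PyPI; assuming a standard Python 3 environment, it can be installed with pip:

$ pip install segments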

Command line usage

Create a text file:

$ echo "aäaaöaaüaa" > text.txt

Now create an orthography profile from the text and inspect it:

$ cat text.txt | segments profile
Grapheme        frequency       mapping
a       7       a
ä       1       ä
ü       1       ü
ö       1       ö
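
The frequency column counts how often each grapheme occurs in the default tokenization of the text. As a rough sketch (not what the command literally runs), the same counts can be reproduced with the Python API described below together with collections.Counter:

>>> from collections import Counter
>>> from segments import Tokenizer
>>> Counter(Tokenizer()('aäaaöaaüaa').split())
Counter({'a': 7, 'ä': 1, 'ö': 1, 'ü': 1})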

Write the profile to a file:

$ cat text.txt | segments profile > profile.prf

Edit the profile, adding a grapheme aa with mapping x:

$ more profile.prf
Grapheme        frequency       mapping
aa      0       x
a       7       a
ä       1       ä
ü       1       ü
ö       1       ö

Now tokenize the text without a profile:

$ cat text.txt | segments tokenize
a ä a a ö a a ü a a

And with the profile:

$ cat text.txt | segments --profile=profile.prf tokenize
a ä aa ö aa ü aa

Using the mapping column of the profile instead of the graphemes:

$ cat text.txt | segments --mapping=mapping --profile=profile.prf tokenize
a ä x ö x ü x

API

>>> from segments import Profile, Tokenizer
>>> t = Tokenizer()
>>> t('abcd')
'a b c d'
>>> prf = Profile({'Grapheme': 'ab', 'mapping': 'x'}, {'Grapheme': 'cd', 'mapping': 'y'})
>>> print(prf)
Grapheme	mapping
ab	x
cd	y
>>> t = Tokenizer(profile=prf)
>>> t('abcd')
'ab cd'
>>> t('abcd', column='mapping')
'x y'
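
The command-line example above can be reproduced with the same API. The following is a sketch that builds a profile equivalent to profile.prf (the frequency column is omitted, since only the Grapheme and mapping columns matter here) and tokenizes the example text; the expected output mirrors the CLI results shown earlier, with column='mapping' playing the role of --mapping=mapping:

>>> prf = Profile(
...     {'Grapheme': 'aa', 'mapping': 'x'},
...     {'Grapheme': 'a', 'mapping': 'a'},
...     {'Grapheme': 'ä', 'mapping': 'ä'},
...     {'Grapheme': 'ü', 'mapping': 'ü'},
...     {'Grapheme': 'ö', 'mapping': 'ö'})
>>> t = Tokenizer(profile=prf)
>>> t('aäaaöaaüaa')
'a ä aa ö aa ü aa'
>>> t('aäaaöaaüaa', column='mapping')
'a ä x ö x ü x'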

About


License: Apache License 2.0


Languages

Language: Python 100.0%