joelthe1 / ulf-tokenizer

Tokenizer developed by Ulf Hermjakob @ USC ISI

Ulf's Tokenizer

A tokenizer developed by Ulf Hermjakob @ USC ISI (hence the name "Ulf's tokenizer").

Usage

For English or other Latin-script text:

cat input.txt | ulf-eng-tok.sh > input.tok.txt

For non-Latin scripts:

cat input.txt | ulf-src-tok.sh > input.tok.txt

Python API

The Python wrapper runs the tokenizer in a subprocess and communicates with it over stdin and stdout.

Here is how to use it:

# Run `export PYTHONPATH=$PWD` in the shell first so that ulftok can be imported.

from ulftok import tokenize_lines
text = "Hello,... this is a test! Is it good? http://isi.edu"
lines = [text] * 10
for line in tokenize_lines(lines):
    print(line)
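
The subprocess approach described above can be sketched in a few lines. This is only an illustration of the general technique, not the code in `ulftok`: the function name `run_line_filter` and its `cmd` parameter are hypothetical, and the sketch sends all lines in one batch, whereas the real wrapper may keep the process open and stream lines. It assumes the command reads lines on stdin and writes one output line per input line.

```python
import subprocess
import sys


def run_line_filter(lines, cmd):
    """Send `lines` to `cmd` on stdin and return its stdout as a list of lines.

    `cmd` is an argument list for any line-oriented filter. Passing
    ["ulf-eng-tok.sh"] is an assumption: it only works if that script
    is on PATH and executable.
    """
    # Join the input into a single newline-terminated text block.
    payload = "".join(line.rstrip("\n") + "\n" for line in lines)
    result = subprocess.run(
        cmd,
        input=payload,
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Stand-in filter so the sketch runs without the tokenizer installed:
    # it upper-cases each line instead of tokenizing it.
    stand_in = [
        sys.executable,
        "-c",
        "import sys\nfor l in sys.stdin: sys.stdout.write(l.upper())",
    ]
    for line in run_line_filter(["Hello, world!", "Is it good?"], stand_in):
        print(line)
```

To try this against the tokenizer itself, replace `stand_in` with the path to `ulf-eng-tok.sh` or `ulf-src-tok.sh`. Sending everything in one batch keeps the sketch free of pipe-buffer deadlocks, at the cost of holding all input and output in memory.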

About

License: Apache License 2.0


Languages

Perl 60.5%, Perl 6 37.3%, Python 1.6%, Shell 0.6%