bplank / multilingualtokenizer

A trivial punctuation-based sentence splitter and tokenizer for multilingual data.

multilingualtokenizer

A trivial punctuation-based sentence splitter and tokenizer for multilingual data.

Requires Python 3 and the regex package. Install the package with pip install regex or conda install regex.

Usage:

python trivialssplitter.py FILE > OUTPUT.s
python tinytokenizer.py FILE > OUTPUT.s
python tinytokenizer.py --conll FILE > OUTPUT.t

The --conll option outputs one token per line. By default, the output has one sentence per line.
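
To illustrate what punctuation-based splitting and tokenizing can look like, here is a minimal sketch. It is not the code from trivialssplitter.py or tinytokenizer.py: the function names (split_sentences, tokenize, format_output) and the patterns are hypothetical, it uses the standard-library re module instead of the regex package so that it runs without extra dependencies, and the blank line between sentences in the one-token-per-line mode is an assumption borrowed from the usual CoNLL convention.

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
# The real scripts may use different, richer rules.
SENT_END = re.compile(r"(?<=[.!?])\s+")

# A token is a run of word characters (Unicode-aware in Python 3)
# or a single character that is neither a word character nor whitespace.
TOKEN = re.compile(r"\w+|[^\w\s]")


def split_sentences(text):
    """Split text into sentences at whitespace that follows end punctuation."""
    return [s.strip() for s in SENT_END.split(text) if s.strip()]


def tokenize(sentence):
    """Separate words from punctuation marks."""
    return TOKEN.findall(sentence)


def format_output(text, conll=False):
    """Return one sentence per line, or one token per line if conll is True."""
    sentences = [tokenize(s) for s in split_sentences(text)]
    if conll:
        # Assumed layout: one token per line, blank line between sentences.
        return "\n\n".join("\n".join(tokens) for tokens in sentences)
    return "\n".join(" ".join(tokens) for tokens in sentences)


if __name__ == "__main__":
    sample = "Hi there. Ok!"
    print(format_output(sample))
    print()
    print(format_output(sample, conll=True))
```

A purely punctuation-based rule like this one splits after abbreviations such as "Dr." and breaks contractions at the apostrophe, which is the trade-off of keeping the approach trivial and language-independent.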

License: MIT License


Language: Python 100.0%