This is a simple fork of the famous Penn Treebank tokenizer. It is forked from DetectorMorse via NLTK.
- It is appropriate for English, but not other languages.
- It is appropriate when applied one sentence at a time, but should not be applied to paragraphs or documents.

Unlike the NLTK equivalent, it has no (library or data) dependencies except the built-in `re`. Unlike the NLTK equivalent, it is not hostilely polymorphic.
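
To illustrate why the tokenizer expects one sentence at a time, here is a minimal sketch of Penn Treebank-style tokenization that uses only `re`. It is not this package's actual code: the function name `tokenize_sketch` and the rule list `_RULES` are hypothetical, and the rules are a small, simplified subset chosen for illustration.

```python
import re

# Hypothetical, heavily simplified rules in the spirit of Penn Treebank
# tokenization; a real rule set is considerably longer.
_RULES = [
    # Split off common punctuation marks.
    (re.compile(r'([,;:?!()"])'), r" \1 "),
    # Split only a sentence-final period, so abbreviations such as "Mr."
    # stay intact. This is why the input should be a single sentence.
    (re.compile(r"\.\s*$"), " ."),
    # Split the negative contraction: "don't" -> "do n't".
    (re.compile(r"(?i)(n't)\b"), r" \1"),
    # Split common clitics: "Smith's" -> "Smith 's".
    (re.compile(r"(?i)('s|'re|'ve|'ll|'d|'m)\b"), r" \1"),
]


def tokenize_sketch(sentence):
    """Apply each substitution in order, then split on whitespace."""
    for pattern, replacement in _RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence.split()


print(tokenize_sketch("They don't like Mr. Smith's dog."))
```

Because only the final period is detached, passing a whole paragraph to a function like this would leave every sentence-internal period glued to the preceding word, which is the failure mode the sentence-at-a-time restriction avoids.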