kylebgorman / ptbtok

Penn Treebank tokenizer with no dependencies

Penn Treebank tokenizer

This is a simple fork of the famous Penn Treebank tokenizer. It was taken from DetectorMorse, which in turn adapted it from NLTK.

  • It is appropriate for English, but not other languages.
  • It is appropriate when applied one sentence at a time, but should not be applied to paragraphs or documents.

Unlike the NLTK equivalent, it has no library or data dependencies beyond Python's built-in re module, and it is not hostilely polymorphic.
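To illustrate the general approach, here is a minimal sketch of a Treebank-style tokenizer that uses only re. It is not the code from this repository: the function name sketch_tokenize and the rule list are hypothetical, and the rules cover only a small part of the Penn Treebank conventions. It also shows why input should be one sentence at a time, since the sketch (like Treebank-style tokenizers generally) splits off a period only at the end of the string.

```python
import re

# Hypothetical, heavily simplified rule list: (pattern, replacement) pairs
# applied in order. The real tokenizer has many more rules.
_RULES = [
    # Put spaces around most punctuation marks.
    (re.compile(r'([,;:@#$%&?!"()\[\]{}])'), r" \1 "),
    # Split off a period only when it ends the string (the sentence).
    (re.compile(r"\.\s*$"), " ."),
    # Split the negative clitic: "didn't" -> "did n't".
    (re.compile(r"(?i)(\w)(n't)\b"), r"\1 \2"),
    # Split other common English clitics: "It's" -> "It 's".
    (re.compile(r"(?i)(\w)('s|'m|'d|'ll|'re|'ve)\b"), r"\1 \2"),
]


def sketch_tokenize(sentence):
    """Return a list of tokens for a single English sentence."""
    text = sentence
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text.split()


print(sketch_tokenize("They didn't go, did they?"))
# ['They', 'did', "n't", 'go', ',', 'did', 'they', '?']

# Two sentences at once: the first period stays glued to "rained".
print(sketch_tokenize("It rained. We left."))
# ['It', 'rained.', 'We', 'left', '.']
```

Because the function always takes and returns plain built-in types (a str in, a list of str out), there is nothing to configure and no data files to download.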

About

License: Apache License 2.0

Language: Python 100.0%