cjer/mila_tokenizer

The MILA tokenizer is licensed by MILA (mila@cs.technion.ac.il). I do not own the tokenizer.

The run.py script is a simple, early-stage wrapper that calls the tokenizer from Python and writes YAP-compatible tokenized output: one token per line, with a double line break between sentences.
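The output layout described above can be sketched as follows. This is an illustrative sketch, not the code in run.py: the function name `to_yap_format` is hypothetical, and it assumes the sentences are already available as lists of token strings (extracting them from the tokenizer's XML output is not shown, since this README does not describe that format). It also assumes the last sentence is followed by a blank line, like the others.

```python
def to_yap_format(sentences):
    """Render tokenized sentences as one token per line.

    `sentences` is a list of sentences, each a list of token strings
    (hypothetical input shape). Every sentence, including the last,
    is followed by a blank line (an assumption about the YAP input format).
    """
    blocks = []
    for tokens in sentences:
        # One token per line, then an empty line to close the sentence.
        blocks.append("\n".join(tokens) + "\n\n")
    return "".join(blocks)


if __name__ == "__main__":
    # Two sentences with placeholder tokens.
    print(to_yap_format([["tok1", "tok2", "."], ["tok3", "."]]), end="")
```

Keeping the formatting step separate from the Java call makes it easy to check the output layout without running the tokenizer.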

MILA README
===========
Version 1.8 (20130314)

Using the tokenizer
-------------------

java [-Xmx1024m] -jar tokenizer.jar <input> <output>

Example 1: java -Xmx1024m -jar tokenizer.jar file.txt file.xml
Example 2: java -Xmx1024m -jar tokenizer.jar corpus/ corpus_tokenized/

Note: In directory mode, the output directory must already exist.
      Run "mkdir corpus_tokenized" before Example 2.
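
The invocation above can also be driven from Python. The sketch below is a minimal illustration and not the contents of run.py: the helper names `build_command` and `run_tokenizer` are hypothetical, and it assumes `java` is on the PATH and `tokenizer.jar` is in the working directory. It creates the output directory before a directory-mode run, which covers the note above.

```python
import subprocess
from pathlib import Path


def build_command(input_path, output_path, jar="tokenizer.jar", max_heap="1024m"):
    """Build the argument list for the documented java invocation."""
    return [
        "java",
        f"-Xmx{max_heap}",  # optional heap size flag, as in the examples
        "-jar",
        str(jar),
        str(input_path),
        str(output_path),
    ]


def run_tokenizer(input_path, output_path, jar="tokenizer.jar", max_heap="1024m"):
    """Run the tokenizer on a file or a directory (hypothetical helper)."""
    # Directory mode: the output directory must already exist, so create it.
    if Path(input_path).is_dir():
        Path(output_path).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if java exits with a non-zero status.
    subprocess.run(build_command(input_path, output_path, jar, max_heap), check=True)
```

For example, `run_tokenizer("corpus/", "corpus_tokenized/")` corresponds to Example 2 without a separate mkdir step.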