fnl / syntok

Text tokenization and sentence segmentation (segtok v2)

Wrong offset with nonword-prefix

Lingepumpe opened this issue · comments

Hi,

when I run:

>>> from syntok.tokenizer import Tokenizer
>>> list(Tokenizer().tokenize('..A'))
[<Token '' : '.' @ 0>, <Token '' : '.' @ 0>, <Token '' : 'A' @ 2>]

Here the first two tokens both report offset 0, although the second '.' actually sits at offset 1. As I understand offsets, this is not the intended behavior.

The problem can be fixed by adding "+ i" to the offset in tokenizer.py:197, so that the line reads:

yield Token("", c, mo.start() + i)
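To illustrate why the loop index matters, here is a minimal, self-contained sketch (an assumption about the shape of the code around tokenizer.py:197, not syntok's actual implementation) that yields one (character, offset) pair per leading nonword character:

import re

def prefix_tokens(text):
    """Yield (character, offset) pairs for each leading nonword character."""
    mo = re.match(r"\W+", text)  # e.g. matches ".." in "..A"
    if mo:
        for i, c in enumerate(mo.group()):
            # Each character sits i positions past the match start; without
            # the "+ i" every character would report offset mo.start().
            yield c, mo.start() + i

print(list(prefix_tokens("..A")))  # [('.', 0), ('.', 1)]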

Indeed; thanks for catching that naughty little bug! I will push a fix shortly.

Fixed with 6feb04c and released in v1.2.2.

Thank you for reporting, and even more for tracking down the core issue!
That helped massively in closing this ticket ASAP.