Wrong offset with nonword-prefix
Lingepumpe opened this issue
Lingepumpe commented
Hi,
when I run:
>>> list(syntok.tokenize('..A'))
[<Token '' : '.' @ 0>, <Token '' : '.' @ 0>, <Token '' : 'A' @ 2>]
Here the first two tokens have the same offset (0), whereas I would expect the second '.' to be at offset 1. As I understand offsets, this is not the intended behavior.
The problem can be fixed by adding "+i" in tokenizer.py:197, making the line:
yield Token("", c, mo.start()+i)
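To illustrate why the "+i" matters, here is a minimal, self-contained sketch. It is not the actual syntok source: the `Token` tuple, the `PREFIX` pattern, and the `split_prefix` function are hypothetical stand-ins that only mimic the shape of the line quoted above, where each character of a matched non-word prefix is yielded as its own token.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for a token: spacing, value, offset."""
    spacing: str
    value: str
    offset: int


# Assumed pattern: a run of characters that are neither word chars nor whitespace.
PREFIX = re.compile(r"[^\w\s]+")


def split_prefix(text: str, fixed: bool = True) -> Iterator[Token]:
    """Yield one token per character of the first non-word run in text."""
    mo = PREFIX.search(text)
    if mo is None:
        return
    for i, c in enumerate(mo.group(0)):
        # Without "+ i", every character reports the start of the whole match.
        offset = mo.start() + i if fixed else mo.start()
        yield Token("", c, offset)


buggy = [t.offset for t in split_prefix("..A", fixed=False)]
patched = [t.offset for t in split_prefix("..A", fixed=True)]
print(buggy, patched)
```

With `fixed=False` both dots report the offset of the match start; with `fixed=True` each dot reports its own position in the input.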
Florian Leitner commented
Indeed; thanks for catching that naughty little bug! Will push a fix shortly.
Florian Leitner commented
Fixed with 6feb04c and in release v1.2.2
Thank you for reporting, and even more for tracking down the core issue!
That helped massively in closing this ticket ASAP.