fnl / syntok

Text tokenization and sentence segmentation (segtok v2)

Wrong offset with nonword-prefix

Lingepumpe opened this issue · comments

Hi,

when I run:

>>> from syntok.tokenizer import Tokenizer
>>> list(Tokenizer().tokenize('..A'))
[<Token '' : '.' @ 0>, <Token '' : '.' @ 0>, <Token '' : 'A' @ 2>]

Here the first two tokens both report offset 0, although the second '.' actually sits at offset 1. As I understand offsets, this is not the intended behavior.

The problem can be fixed by adding "+ i" to the offset in tokenizer.py:197, so that the line reads:

yield Token("", c, mo.start() + i)
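To illustrate why the loop index matters, here is a minimal, self-contained sketch (an assumption about the shape of the code around tokenizer.py:197, not syntok's actual implementation) that yields one (character, offset) pair per leading nonword character:

import re

def prefix_tokens(text):
    """Yield (character, offset) pairs for each leading nonword character."""
    mo = re.match(r"\W+", text)  # e.g. matches ".." in "..A"
    if mo:
        for i, c in enumerate(mo.group()):
            # Each character sits i positions past the match start; without
            # the "+ i" every character would report offset mo.start().
            yield c, mo.start() + i

print(list(prefix_tokens("..A")))  # [('.', 0), ('.', 1)]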

Indeed; thanks for catching that naughty little bug! I will push a fix shortly.

Fixed with 6feb04c and released in v1.2.2.

Thank you for reporting, and even more for tracking down the core issue!
That helped massively in closing this ticket ASAP.