goodmami / pe

Fastest general-purpose parsing library for Python with a familiar API

Difference with machine/packrat parsers and captures

goodmami opened this issue

Without captures, a repetition operator has the same behavior with the packrat and machine parsers:

>>> import pe
>>> pe.match('"a"+', 'aaa', parser="packrat")
<Match object; span=(0, 3), match='aaa'>
>>> pe.match('"a"+', 'aaa', parser="machine")
<Match object; span=(0, 3), match='aaa'>

But when captures are used, they differ:

>>> pe.match('~"a"+', 'aaa', parser="packrat")
<Match object; span=(0, 3), match='aaa'>
>>> pe.match('~"a"+', 'aaa', parser="machine")
<Match object; span=(0, 1), match='a'>

This issue only affects the Cython machine parser. The cause is that repetitions with a non-zero minimum number of occurrences reuse the same parsing instruction objects for each occurrence (below, pis is a list of parsing instructions):

pe/pe/_cy_machine.pyx

Lines 405 to 408 in 4167657

return [*(pis * mincount),
        Instruction(BRANCH, len(pis) + 2),
        *pis,
        Instruction(UPDATE, -len(pis))]

When the capture modifies an instruction to mark the start of a capture, the change shows up at every position where that same object appears, so every occurrence is marked as a capture start. The Python machine parser is not susceptible to this bug because it uses immutable tuples instead of Instruction objects (at least for now).
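The aliasing described above can be reproduced outside of pe with a minimal sketch. Instr, its marking field, and the fresh helper are hypothetical stand-ins for illustration, not pe's actual Instruction class or API, and copying the instructions is only one possible way to avoid the sharing, not necessarily the fix adopted in pe.

```python
from dataclasses import dataclass, replace


@dataclass
class Instr:
    """Hypothetical stand-in for a mutable parsing instruction (not pe's class)."""
    opcode: str
    marking: bool = False  # assumed flag meaning "a capture starts here"


# List multiplication and unpacking copy references, not objects,
# so all three slots below point at the same Instr instance.
pis = [Instr("LITERAL")]
shared = [*(pis * 2), *pis]
shared[0].marking = True  # intended to mark only the first occurrence
shared_flags = [i.marking for i in shared]


def fresh(instrs):
    """Return a list of new Instr objects with the same field values."""
    return [replace(i) for i in instrs]


# Building each occurrence from its own copies keeps the mark local.
pis2 = [Instr("LITERAL")]
copied = [*fresh(pis2), *fresh(pis2), *fresh(pis2)]
copied[0].marking = True
copied_flags = [i.marking for i in copied]

print(shared_flags)
print(copied_flags)
```

With shared objects, every slot reports the mark; with per-occurrence copies, only the first one does. Immutable tuples sidestep the problem differently: they cannot be modified in place, so marking must build a new value.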