Difference with machine/packrat parsers and captures
goodmami opened this issue · comments
Without captures, a repetition operator has the same behavior with the packrat and machine parsers:
>>> import pe
>>> pe.match('"a"+', 'aaa', parser="packrat")
<Match object; span=(0, 3), match='aaa'>
>>> pe.match('"a"+', 'aaa', parser="machine")
<Match object; span=(0, 3), match='aaa'>
But when captures are used, they differ:
>>> pe.match('~"a"+', 'aaa', parser="packrat")
<Match object; span=(0, 3), match='aaa'>
>>> pe.match('~"a"+', 'aaa', parser="machine")
<Match object; span=(0, 1), match='a'>
This issue only affects the Cython machine parser. The cause is that repetitions with a non-zero minimum number of occurrences reuse the same parsing instruction object for every occurrence (below, pis is a list of parsing instructions):
Lines 405 to 408 in 4167657
When the capture modifies the instruction to mark the start of a capture, it marks every occurrence, because they are all the same object. The Python machine parser is not susceptible to this bug because it uses immutable tuples instead of Instruction objects (at least for now).
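To illustrate the aliasing described above, here is a minimal sketch in plain Python. It is not pe's actual code: the `Instruction` and `TupleInstruction` classes, their fields, and the helper functions are hypothetical stand-ins, and the sketch only assumes that the repetition is built by repeating one instruction and that a capture flags the first one.

```python
import copy
from dataclasses import dataclass
from typing import NamedTuple


@dataclass
class Instruction:
    """Hypothetical mutable instruction (stand-in, not pe's class)."""
    opcode: str
    arg: str
    marking: bool = False  # True if a capture starts at this instruction


class TupleInstruction(NamedTuple):
    """Hypothetical immutable instruction, like a plain tuple."""
    opcode: str
    arg: str
    marking: bool = False


def repeat_shared(pi, n):
    # List repetition stores n references to the SAME object.
    return [pi] * n


def repeat_copied(pi, n):
    # One independent copy per required occurrence.
    return [copy.copy(pi) for _ in range(n)]


def mark_capture_start(pis):
    # Mutates the first instruction in place.
    pis[0].marking = True
    return pis


shared = mark_capture_start(repeat_shared(Instruction("LITERAL", "a"), 3))
copied = mark_capture_start(repeat_copied(Instruction("LITERAL", "a"), 3))

# With immutable tuples, "marking" means replacing one list slot with a
# new tuple, so the other slots are unaffected.
tuples = [TupleInstruction("LITERAL", "a")] * 3
tuples[0] = tuples[0]._replace(marking=True)

shared_marks = [pi.marking for pi in shared]
copied_marks = [pi.marking for pi in copied]
tuple_marks = [pi.marking for pi in tuples]
print(shared_marks, copied_marks, tuple_marks)
```

In this sketch the shared list ends up with every entry marked, while the copied and tuple-based lists have only the first entry marked. Copying the instruction per occurrence is one possible direction for a fix, though whether it suits the Cython parser's data structures is not something this sketch establishes.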