ashinn / irregex

Portable Efficient IrRegular Expressions for Scheme

irregex-match-substring using subpattern under kleene star can return extra chars

jbclements opened this issue

In the following example, which uses SRE syntax, a named sub-pattern, and a Kleene star (I think the or might be necessary too), the string returned by irregex-match-substring appears to run from the beginning of the first submatch to the end of the last one, including all the characters in between.

pajaro2:/tmp/irregex (git)-[master]- clements> scheme
Chez Scheme Version 9.5.4
Copyright 1984-2020 Cisco Systems, Inc.

> (import (irregex))
> (define subpat2 '(=> subby (: "@" alphanum)))
> (define pat2 `(* (or alphanum ,subpat2)))
> (define match2 (irregex-match pat2 "oeh@2tu@2n342"))
> (irregex-match-names match2)
((subby . 1))
> (irregex-match-num-submatches match2)
1
> (irregex-match-substring match2 1)
"@2tu@2"

This should follow the leftmost-longest rule, i.e. the first match ("@2" here) should be returned. It's not very intuitive to pick out a single instance of the matches in a Kleene star, which is why I suspect this hasn't come up before.

Note it works correctly in the backtracking path so is specific to the tNFA construction. The tNFA should preserve the first end position after the submatch has matched once. @sjamaan might be the better person to look at this.

I will look into this when I find some quiet time

Thanks! And I don't mean to put you on the spot - I'll look at it eventually if it remains unfixed :)

Of course, no worries! I have to get started with $DAYJOB now, but after some quick checking, I've discovered:

a) if I disable command reordering (i.e., I comment out the second argument to the or in find-reorder-commands), the code works perfectly. As the comments say, I expected bugs in that code. Now I just need to re-grok how it works and is supposed to work ;)
b) Either way, the DFA looks quite large. This probably has to do with the way the tag and memory slots operate, but it's a bit strange to watch a relatively small NFA explode into a much bigger DFA.

Ah, it turns out the bug is in the reordering commands themselves: they must be executed in such a way that swaps work. For example, if two states are identical except that memory slots 0 and 1 are swapped, the code emitted commands like this:

p[0] = p[1]
p[1] = p[0]

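Of course this won't work: by the time the second command runs, slot 0 has already been overwritten, so both slots end up holding the old value of slot 1. We need to read all slots first and only then write them. I have committed a trivial patch for it which captures all the old values in a closure before executing any of the updates. It's not pretty but should work fine. Please have a look at the latest version!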
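For anyone following along, here is a minimal sketch of the read-then-write idea described above. It is not the actual irregex patch: the command representation (a list of `(dst . src)` pairs) and the procedure names `apply-commands-sequentially!` and `apply-commands-simultaneously!` are made up purely for illustration.

```scheme
;; Hypothetical illustration only, not irregex's real data structures.
;; A command (dst . src) means "copy slot src into slot dst".

;; Buggy variant: each copy is applied immediately, so a later command
;; may read a slot that an earlier command has already overwritten.
(define (apply-commands-sequentially! p commands)
  (for-each (lambda (cmd)
              (vector-set! p (car cmd) (vector-ref p (cdr cmd))))
            commands))

;; Fixed variant: read every source value first, then do all the writes.
(define (apply-commands-simultaneously! p commands)
  (let ((old-values (map (lambda (cmd) (vector-ref p (cdr cmd)))
                         commands)))
    (for-each (lambda (cmd val) (vector-set! p (car cmd) val))
              commands
              old-values)))

;; The swap from the example: p[0] = p[1], p[1] = p[0]
(define swap '((0 . 1) (1 . 0)))

(define p1 (vector 'a 'b))
(apply-commands-sequentially! p1 swap)
;; p1 is now #(b b): the original value of slot 0 is lost

(define p2 (vector 'a 'b))
(apply-commands-simultaneously! p2 swap)
;; p2 is now #(b a): the slots are swapped as intended
```

Snapshotting the old values costs one extra list per transition. A cleverer ordering with a single temporary slot per cycle would avoid that, but the snapshot approach is much harder to get wrong.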