aligrudi / neatvi

A small vi/ex editor for editing bidirectional UTF-8 text

Home Page: http://litcave.rudi.ir/

re_groupcount() is bugged (counting nonsense brackets).

kyx0r opened this issue · comments

commented

Hello @aligrudi, re_groupcount() computes the wrong number of groups in a regex when a bracket expression has brackets inside of it. The regex engine does not support escaped brackets (and it should not, because it's useless either way), though the function didn't even try to check for escapes. In any case, that code is redundant and causes problems, so it can simply be deleted.

The following patch simplifies the function and fixes the bug.

From ecdfce5414ba1c471a48f70f0ac691d7ea9953e5 Mon Sep 17 00:00:00 2001
From: Kyryl Melekhin <k.melekhin@gmail.com>
Date: Sun, 12 Sep 2021 10:09:02 +0000
Subject: [PATCH] fix re_groupcount() to only count groups

---
 rset.c | 21 +++------------------
 1 file changed, 3 insertions(+), 18 deletions(-)

diff --git a/rset.c b/rset.c
index 553b4fd..88672ba 100644
--- a/rset.c
+++ b/rset.c
@@ -15,25 +15,10 @@ struct rset {
 
 static int re_groupcount(char *s)
 {
-	int n = 0;
-	while (*s) {
-		if (s[0] == '(')
+	int n = *s == '(' ? 1 : 0;
+	while (*s++)
+		if (s[0] == '(' && s[-1] != '\\')
 			n++;
-		if (s[0] == '[') {
-			int dep = 0;
-			s += s[1] == '^' ? 3 : 2;
-			while (s[0] && (s[0] != ']' || dep)) {
-				if (s[0] == '[')
-					dep++;
-				if (s[0] == ']')
-					dep--;
-				s++;
-			}
-		}
-		if (s[0] == '\\' && s[1])
-			s++;
-		s++;
-	}
 	return n;
 }
 
-- 
2.33.0

Kind regards,
Kyryl.

commented

Good point about the "[(]". I guess we have to add one more check there then.

The problematic regex that I ran into was this:

([^\t -,.-/:-@[-^{-~]+:).+;

To test, put it into the syntax highlighting rules of any filetype, for example C, and see that everything breaks.

Well, brackets can have nested classes, but they still can't contain groups. And even if they could, neatvi does not implement that, so there is still no reason to parse the bracket expression the way it is parsed currently.
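The "[(]" case can be made concrete with a small sketch. The function below copies the loop from the patch at the top of this issue under the hypothetical name groupcount_simple; the name and the sample patterns are assumptions for illustration, not part of neatvi.

```c
/* Copy of the simplified loop from the patch, renamed for illustration. */
static int groupcount_simple(const char *s)
{
        /* The first character has no predecessor, so it is checked separately. */
        int n = *s == '(' ? 1 : 0;
        while (*s++)
                if (s[0] == '(' && s[-1] != '\\')
                        n++;
        return n;
}
```

Under these assumptions, groupcount_simple("(a)(b)") gives 2 as expected, but groupcount_simple("[(]") gives 1 even though the pattern contains only a bracket expression and no group.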

commented

Though it seems like we will have to skip the bracket somehow, because even though ( can be escaped, the backslash will end up inside the bracket expression. I'll see if I can implement it more cleanly then.

commented

I came up with scratch code that should skip all brackets (not tested).

static int re_groupcount(char *s)
{
        int brk = *s == '[' ? 1 : 0;
        int n = *s == '(' ? 1 : 0;
        while (*s++) {
                if (!brk && s[0] == '(' && s[-1] != '\\')
                        n++;
                else if (s[0] == '[' && s[-1] != '\\')
                        brk++;
                else if (s[0] == ']' && s[-1] != '\\')
                        brk--;
        }
        return n;
}

What do you think? It should work correctly and cover all cases, right?

commented

It also has to check its own state, so that only the outermost brackets are tracked. Is this it?

static int re_groupcount(char *s)
{
        int brk = *s == '[' ? 1 : 0;
        int n = *s == '(' ? 1 : 0;
        while (*s++) {
                if (!brk && s[0] == '(' && s[-1] != '\\')
                        n++;
                else if (!brk && s[0] == '[' && s[-1] != '\\')
                        brk++;
                else if (brk && s[0] == ']' && s[-1] != '\\')
                        brk--;
        }
        return n;
}
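
As a quick sanity check, the version above can be run against the problematic regex from earlier in this thread. The sketch below copies it under the hypothetical name groupcount_brk; the name and the way the pattern is written as a C string (with \t spelled as a backslash followed by t) are assumptions.

```c
/* Copy of the bracket-tracking version above, renamed for illustration. */
static int groupcount_brk(const char *s)
{
        int brk = *s == '[' ? 1 : 0;    /* inside a bracket expression? */
        int n = *s == '(' ? 1 : 0;      /* number of groups seen so far */
        while (*s++) {
                if (!brk && s[0] == '(' && s[-1] != '\\')
                        n++;
                else if (!brk && s[0] == '[' && s[-1] != '\\')
                        brk++;
                else if (brk && s[0] == ']' && s[-1] != '\\')
                        brk--;
        }
        return n;
}
```

With these assumptions, groupcount_brk("([^\\t -,.-/:-@[-^{-~]+:).+;") gives 1, which matches the single group in that pattern, and groupcount_brk("[(]") gives 0.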

commented

Seems to be good enough. This of course assumes that parentheses and brackets are balanced, but if they are not, too bad: the regex won't compile either way.

commented

Okay, screw this. It seems the cleanest solution is to implement bracket escapes in my regex engines after all. I could parse and run proper nested looping and balance the ( and [ to accommodate all the edge cases, but that alone is too much to take on. On the other hand, supporting escapes in brackets is just one if statement.

commented

Hello Ali,

I don't know if this is any good.
In my forks, I just made escapes work inside the brackets. So you now have to escape "(" inside brackets (sometimes) for rset to count the groups correctly. I leave that responsibility to the user; the user has to know when an escape is needed.

But with that I have a very clean codebase: my re_groupcount() is a straightforward implementation that just counts non-escaped "(".
See here regex.c, kyx0r/nextvi@739089d
The escape implementation is cheap, just one if statement (way better than doing all that stuff in your patch).
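
For illustration, counting only non-escaped "(" could look roughly like the sketch below. This is not the code from the nextvi commit; the name count_unescaped_parens is hypothetical, and skipping the character after each backslash (rather than looking at s[-1]) is a design choice assumed here so that an escaped backslash followed by "(" is still counted.

```c
/* Hypothetical sketch: count "(" that is not preceded by an escaping backslash. */
static int count_unescaped_parens(const char *s)
{
        int n = 0;
        for (; *s; s++) {
                if (s[0] == '\\' && s[1])
                        s++;            /* skip the escaped character */
                else if (s[0] == '(')
                        n++;
        }
        return n;
}
```

With this approach the pattern [\(] counts as zero groups, while an unescaped [(] counts as one, which is why the user has to escape "(" inside brackets.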

Commit history is a bit messy because I was working on 4 different things at the same time,
and this bug was really annoying and blocking in the way of things. I had to put escapes in some regexes in conf.c. I hope I didn't mess anything up there for hls like tex and those dirmarks, because I don't know how they behaved before. It seems stuff like

{+0, +1, 1, "\\\\\\*\\[([^]]+)\\]"},

is rather niche; I suppose you wanted the text not to be reversed if it's inside \*[]? The ([^]]+) part is ambiguous, but I changed it to ([^\\]]+) with an escaped ], so it isn't ambiguous anymore. (Fun fact: pikevm actually treated that expression as an empty [] regex before I implemented escapes; it might even be bugged in your version of neatvi right now.) I made similar changes to other expressions in conf.c.

Also take a look at this commit: kyx0r/nextvi@f13348d

Yes, yes, RIP bracket classes. I don't implement them anymore; the readme explains why.

At the end of the day, it's up to you to decide how you want to fix it in your version. I went with the simplest solution possible.

commented

I haven't read the POSIX manpage, specifically the part about \ having special meaning. I will read it now.

But I mean, at this point, I don't really try to be 100% compliant with every nitpick in the standard.
I'm just trying to do the things that make sense, i.e. make the implementation cleaner and more maintainable by removing
code that isn't impactful or essential. Yes, the downside may be that I could theoretically get different behavior if I swapped the regex engine for the C library implementation, for example. But I see no reason to ever do so. My regex code is 3X faster than the musl C regex library and is 630 LOC, while theirs is like 3000+ LOC. You can only imagine how bloated the glibc regex is, judging by musl.

One thing about \: the code still lets you get a literal one, you just have to put it twice. The first one is the escape and the second one is treated as a regular character inside the bracket.

commented

"all other special characters, including '\', lose their special significance within a bracket expression." regex(7)
Ok, I misread your email. That's basically it, nothing new here. If I don't comply with that, I guess the escapes will be treated as characters in other POSIX-compliant engines.

commented

Okay, I might have changed my mind on this topic again...
Ehh, so I was thinking in the abstract about how my proposed code does not work correctly on cases like ([a(s])-(d])
But this is inherently ambiguous: does the bracket end at [a(s] or does it extend to [a(s])-(d] ?
Given such expressions, the regex engine will always pick the shortest bracket end, in this case
[a(s]. The same principle applies to the algorithm I wrote there. That means it should always
come up with the same number of groups that regcomp will, following the shortest path possible. And that is the
problem we are trying to solve here. Bracket classes always end with :], so : is the escape
character for them (sort of). I don't include *] and =] because neatvi does not use those,
and adding support for them would be a matter of adding one more && condition. But then
there is another problem: [:] alone will still make it count groups incorrectly.

static int re_groupcount(char *s)
{
        int brk = *s == '[' ? 1 : 0;
        int n = *s == '(' ? 1 : 0;
        while (*s++) {
                if (!brk && s[0] == '(' && s[-1] != '\\')
                        n++;
                else if (!brk && s[0] == '[' && s[-1] != '\\' && s[1] != ':')
                        brk++;
                else if (brk && s[0] == ']' && s[-1] != '\\' && s[-1] != ':')
                        brk--;
        }
        return n;
}
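
To make the cases discussed in this comment concrete, the sketch below copies the version above under the hypothetical name groupcount_cls; the name and the sample patterns are assumptions, and [:] is assumed to be read by the regex engine as a bracket expression holding a single colon.

```c
/* Copy of the version above with the ':' checks, renamed for illustration. */
static int groupcount_cls(const char *s)
{
        int brk = *s == '[' ? 1 : 0;    /* inside a bracket expression? */
        int n = *s == '(' ? 1 : 0;      /* number of groups seen so far */
        while (*s++) {
                if (!brk && s[0] == '(' && s[-1] != '\\')
                        n++;
                else if (!brk && s[0] == '[' && s[-1] != '\\' && s[1] != ':')
                        brk++;
                else if (brk && s[0] == ']' && s[-1] != '\\' && s[-1] != ':')
                        brk--;
        }
        return n;
}
```

Under these assumptions, groupcount_cls("([[:alpha:]]+)(x)") and groupcount_cls("([a(s])-(d])") both give 2. The remaining problem shows up with groupcount_cls("[:](a)"), which gives 0: the ] after the colon is never treated as closing the bracket, so the following group is skipped.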

Finally, enough thinking about this, or my brain will melt.
It truly makes me appreciate the simplicity of the solution that escapes in brackets provide.