skvadrik / re2c

Lexer generator for C, C++, Go and Rust.

Home Page: https://re2c.org

POSIX captures processing

dtp555-1212 opened this issue · comments

I have a '.re' file that processes reasonably quickly and without error when I don't use the '-P' flag to enable POSIX captures. When I enable it, I notice two behaviors: it gets substantially slower (e.g. minutes vs. seconds), and in the worst case I get a crash due to 'bad_alloc'.

The third behavior is that when POSIX captures are enabled, a message is issued saying that 'implicit grouping' is forbidden. This leads me to a theoretical enhancement that may yield a speed improvement and smaller output as well.

I 'think' the implicit-group rule is arbitrary. I understand why such a restriction exists, but there may be a way to accomplish both goals.

A named definition e.g.
num = [0-9]+;

num { return 1; }

is a way to facilitate reuse and readability, rather than …

[0-9]+ { return 1; }

Using a named definition has other benefits as well: it also conceptually serves as a way to 'group' without using () and their defined POSIX capture meaning.

In the case of POSIX captures, enforcing the explicit grouping has a wasteful side effect: it 'must' track the subpattern start and stop for every named definition.

Since it is possible to create a valid and unambiguous grammar without the extra explicit grouping (e.g. being forced to write num = ([0-9]+);), forcing the explicit form takes away some potential. Using () only when you want to explicitly gather the substring would reduce the size of the output, speed up the processing, and reduce yypmatch to only what is desired to be saved.

For example…
num = [0-9]+;

(num) ' ' (num) { return 1; }

Using () only when and where you want to capture gives full control and reduces the waste. Of course, this is a toy example, but you can see the 'big' negative effect even with a relatively small grammar (e.g. your unicode_identifier.re example): without -P it processes in seconds, but with -P (and after you add the () to satisfy the program) it takes minutes. It also results in yynmatch of 6 (since it is tracking all the subpatterns, which in this case are only there to facilitate the definitions and are not desired 'keeper' subpatterns) rather than just the 1 that bounds the identifier.

I love what you have done. Hopefully such a change is possible, and it can result in a dramatic speed-up of both re2c processing and the runtime result.

Thanks for your consideration

Hi @dtp555-1212, this is a reasonable request. I need to experiment to see if there are any implementation difficulties, but I don't have objections to this in principle.

Just keep in mind that with POSIX disambiguation you still need the full hierarchy of implicit groups if you have a nested capturing group, due to the complex hierarchical way disambiguation works. Therefore you sometimes pay the processing overhead for implicit groups that are not even present in your regexp.

Thanks

FYI, with further investigation, it appears some other regular expression parsers have adopted the (?: convention for 'non-capturing groups'... that may help make the intent clear on a case-by-case basis.

I have pushed a fix to allow the use of named definitions for implicit grouping: f519385. Please try it with your problematic example and let me know if it helps or not.
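For example, if I understand your earlier toy rule correctly, it should now be accepted with -P as written, without wrapping the definition body in parentheses:

num = [0-9]+;
(num) ' ' (num) { return 1; }

The parentheses at the use sites still produce captures (plus the implicit whole-rule group), but the named definition itself no longer needs explicit grouping.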

FYI, with further investigation, it appears some other regular expression parsers have adopted the (?: convention for 'non-capturing groups'... that may help make the intent clear on a case-by-case basis.

Right, this was my next question: since we allow implicit non-capturing groups, it would make sense to allow explicit non-capturing groups as well. I recalled the discussion we had previously in #308 and the variants proposed there; (? seems reasonable and this syntax should not collide with any existing syntax in re2c (I will check to make sure).

I did not check the runtime functionality, but I did check the processing time. It did indeed improve, from 6m4s down to 4m35s, but that compares with only 2.5s with POSIX captures disabled (for the unicode example file)... Since I intend to have a large grammar, I think the long processing time (which I suspect does not grow linearly) won't work for me personally, but it should help others with small grammars. P.S. I hacked together a version of the grammar that uses 'stags' that does what I need and processes the grammar much more quickly. It makes my actions/rules more tedious to write, but it seems like my best option for the moment. If there were some hybrid that had the elegance of the POSIX capture syntax and the speed of the stags (e.g. the first ( is @t1 under the hood, the second ( is @t2, ... and the results get put into yypmatch, etc.), I would love to try that out. Thank you so much for your quick turnaround, and for the service you provide.

I did check the processing time. It did indeed improve, from 6m4s down to 4m35s, but that compares with only 2.5s with POSIX captures disabled (for the unicode example file)...

That sounds about right. POSIX disambiguation is algorithmically complex, and it has to add implicit capturing groups from every capturing group nested in a sub-regexp up to the top of the regexp (these added groups do not result in yynmatch entries, but they participate in the disambiguation algorithm).

If there were some hybrid that had the elegance of the POSIX capture syntax and the speed of the stags (e.g. the first ( is @t1 under the hood, the second ( is @t2, ... and the results get put into yypmatch, etc.), I would love to try that out.

It should be possible to allow capturing parentheses syntax with leftmost-greedy disambiguation, which is used for @stag and #mtag. If you are after the POSIX syntax, not POSIX disambiguation semantics, things are easier. I'll investigate.

Meanwhile, I added syntax (? ...) for non-capturing groups: 1edd25d.
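For example (a hypothetical identifier definition; the alternations need grouping only for precedence, so (? ...) avoids creating captures):

id = (? [a-zA-Z] | '_') (? [a-zA-Z0-9] | '_')*;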

thanks

Let's keep this open while I'm working on leftmost-greedy captures.

@dtp555-1212 I pushed a commit to https://github.com/skvadrik/re2c/commits/master that adds a new option --leftmost-captures. You can use it instead of --posix-captures to get the POSIX syntax for capturing parentheses (now also non-capturing ones) with leftmost-greedy disambiguation, which should be approximately as efficient as your implementation with tags.
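Usage would look something like this (file names are made up):

re2c --leftmost-captures -o lexer.c lexer.re

with rules such as

num = [0-9]+;
(num) ' ' (num) {
    // the two nums are bounded by yypmatch[2]/yypmatch[3] and yypmatch[4]/yypmatch[5],
    // assuming the implicit whole-rule capture occupies yypmatch[0]/yypmatch[1]
    return 1;
}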

Also note that I have pushed a fix b813c9b for non-capturing groups with POSIX disambiguation.

If you have time, let me know how both options (--posix-captures and --leftmost-captures) run on your example.

First off, I want to say thanks for such a quick turnaround on the enhancement. It is a testament to how well you know your codebase, your openness to new ideas, and your programming skill.

I have converted my grammar to use the new leftmost-capture syntax… here are my observations…

  1. The processing speed of the grammar is fine (comparable to stags)
  2. The capture group syntax is much cleaner, more intuitive, and less error-prone than the stag syntax. Rules and actions are simplified and result in smaller and more efficient code than with stags.
  3. The implicit grouping (as you have it) on the rules is handy, removing the need to explicitly wrap every rule. (It is conceivable that some use case might not want the entire rule to be wrapped, so if the non-capture group syntax works for that, I think that is a good way to go… I did not try that specifically.) FYI, if people explicitly wrap the entire rule, it results in a redundant entry in yypmatch.
  4. I think other regular expression parsers are using '(?:' rather than '(?'… if you ever wanted to extend the options to match some of those systems, it may cause a future conflict.
  5. Having to explicitly exclude all the groupings in the named definitions is a bit tedious, and I think it could be a potential source of errors when people forget to use the non-capture format. Forgetting to mark them as non-capture can result in 'significant' slowdowns, both at processing time and at runtime, for larger grammars, and the user may be unaware of it. (Alternatively (but not optimally), this might help mitigate it: being able to disable the leftmost-capture effect in the named definitions and only getting the effect in the rule/action section.) e.g.
    digit = ([0-9]+); is a non-capture grouping,
    whereas
    ([0-9]+) { return …; } is doing the leftmost capture.

Another suggestion to mitigate the downsides of 4 & 5 (you can let me know if you see a problem in theory): if the leftmost-capture symbols were something like < > instead of ( ) (or any non-conflicting pair of characters), the () groups could happily coexist, which would remove the need to explicitly mark all the other simple groups as non-capture, making them easier to read and simpler to maintain.

If that is possible, I have one other extension that would be well suited to the new syntax and would simplify the rules for many use cases… Often, it would be very handy to have a payload or breadcrumb associated with each captured span; maybe it conveys some contextual or semantic meaning. Currently, the only place to attach such a thing is in the action, but often the optimal place to associate that information is at the capture group level. At exactly the same places in the code where it remembers the group boundary, it could also remember a breadcrumb, which would be very helpful (and would result in less action code, potentially reducing the number of end states). Assuming the < > syntax, a capture could have an optional payload/breadcrumb that is remembered at the same time, and when the final pass fills yypmatch, there would be a 1-to-1 corresponding storing of the remembered breadcrumbs in some array, maybe yypayload (or whatever)… I think the simplest implementation, with the payload being an integer value (unsigned | unsigned long | etc.), would be sufficient for most things. And the payload itself could be a value, or optimally something like a bitwise-or'ed sequence of named constants (e.g. HEX|NUM) that results in the value being remembered. In fact, re2c would simply pass through the payload/breadcrumb and let the compiler resolve the expression.

So maybe it looks something like this…

<'0x'[0-9]+> # this is a simple capture with no payload (or a payload of 0, if payloads are enabled)

<HEX|NUM: '0x'[0-9]+> # this is a capture with a payload… internally, it might look like this…
yyt1 = YYCURSOR; // (or whatever the normal thing to do is)
yyp1 = HEX|NUM;

then at the final stage, something like
yypmatch[0] = yyt1;
yypmatch[1] = YYCURSOR;
yypayload[0] = yyp1;

Let me know what you think.

P.S. Even without the <> extension, I will definitely be using the new leftmost-captures option. Thanks again.

@dtp555-1212 Thanks for your thoughtful answer!

Regarding 5, I would also prefer it if the default parentheses in POSIX syntax were non-capturing, but we should respect the universal default (that simple parentheses are capturing), and we should keep --posix-captures and --leftmost-captures symmetric in that regard (because that will be the user's expectation). However, I can add an option --invert-captures that would flip the default, making the usual parentheses non-capturing and the marked ones capturing. It would be the re2c way, as we already have one default-flipping option for the syntax of case-(in)sensitive string literals.

Regarding 4, if we go with --invert-captures option, then a bang seems like a good syntax: (! ...) instead of (? ...) as the question mark suggests optionality. I don't want to copy the verbose (?: ...) syntax as re2c does not attempt to be syntax-compatible with regexp libraries, and shorter syntax is a more important property here.

Regarding angle brackets for capturing parentheses: they might conflict with start conditions, and they are a bit too exotic.

Regarding payloads, what would you gain by having yypayload[i] available in the semantic action as compared to having only yypmatch[i] there? Adding extra work to transitions instead of semantic actions is bad for performance (a semantic action is executed only once, while the actions on transitions may be executed multiple times, and may end up matching a different rule in the end). Also note that due to the possibility of not matching a capture, you would need some sort of default value for yypayload.
Also, I think integer numbers are too specific. The only generic mechanism that would work for all users is attaching a mid-rule semantic action, but that is not possible in general (only at certain deterministic points in the regexp).

Let me know if you think --invert-captures would be useful, it shouldn't be difficult to add. It might also take care of the implicit entire-rule capture (meaning, we can make the entire-rule group non-capturing if --invert-captures is set).

Since you are OK with not maintaining compatibility with other regex parsers, I kind of like the '(!' syntax, as it stands out more than '(?' and, as you note, '?' does suggest a completely different meaning.

My observation is that in the named-definitions section (e.g. num=[0-9]+) captures are the rarity and the definitions are more complex, so (?(?(?… (or (!(!(!…) just makes them harder to read. When () exist there, they are just simple groupings. In the rule/action sections, captures are very common, but the rules are relatively simple, so an added explicit character for the capture conveys intent without adding a lot of verbosity (e.g. (!num) ' ' (!num)).

The potential issues I see with the invert are…

  1. The default behavior is currently to capture the entire span, which is handy. So either that does not get affected by the flag, OR each rule needs to have an explicit capture when the flag is inverted.
  2. Having the invert flag independent of the grammar description makes the definition depend on the user matching the intent in the build, rather than being self-defining. Also, when a programmer is reading the file, they have to know the intended switch setting to understand the intent. Also, unlike case sensitivity, inverting the capture regions is not functionally similar… e.g.

(!xxx) (yyy) (!zzz)
!=
(xxx) (!yyy) (zzz)

so flipping the switch would also require inverting all the captures in the grammar too.

With those issues in mind, a global invert might not be as valuable in practice (unless the default behavior causes more work in the grammar). E.g., if I only had a global switch, I would set it to 'not capture' by default and mark my captures explicitly with (!, as this matches my observations above (i.e. the least verbosity in most of the file, and meaningful information, e.g. intent to capture, where '(!' is wanted).

Hope that helps on that question.

As for the ‘payload’…

  1. I agree that just an integer is limiting, but I didn't want to muddle the concept or increase the effort to support every kind of payload.
  2. example of use…

hex = ('0x' [0-9]+);
bin = ('0b' [0-1]+);
// oct, float, commaFormatted, …
num = (hex | bin | …);

num { return NUM; }

In the above, you lose the information of what type the number is (OR you would have to write separate rules/actions for each type, rather than being able to simplify the grammar with reuse)… That brings up a question: is there any performance advantage/penalty, either at processing time or at runtime, between enumerating all the variations as separate rules and reusing definitions? Is one better than the other, or are they the same?

If you had a payload… maybe it looks something like this…

hex = (!HEX: '0x' [0-9]+);
bin = (!BIN: '0b' [0-1]+);
// oct, float, commaFormatted, …
num = (hex | bin | …);

num { return NUM; /* and the payload is known in yypayload[0] */ }

Hope that helps

Let me know if you have any questions

I don't fully understand your problem with using --invert-captures:

With those issues in mind, a global invert might not be as valuable in practice (unless the default behavior causes more work in the grammar). E.g., if I only had a global switch, I would set it to 'not capture' by default and mark my captures explicitly with (!, as this matches my observations above (i.e. the least verbosity in most of the file, and meaningful information, e.g. intent to capture, where '(!' is wanted).

So you just add the flag globally (as an option) and use the (! ...) syntax for captures. Anyone looking at your grammar would probably guess that you flipped the default from the way you use yypmatch in semantic actions, or they would discover the flag.

It is the same story with --case-inverted: strings 'xyz' and "xyz" flip their case-insensitivity, so you need to know about the flag to know what they mean. And it's the same story with regexp libraries and BRE/ERE syntax in POSIX: you need to know the options in order to interpret the syntax.
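For example (with made-up keyword literals):

'if'  // case-insensitive by default: matches if, If, iF, IF; case-sensitive under --case-inverted
"if"  // case-sensitive by default; case-insensitive under --case-inverted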

I cannot make non-capturing the default behavior, since we already have --posix-captures where the default is capturing, and changing that would break backwards compatibility. Nor can I make the default different for --posix-captures and --leftmost-captures.

As for the ‘payload’…

I understand your use case, but you could just as well add a stag to differentiate between the alternatives:

hex = '0x'[0-9]+;
bin = '0b'[0-1]+;
// oct, float, commaFormated, ...
num = @h hex | @b bin | ...;

num { return h ? HEX : b ? BIN : ...; }

Or use a capture around hex and bin and inspect the yypmatch entries in the same way as tags (it would be slightly less efficient than stags, because it would add the pairing tag at the end of each capture, which is not needed in this example).

However, if possible (and in this example it is), then it is much better to use distinct rules for alternatives.
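For example, with your hex/bin definitions, distinct rules would look like this (a sketch):

hex { return HEX; }
bin { return BIN; }

Each alternative becomes its own final state in the DFA, and no tags are needed to tell them apart.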

Using tags in your example may be slightly less elegant, but it is a very special case, and functionally tags do just as well. Note also that implementing payloads would be really difficult compared to the added value for most users. It would require changes throughout the whole parser (new syntax), all intermediate representations (a new regexp construct, changes on NFA and DFA transitions) and finally codegen. Users would also need to learn the new syntax and understand how it can be used. And the added value is a bit of syntactic sugar, which does not really justify the maintenance cost of all the added complexity. I hope that clarifies why I don't support the payloads idea.

I am not saying don't add the invert option. I am saying that in practice, 'if' the default is non-capture (which I think makes the most sense, i.e. it makes for the most normal, compact and readable grammar description, e.g. only using (! to mark the specific captures), I would never flip that switch. And if someone did flip that switch, unlike case sensitivity, the results are completely different: the stuff being captured is suddenly the portions between what was being captured in the defined grammar, which I see little utility for. Case inversion, on the other hand, is very handy: you can define an entire grammar as case sensitive and then toggle it with no change to the grammar, ending up with a very usable grammar that is simply more tolerant, without having to change the grammar itself; i.e. it would do exactly what it did before, and more.

And one subtle point: since the current default does an unmarked capture of the entire span, flipping the switch either has to ignore the inversion and continue to do so, OR honor the flag and do no capture unless explicitly told to. Not insurmountable; we just need to define what the behavior should be. I like the default as you have it, but adding one more explicit (! for each rule would work too, if need be.

If you can’t make the default non-capture, then yes I would like to have the toggle to flip it to be so.

Hope that clarifies the point.

As for the payload, your ideas to use a stag and/or the yypmatch capture are interesting, but I would like to confirm a few things, as I see some potential issues…
For the stag idea…

  1. Do I need to create unique tag names for every breadcrumb/payload I want to leave, or only unique within a given parse path? When/how do the stags get initialized? If there are more than just a few of these, OR if they are all initialized all the time, is that wasteful both at processing time and at runtime? Plus the time to evaluate all of the tag variables in the path. My gut says this is sub-optimal from a performance/space standpoint.

For the yypmatch idea…
  2. Since everything that would have a payload would already have captures anyway, I think the price has already been paid, so 'maybe' the only extra processing is in the rule to figure out what was actually filled, which would be required for the stag idea too, but all the items to compare would already be in yypmatch? (Also, I think I remember seeing a mix of either NULLs or spans that had the same start/end tag indicating empty captures. Are both expected?)

And the other option I mentioned is to make redundant rules that differ only by the specific capture type (e.g. one rule for hex, one for oct, one for …). Since this forces each to have a unique rule/action pair, what is the implication for processing time and runtime/space? Since I am anticipating a large grammar, I am a bit sensitive to the little things that add up when scaling, especially at runtime.

If the runtime performance/space cost is negligible for the redundant rules, that may be my best option lacking actual payload functionality. But if you say it would have substantial waste, then my guess is that the yypmatch idea 'may' be a fallback, as the extra performance overhead is at least limited to the number of captures in the parse path, correct?

Tell me where you think my performance/space guesses are wrong.
So, if my understanding is correct, then I think you can see the payload idea is not really syntactic sugar, but a way to actually get smaller and faster parsers. (But I understand that it is indeed effort, so I totally understand if you don't want to implement such a thing in the mainstream re2c.)

And to confirm, other than the effort/maintenance, the idea is indeed possible, correct?

P.S. Speaking of performance/space, here are a couple of random observations…

  1. Since many (if not all) of the captures start at the start of the input string, there may be an optimization of initializing yyt1 once at the top of the processor, instead of at the start of each capture state that represents the start of the input… for large grammars, this can make a big difference in space. (In hand-crafted parsers I have made in the past, this trick worked very well.)
  2. Not related to captures: this may be known to you already, but it appears that different grammar definitions that produce the exact same result can have different processing-time stats, sometimes substantially…

e.g.
P+ OR P{1,}
vs
P P*

The top two take about the same time to process, whereas the last form takes about twice as long to process.

Are there other known cases where different forms of grammar syntax (that do the same thing) have a processing-time pro/con?

Looking at some more generated code while trying to find the optimal solution, I notice you don't merge 'identical' action states... This would be a nice optimization for saving a lot of space: in some examples almost a third of the code is redundant, i.e. the setting of yynmatch and yypmatch, and the code for the actual action (which are all the same). Since I thought the system already did this optimization, 'payloads' were a way to avoid making the final action different. (And I can see that is why you thought it was just syntactic sugar, rather than a way to reduce the program size 'substantially'.) So, the penalty for enumerating all the rule differences is indeed pretty severe. Hope that gives some food for thought.

I have tried all the options and they all have some size/space penalty. HOWEVER, I think I have a way that, if you are going to fix the redundant-code issue, would 'not' require changes all the way through the code stack and 'doesn't' change the syntax. First let me show you a small example of the redundant code, so you know what I mean…

Prior to every user action, there can be a list of yynmatch and yypmatch assignments… oftentimes these are redundant (and with lots of captures they can be long)...

yy#:
yynmatch = 10;
yypmatch[0] = yyt1;
yypmatch[1] = YYCURSOR;
...
yypmatch[8] = yyt4;
yypmatch[9] = yyt5;
{ return 1; }

Note: this exact pattern may be repeated 'many' times in the resulting source file. Of course, for space saving, all the exact matches can be merged. Likewise, if all the yy settings are the same, even 'some' not-exact user actions can be optimized by 'lifting out' the starting difference, leaving a potentially 'common/complex' user action… e.g.

yy#:
yynmatch = 10;
yypmatch[0] = yyt1;
yypmatch[1] = YYCURSOR;
...
yypmatch[8] = yyt4;
yypmatch[9] = yyt5;
{ uniqueCode=1; return 1; }

BECOMES

uniqueCode=1;
goto yy###; // a new common state that contains all the redundant code

yy###: // this code is shared by potentially 'many' states
yynmatch = 10;
yypmatch[0] = yyt1;
yypmatch[1] = YYCURSOR;
...
yypmatch[8] = yyt4;
yypmatch[9] = yyt5;
{ return 1; }

This would give you a general-purpose optimization… If you are not comfortable with a totally general optimization, the 'unique code' could be limited by convention, e.g. only the first statements that match a specific pattern (e.g. PAYLOAD[^;];) are lifted out; OR better yet, maybe the list of initial assignment statements up until any complex or branching statements like if, switch, etc…

Let me know what you think

Do I need to create unique tag names for every breadcrumb/payload I want to leave, or only unique within a given parse path?

Tag names should be unique within a rule.

When/how do the stags get initialized? If there are more than just a few of these, OR if they are all initialized all the time, is that wasteful both at processing time and at runtime?

For a given rule, if it matches, all tags in it are guaranteed to be initialized (whether to an offset/pointer in the input, or to some default value). Tags of non-matching rules may not be initialized (but they also cannot be used in the semantic action of this rule).

Plus the time to evaluate all of the tag variables in the path. My gut says this is sub-optimal from a performance/space standpoint.

It would be exactly the same with payloads. There is no way around the overhead of nondeterminism. It depends on the grammar: in some cases it is possible to have just one tag assignment, in other cases multiple assignments are needed. Also, re2c performs quite a few tag optimizations on top of the TDFA model.

Since everything that would have a payload would already have captures anyway, I think the price has already been paid, so 'maybe' the only extra processing is in the rule to figure out what was actually filled, which would be required for the stag idea too, but all the items to compare would already be in yypmatch?

Yes, if you already have a tag there, you can just use the tag to set the payload. But this could also easily be done by the user in the code of the semantic action (and more generally, with any other logic based on tag inspection). However, for re2c to add new syntax for payloads and pull it through all the intermediate representations to the codegen would be a lot of changes in the code.

(Also, I think I remember seeing a mix of either NULLs or spans that had the same start/end tag indicating empty captures. Are both expected?)

I'm not sure I understand. A tag can be NULL or non-NULL (a pointer into the input). Both values may or may not be expected depending on the regexp and the location of the tag in it (some tags can never be NULL). For a pair of tags that represent a capture: if they are both NULL, the capture didn't match; if they are both non-NULL, the capture did match (and if they are equal, the capture matched an empty substring).
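In code, the check might look like this (a sketch, assuming the capture's tags land in yypmatch[2] and yypmatch[3]):

if (yypmatch[2] == NULL) {
    // the capture did not participate in the match (yypmatch[3] is NULL as well)
} else if (yypmatch[2] == yypmatch[3]) {
    // the capture matched an empty substring
} else {
    // the capture matched the substring [yypmatch[2], yypmatch[3])
}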

And the other option I mentioned is to make redundant rules that differ only by the specific capture type (e.g. one rule for hex, one for oct, one for …). Since this forces each to have a unique rule/action pair, what is the implication for processing time and runtime/space? Since I am anticipating a large grammar, I am a bit sensitive to the little things that add up when scaling, especially at runtime.

Having separate rules is completely fine; I expect that it will be more efficient than using tags within one rule to differentiate between alternatives in the semantic action of that rule. Separate rules don't add any work on transitions; they simply add separate final states to the DFA. Tags should be used for more subtle things, when some information is needed within one rule.

So, if my understanding is correct, then I think you can see the payload idea is not really syntactic sugar, but a way to actually get smaller and faster parsers. (But I understand that it is indeed effort, so I totally understand if you don't want to implement such a thing in the mainstream re2c.)

I don't think it would lead to smaller and faster parsers. If I did think this way, I would be prepared to go the extra mile, as re2c's main focus is on generating fast code. But payloads must use the same mathematical model as tags (or just piggyback on existing tags), and therefore they would be less efficient than multiple rules.

And to confirm, other than the effort/maintenance, the idea is indeed possible, correct?

Yes, it would be possible to implement on top of tags (or in the same way as tags).

since many (if not all) of the captures start at the start of the input string, there may be an optimization of initializing yyt1 once at the top of the processor, instead of at the start of each capture state that represents the start of the input…

re2c already does compiler-like optimizations on tags, which includes:

  • hoisting tags from transitions to states
  • minimizing the number of tag variables (e.g. representing multiple tags with the same variable)
  • removing dead tag assignments
  • eliminating tags that are within a fixed distance of another tag and can be computed based on the other tag by adding a fixed offset
  • allocation of tag variables similar to register allocation with copy coalescing
  • DFA minimization

and some more. You can compile with --no-optimize-tags to see how much worse it would be without those optimizations.
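For example, to compare the two (made-up file names):

re2c --posix-captures -o lexer.c lexer.re
re2c --posix-captures --no-optimize-tags -o lexer-noopt.c lexer.re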

Not related to captures: this may be known to you already, but it appears that different grammar definitions that produce the exact same result can have different processing-time stats, sometimes substantially…

e.g.
P+ OR P{1,}
vs
P P*

The top two take about the same time to process, whereas the last form takes about twice as long to process.

Are there other known cases where different forms of grammar syntax (that do the same thing) have a processing-time pro/con?

re2c constructs an NFA and then converts it to a DFA. It does some primitive simplifications on the regexp before constructing the NFA, but it does not attempt to bring a regexp to some normal form (in the presence of tags such transformations could ruin user-defined semantics of the regexp).

Looking at some more generated code while trying to find the optimal solution, I notice you don't merge 'identical' action states...

re2c does this since version 0.16 (DFA minimization). It also does this in the presence of tags. If you observe two states that are not merged, then it means that they are not equivalent (can lead to different outcomes).

in some examples almost a third of the code is redundant ...

Please share this example (create a new issue and attach your source grammar and the generated code there as files, not inline comments, and also explain which identical states are duplicated).

So, the penalty for enumerating all the rule differences is indeed pretty severe. Hope that gives some food for thought.

No, I don't understand from a generic description. I need an example to see what you mean, and a modification of this example that would demonstrate how payloads would solve the problem.

Also, note that when optimizing for code size, you should rely on the stripped binary size, not the generated source code which may sometimes be misleading.

Prior to every user action, there can be a list of yynmatch and yypmatch assignments… oftentimes these are redundant (and with lots of captures they can be long)...

re2c has no way of knowing which captures/tags the user is going to use in the semantic action, and which are not needed (it has no idea what happens in the semantic action, so it assumes that all tags and captures are needed). The user should only use capturing parentheses/tags if they need them. Also note that the compiler, contrary to re2c, is able to eliminate unnecessary assignments quite easily.

Of course, for space saving, all the exact matches can be merged.

No, they cannot, as the final states are all different (they have different semantic actions) and cannot be merged or have a common part. Even if you moved those assignments to a function, it would either get inlined, or it would introduce function call overhead, which is much worse for performance.

Likewise, if all the yy settings are the same, even 'some' not-exact user actions can be optimized by 'lifting out' the starting difference, leaving a potentially 'common/complex' user action… e.g.

As I explained above, re2c performs this optimization to the extent possible.

I think I see the source of confusion.

a new common state that contains all the redundant code

This is the problem. Once you jump to that common state, how would you transition out of it back to the correct states? You would either have to re-match the last character, or save the state some other way and dispatch on it, which is costly.

In certain cases merging similar states can indeed be done (with re-matching of the character when transitioning out of the common state). This is called "tunneling" or a "tunnel automaton", and re2c already does it where possible. In the presence of tags this is much harder to do, since the states also differ in tagged transitions, making them even less alike.

FYI, I wrote the postprocessor... I didn't do anything exotic to maximize the reduction, but just doing the straightforward thing as described, it reduced the lines of code by 561 and reduced the size of the executable by over 24K on the first file I ran it on (of course results will differ based on the grammar). It turned out to be a single pass, since the first instance of the code is simply jumped to when/if a redundancy is found. So, it never adds any overhead (other than a label) to the code.

P.S. And adding in the yyt1 optimization saved over 25K.

Can you attach an example of a .re file and the generated source code with / without the optimization, so that I can understand what your optimization does? Note that you can remove all the user-defined code and leave just the re2c section if you feel uncomfortable posting the original example here. I'm definitely interested to see what you've done, but it's hard to understand from a textual description without seeing the real-world code.

re.zip
In the zip file are before.re, before.c, after.c, and beforeAfter.dif
This is a small artificial example that shows 3 unique rules, two of which are reduced (i.e. they reuse the common code of the first)... There is nothing special about the word PAYLOAD (or the label prefix _com; I used it for commonCode, just so I could easily find it when writing the postprocessor)... Both can be anything, but since we have been using that term, I put it in the example. Hope that helps.

Thanks! I will have a look.

@dtp555-1212 On a side note, I noticed that you define tag variables manually (yyt1, yyt2 and others) --- don't do that, use the /*!stags:re2c ... */ directive to autogenerate them.
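For example (the format string here is just one possible choice):

/*!stags:re2c format = "const char *@@;"; */

re2c substitutes each tag variable name for @@, generating declarations like const char *yyt1; const char *yyt2; and so on.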

Looking at your example re.zip, I don't understand a few things:

  • In this example, you need neither tags/captures nor payloads to make that PAYLOAD assignment in the semantic action. Upon matching a rule, that information is readily available to you (e.g. the rule hex bin hex matches HEX, followed by BIN, followed by HEX).

  • If some parts of your rule were conditional, e.g. hex (bin | dec) hex, then you would need a tag to know whether bin or dec has matched. That could be easily achieved with a single stag, e.g. hex (@b bin | dec) hex { PAYLOAD[1] = b ? BIN : DEC; ... }.

  • If you use captures instead of stags, then a) it is inevitable that you will have two yypmatch entries per capture, and b) those assignments before semantic actions would be impossible for the compiler to eliminate (it is a write to the yypmatch memory, not to a local variable as in the case of stags).

  • Finally, in your example after.c you saved space by completely removing yypmatch assignments. re2c cannot do that, because it doesn't know whether user-defined code will use them (or only some of them, or none of them). re2c assumes that if the user specified a capture, then the user needs the corresponding yypmatch values.

So unless you have a more complex real-world example, I don't see why you need tags or payloads at all. And if you need them, I still think that you could do just as well with stags.

Please understand that I'm not trying to disprove your optimization, I'm trying to understand your real-world use case and to see if it can be generalized to the common case. You can help by providing a real-world example.

I would probably not use the 'global' computed-goto, since the optimal balance of space/speed is based on the density of the goto table, rather than being one-size-fits-all.

There is a configuration re2c:cgoto:threshold if you'd like to experiment with the balance (the default value is 9).
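For example (a sketch; the value here is arbitrary):

/*!re2c
    re2c:flags:computed-gotos = 1;  // same as the -g command-line option
    re2c:cgoto:threshold = 4;       // controls when nested ifs are replaced with jump tables (default 9)
*/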

It would be preferred if the 'accept case computed goto' had its own switch to enable it.

Or perhaps, a separate tunable threshold for accept states.

Again, the same request: please share your real-world example, or let me know why you cannot do that. Note that it doesn't have to compile as long as re2c can process it --- I only need the regexp grammar and semantic actions to see how you are using tags/captures or payloads, how many rules you have, etc.

I'm afraid a larger example would only make things more confusing for you, so hopefully the above will click this time.

No, what is really confusing me is a small contrived example that doesn't show a real-world use case. :)

So please attach the large example. If I get confused, it is not such a big problem.

FYI, I have tried as many combinations of switches and thresholds as I can think of to get the 'accept if-then-else chain' to change to computed gotos, with no success. (I can see the computed gotos working for other things in the file.) So, either there is a combination that I missed, or there is an omission for that specific case. (FYI, there are over 800 lines of code in that section, with a large binary search over more than 180 accept choices.) If this worked, it would be 'very' fast, since it would go directly rather than having to do the binary search.

I think I have a clue on the accept case described above... I think I have discerned a pattern: switch statements that contain 'no' extra code besides goto statements are converted. The ones where even only 1 or 2 cases have an additional assignment (e.g. yyt2 = NULL;) do not get converted; that assignment seems to be the thing that disables the use of computed gotos for the entire switch-statement block. I 'think' that is why the accept if/then/else case doesn't get optimized.

FYI, since I didn't hear back on whether you were interested in doing the optimizations, I went ahead and wrote a compiler that optimizes the output as described above (as well as generating better assembly than generic compilers for this particular use case). So far I am seeing about a 20% speed-up. Since I have a solution, I don't need re2c to optimize its output. Thanks again.

@dtp555-1212 There are a few changes I plan to make: add --invert-captures option and change syntax of non-capturing parentheses to (! ...), working on that now. Hopefully when it lands, it won't be too disruptive for you.

As for your optimization, I have asked you multiple times in this thread to provide a real-world example. It is essential that re2c development is driven by real-world use cases (you can find many of them in the test suite --- they usually make the most interesting and complex test cases). This development rule is not re2c-specific; e.g. the Linux kernel won't accept any major code changes without compelling use cases.

Anyway, I am glad you were able to optimize your code this way or that.

Option --invert-captures and the corresponding configuration re2c:invert-captures have been added, and the syntax for (non-)capturing parentheses is now (! ...).
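A sketch of the final syntax, based on the discussion above (not a tested example):

/*!re2c
    re2c:invert-captures = 1;
    num = ([0-9]+);                  // with inverted captures, plain ( ) does not capture
    (!num) ' ' (!num) { return 1; }  // (! ...) now marks the explicit captures
*/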

I have been doing some exploration, starting from first principles, of potential optimizations. I am creating minimized epsilon-NFAs from the regexps... I avoid going to a DFA, which avoids the time and 'space' of that conversion. Early signs are very promising. I realize that would be major surgery for your code, so I don't expect you would do that anytime soon (if ever), but I thought I would share the idea with you and the community anyway.

Yes, I think re2c will always stay DFA-based. It's intended for small or medium-sized lexers (e.g. for a programming-language grammar), where an optimized and compiled direct-code DFA is much faster than an NFA, and not for super-huge grammars that make DFAs impractical.

There is an experimental library, libre2c, with various NFA- and DFA-based algorithms, which shares the core codebase with re2c; it can be used as a regexp library.
