skvadrik / re2c

Lexer generator for C, C++, Go and Rust.

Home Page: https://re2c.org

POSIX captures processing

dtp555-1212 opened this issue · comments

I have a '.re' file that processes reasonably quickly and without error when I don't use the '-P' flag to enable POSIX captures. When I enable it, I notice two behaviors: it gets substantially slower (e.g. minutes vs. seconds), and in the worst case I get a crash due to 'bad_alloc'.

The third behavior is that when POSIX captures are enabled, a message is issued saying that 'implicit grouping' is forbidden. This leads me to a theoretical enhancement that may yield a speed improvement and smaller output as well.

I 'think' the implicit-group rule is arbitrary. I understand why such a restriction exists, but there may be a way to accomplish both goals.

A named definition e.g.
num = [0-9]+;

num { return 1; }

is a way to facilitate reuse and readability, rather than …

[0-9]+ { return 1; }

Using a named definition has other benefits as well: it also conceptually serves as a way to 'group' without using () and their defined POSIX capture meaning.

In the case of POSIX captures, enforcing the explicit grouping has a wasteful side effect: it 'must' track the subpattern start and stop for every named definition.

Since it is possible to create a valid and unambiguous grammar without the extra explicit grouping (e.g. being forced to write num = ([0-9]+);), forcing the explicit form takes away some potential. Using () only when you want to explicitly gather the substring would reduce the size of the output, speed up the processing, and reduce yypmatch to only what is desired to be saved.

For example…
num = [0-9]+;

(num) ' ' (num) { return 1; }

Using () only when and where you want to capture gives full control and reduces the waste. Of course, this is a toy example, but you can see the 'big' negative effect even with a relatively small grammar (e.g. your unicode_identifier.re example): without -P it processes in seconds, but with -P (and after you add the () to satisfy the program) it takes minutes. It also results in yynmatch of 6 (since it is tracking all the subpatterns, which in this case are only there to facilitate the definitions and are not desired 'keeper' subpatterns) rather than just the 1 that bounds the identifier.

I love what you have done. Hopefully such a change is possible, and it can result in a dramatic speed-up of both re2c processing and the runtime result.

Thanks for your consideration

Hi @dtp555-1212, this is a reasonable request. I need to experiment to see if there are any implementation difficulties, but I don't have objections to this in principle.

Just keep in mind that with POSIX disambiguation you still need the full hierarchy of implicit groups if you have a nested capturing group, due to the complex hierarchical way disambiguation works. Therefore you sometimes pay the processing overhead for implicit groups that are not even present in your regexp.

Thanks

FYI, with further investigation, it appears some other regular expression parsers have adopted the (?: convention for 'non-capturing groups'... that may help make the intent clear on a case-by-case basis.

I have pushed a fix to allow the use of named definitions for implicit grouping: f519385. Please try it with your problematic example and let me know if it helps or not.
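For example, if I understand your earlier toy rule correctly, it should now be accepted with -P as written, without wrapping the definition body in parentheses:

num = [0-9]+;
(num) ' ' (num) { return 1; }

The parentheses at the use sites still produce captures (plus the implicit whole-rule group), but the named definition itself no longer needs explicit grouping.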

FYI, with further investigation, it appears some other regular expression parsers have adopted the (?: convention for 'non-capturing groups'... that may help make the intent clear on a case-by-case basis.

Right, this was my next question: since we allow implicit non-capturing groups, it would make sense to allow explicit non-capturing groups as well. I recalled the discussion we had previously in #308 and the variants proposed there; (? seems reasonable and this syntax should not collide with any existing syntax in re2c (I will check to make sure).

I did not check the runtime functionality, but I did check the processing time. It did indeed improve, from 6m4s down to 4m35s, but that compares with only 2.5s with POSIX captures disabled (for the unicode example file)... Since I intend to have a large grammar, I think the long processing time (which I suspect does not grow linearly) won't work for me personally, but it should help others with small grammars. P.S. I hacked together a version of the grammar that uses 'stags' that does what I need and processes the grammar much more quickly. It makes my actions/rules more tedious to write, but it seems like my best option for the moment. If there were some hybrid that had the elegance of the POSIX capture syntax and the speed of the stags (e.g. the first ( is @t1 under the hood, the second ( is @t2, ... and the results get put into yypmatch, etc.), I would love to try that out. Thank you so much for your quick turnaround, and for the service you provide.

I did check the processing time. It did indeed improve, from 6m4s down to 4m35s, but that compares with only 2.5s with POSIX captures disabled (for the unicode example file)...

That sounds about right. POSIX disambiguation is algorithmically complex, and it has to add implicit capturing groups from every capturing group nested in a sub-regexp up to the top of the regexp (these added groups do not result in yynmatch entries, but they participate in the disambiguation algorithm).

If there were some hybrid that had the elegance of the POSIX capture syntax and the speed of the stags (e.g. the first ( is @t1 under the hood, the second ( is @t2, ... and the results get put into yypmatch, etc.), I would love to try that out.

It should be possible to allow capturing parentheses syntax with leftmost-greedy disambiguation, which is used for @stag and #mtag. If you are after the POSIX syntax, not POSIX disambiguation semantics, things are easier. I'll investigate.

Meanwhile, I added syntax (? ...) for non-capturing groups: 1edd25d.
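For example (a hypothetical identifier definition; the alternations need grouping only for precedence, so (? ...) avoids creating captures):

id = (? [a-zA-Z] | '_') (? [a-zA-Z0-9] | '_')*;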

thanks

Let's keep this open while I'm working on leftmost-greedy captures.

@dtp555-1212 I pushed a commit to https://github.com/skvadrik/re2c/commits/master that adds a new option --leftmost-captures. You can use it instead of --posix-captures to get the POSIX syntax for capturing parentheses (now also non-capturing ones) with leftmost-greedy disambiguation, which should be approximately as efficient as your implementation with tags.
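Usage would look something like this (file names are made up):

re2c --leftmost-captures -o lexer.c lexer.re

with rules such as

num = [0-9]+;
(num) ' ' (num) {
    // the two nums are bounded by yypmatch[2]/yypmatch[3] and yypmatch[4]/yypmatch[5],
    // assuming the implicit whole-rule capture occupies yypmatch[0]/yypmatch[1]
    return 1;
}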

Also note that I have pushed a fix b813c9b for non-capturing groups with POSIX disambiguation.

If you have time, let me know how both options (--posix-captures and --leftmost-captures) run on your example.

First off, I want to say thanks for such a quick turnaround on the enhancement. It is a testament to how well you know your codebase, your openness to new ideas, and your programming skill.

I have converted my grammar to use the new leftmost-capture syntax… here are my observations…

  1. The processing speed of the grammar is fine (comparable to stags)
  2. The capture group syntax is much cleaner, more intuitive, and less error-prone than the stag syntax. Rules and actions are simplified and result in smaller and more efficient code than with stags.
  3. The implicit grouping (as you have it) on the rules is handy, removing the need to explicitly wrap every rule. (It is conceivable that some use case might not want the entire rule to be wrapped, so if the non-capture group syntax works for that, I think that is a good way to go… I did not try that specifically.) FYI, if people explicitly wrap the entire rule, it results in a redundant entry in yypmatch.
  4. I think other regular expression parsers are using '(?:' rather than '(?'… if you ever wanted to extend the options to match some of those systems, it may cause a future conflict.
  5. Having to explicitly exclude all the groupings in the named definitions is a bit tedious, and I think it could be a potential source of errors when people forget to use the non-capture format. Forgetting to mark them as non-capture can result in 'significant' slowdowns, both at processing time and at runtime, for larger grammars, and the user may be unaware of it. (Alternatively (but not optimally), this might help mitigate it: being able to disable the leftmost-capture effect in the named definitions and only getting the effect in the rule/action section.) e.g.
    digit = ([0-9]+); is a non-capture grouping,
    whereas
    ([0-9]+) { return …; } is doing the leftmost capture.

Another suggestion to mitigate the downsides of 4 & 5 (you can let me know if you see a problem in theory): if the leftmost-capture symbols were something like < > instead of ( ) (or any non-conflicting pair of characters), the () groups could happily coexist, which would remove the need to explicitly mark all the other simple groups as non-capture, making them easier to read and simpler to maintain.

If that is possible, I have one other extension that would be well suited to the new syntax and would simplify the rules for many use cases… Often, it would be very handy to have a payload or breadcrumb associated with each captured span; maybe it conveys some contextual or semantic meaning. Currently, the only place to attach such a thing is in the action, but often the optimal place to associate that information is at the capture group level. At exactly the same places in the code where it remembers the group boundary, it could also remember a breadcrumb, which would be very helpful (and would result in less action code, potentially reducing the number of end states). Assuming the < > syntax, a capture could have an optional payload/breadcrumb that is remembered at the same time, and when the final pass fills yypmatch, there would be a 1-to-1 corresponding storing of the remembered breadcrumbs in some array, maybe yypayload (or whatever)… I think the simplest implementation, with the payload being an integer value (unsigned | unsigned long | etc.), would be sufficient for most things. And the payload itself could be a value, or optimally something like a bitwise-or'ed sequence of named constants (e.g. HEX|NUM) that results in the value being remembered. In fact, re2c would simply pass through the payload/breadcrumb and let the compiler resolve the expression.

So maybe it looks something like this…

<'0x'[0-9]+> # this is a simple capture with no payload (or a payload of 0, if payloads are enabled)

<HEX|NUM: '0x'[0-9]+> # this is a capture with a payload… internally, it might look like this…
yyt1 = YYCURSOR; // (or whatever the normal thing to do is)
yyp1 = HEX|NUM;

then at the final stage, something like
yypmatch[0] = yyt1;
yypmatch[1] = YYCURSOR;
yypayload[0] = yyp1;

Let me know what you think.

P.S. Even without the <> extension, I will definitely be using the new leftmost-captures option. Thanks again.

@dtp555-1212 Thanks for your thoughtful answer!

Regarding 5, I would also prefer it if the default parentheses in POSIX syntax were non-capturing, but we should respect the universal default (that simple parentheses are capturing), and we should keep --posix-captures and --leftmost-captures symmetric in that regard (because that will be the user's expectation). However, I can add an option --invert-captures that would flip the default, making the usual parentheses non-capturing and the marked ones capturing. It would be the re2c way, as we already have one default-flipping option for the syntax of case-(in)sensitive string literals.

Regarding 4, if we go with --invert-captures option, then a bang seems like a good syntax: (! ...) instead of (? ...) as the question mark suggests optionality. I don't want to copy the verbose (?: ...) syntax as re2c does not attempt to be syntax-compatible with regexp libraries, and shorter syntax is a more important property here.

Regarding angle brackets for capturing parentheses: they might conflict with start conditions, and they are a bit too exotic.

Regarding payloads, what would you gain by having yypayload[i] available in the semantic action as compared to having only yypmatch[i] there? Adding extra work to transitions instead of semantic actions is bad for performance (a semantic action is executed only once, while the actions on transitions may be executed multiple times, and may end up matching a different rule in the end). Also note that due to the possibility of not matching a capture, you would need some sort of default value for yypayload.
Also, I think integer numbers are too specific. The only generic mechanism that would work for all users is attaching a mid-rule semantic action, but that is not possible in general (only at certain deterministic points in the regexp).

Let me know if you think --invert-captures would be useful, it shouldn't be difficult to add. It might also take care of the implicit entire-rule capture (meaning, we can make the entire-rule group non-capturing if --invert-captures is set).

Since you are OK with not maintaining compatibility with other regex parsers, I kind of like the '(!' syntax, as it stands out more than '(?' and, as you note, '?' does suggest a completely different meaning.

My observation is that in the named-definitions section (e.g. num=[0-9]+) captures are the rarity and the definitions are more complex, so (?(?(?… (or (!(!(!…) just makes them harder to read. When () exist there, they are just simple groupings. In the rule/action sections, captures are very common, but the rules are relatively simple, so an added explicit character for the capture conveys intent without adding a lot of verbosity (e.g. (!num) ' ' (!num)).

The potential issues I see with the invert are…

  1. The default behavior is currently to capture the entire span, which is handy. So either that does not get affected by the flag, OR each rule needs to have an explicit capture when the flag is inverted.
  2. Having the invert flag independent of the grammar description makes the definition depend on the user matching the intent in the build, rather than being self-defining. Also, when a programmer is reading the file, they have to know the intended switch setting to understand the intent. Also, unlike case sensitivity, inverting the capture regions is not functionally similar… e.g.

(!xxx) (yyy) (!zzz)
!=
(xxx) (!yyy) (zzz)

so flipping the switch would also require inverting all the captures in the grammar too.

With those issues in mind, a global invert might not be as valuable in practice (unless the default behavior causes more work in the grammar). E.g., if I only had a global switch, I would set it to 'not capture' by default and mark my captures explicitly with (!, as this matches my observations above (i.e. the least verbosity in most of the file, and meaningful information, e.g. intent to capture, where '(!' is wanted).

Hope that helps on that question.

As for the ‘payload’…

  1. I agree that just an integer is limiting, but I didn't want to muddle the concept or increase the effort to support every kind of payload.
  2. example of use…

hex = ('0x' [0-9]+);
bin = ('0b' [0-1]+);
// oct, float, commaFormatted, …
num = (hex | bin | …);

num { return NUM; }

In the above, you lose the information of what type the number is (OR you would have to write separate rules/actions for each type, rather than being able to simplify the grammar with reuse)… That brings up a question: is there any performance advantage/penalty, either at processing time or at runtime, between enumerating all the variations as separate rules and reusing definitions? Is one better than the other, or are they the same?

If you had a payload… maybe it looks something like this…

hex = (!HEX: '0x' [0-9]+);
bin = (!BIN: '0b' [0-1]+);
// oct, float, commaFormatted, …
num = (hex | bin | …);

num { return NUM; /* and the payload is known in yypayload[0] */ }

Hope that helps

Let me know if you have any questions

I don't fully understand your problem with using --invert-captures:

With those issues in mind, a global invert might not be as valuable in practice (unless the default behavior causes more work in the grammar). E.g., if I only had a global switch, I would set it to 'not capture' by default and mark my captures explicitly with (!, as this matches my observations above (i.e. the least verbosity in most of the file, and meaningful information, e.g. intent to capture, where '(!' is wanted).

So you just add the flag globally (as an option) and use the (! ...) syntax for captures. Anyone looking at your grammar would probably guess that you flipped the default from the way you use yypmatch in semantic actions, or they would discover the flag.

It is the same story with --case-inverted: strings 'xyz' and "xyz" flip their case-insensitivity, so you need to know about the flag to know what they mean. And it's the same story with regexp libraries and BRE/ERE syntax in POSIX: you need to know the options in order to interpret the syntax.
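For example (with made-up keyword literals):

'if'  // case-insensitive by default: matches if, If, iF, IF; case-sensitive under --case-inverted
"if"  // case-sensitive by default; case-insensitive under --case-inverted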

I cannot make non-capturing the default behavior, since we already have --posix-captures where the default is capturing, and changing that would break backwards compatibility. Nor can I make the default different for --posix-captures and --leftmost-captures.

As for the ‘payload’…

I understand your use case, but you could just as well add a stag to differentiate between the alternatives:

hex = '0x'[0-9]+;
bin = '0b'[0-1]+;
// oct, float, commaFormated, ...
num = @h hex | @b bin | ...;

num { return h ? HEX : b ? BIN : ...; }

Or use a capture around hex and bin and inspect the yypmatch entries in the same way as tags (it would be slightly less efficient than stags, because it would add the pairing tag at the end of each capture, which is not needed in this example).

However, if possible (and in this example it is), then it is much better to use distinct rules for alternatives.
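For example, with your hex/bin definitions, distinct rules would look like this (a sketch):

hex { return HEX; }
bin { return BIN; }

Each alternative becomes its own final state in the DFA, and no tags are needed to tell them apart.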

Using tags in your example may be slightly less elegant, but it is a very special case, and functionally tags do just as well. Note also that implementing payloads would be really difficult compared to the added value for most users. It would require changes throughout the whole parser (new syntax), all intermediate representations (a new regexp construct, changes on NFA and DFA transitions) and finally codegen. Users would also need to learn the new syntax and understand how it can be used. And the added value is a bit of syntactic sugar, which does not really justify the maintenance cost of all the added complexity. I hope that clarifies why I don't support the payloads idea.

I am not saying don't add the invert option. I am saying that in practice, 'if' the default is non-capture (which I think makes the most sense, i.e. it makes for the most normal, compact and readable grammar description, e.g. only using (! to mark the specific captures), I would never flip that switch. And if someone did flip that switch, unlike case sensitivity, the results are completely different: the stuff being captured is suddenly the portions between what was being captured in the defined grammar, which I see little utility for. Case inversion, on the other hand, is very handy: you can define an entire grammar as case sensitive and then toggle it with no change to the grammar, ending up with a very usable grammar that is simply more tolerant, without having to change the grammar itself; i.e. it would do exactly what it did before, and more.

And one subtle point: since the current default does an unmarked capture of the entire span, flipping the switch either has to ignore the inversion and continue to do so, OR honor the flag and do no capture unless explicitly told to. Not insurmountable; we just need to define what the behavior should be. I like the default as you have it, but adding one more explicit (! for each rule would work too, if need be.

If you can’t make the default non-capture, then yes I would like to have the toggle to flip it to be so.

Hope that clarifies the point.

As for the payload, your ideas to use a stag and/or the yypmatch capture are interesting, but I would like to confirm a few things, as I see some potential issues…
For the stag idea…

  1. Do I need to create unique tag names for every breadcrumb/payload I want to leave, or only unique within a given parse path? When/how do the stags get initialized? If there are more than just a few of these, OR if they are all initialized all the time, is that wasteful both at processing time and at runtime? Plus the time to evaluate all of the tag variables in the path. My gut says this is sub-optimal from a performance/space standpoint.

For the yypmatch idea…
  2. Since everything that would have a payload would already have captures anyway, I think the price has already been paid, so 'maybe' the only extra processing is in the rule to figure out what was actually filled, which would be required for the stag idea too, but all the items to compare would already be in yypmatch? (Also, I think I remember seeing a mix of either NULLs or spans that had the same start/end tag indicating empty captures. Are both expected?)

And the other option I mentioned is to make redundant rules that differ only by the specific capture type (e.g. one rule for hex, one for oct, one for …). Since this forces each to have a unique rule/action pair, what is the implication for processing time and runtime/space? Since I am anticipating a large grammar, I am a bit sensitive to the little things that add up when scaling, especially at runtime.

If the runtime performance/space cost is negligible for the redundant rules, that may be my best option lacking actual payload functionality. But if you say it would have substantial waste, then my guess is that the yypmatch idea 'may' be a fallback, as the extra performance overhead is at least limited to the number of captures in the parse path, correct?

Tell me where you think my performance/space guesses are wrong.
So, if my understanding is correct, then I think you can see the payload idea is not really syntactic sugar, but a way to actually get smaller and faster parsers. (But I understand that it is indeed effort, so I totally understand if you don't want to implement such a thing in the mainstream re2c.)

And to confirm, other than the effort/maintenance, the idea is indeed possible, correct?

P.S. Speaking of performance/space, here are a couple of random observations…

  1. Since many (if not all) of the captures start at the start of the input string, there may be an optimization of initializing yyt1 once at the top of the processor, instead of at the start of each capture state that represents the start of the input… for large grammars, this can make a big difference in space. (In hand-crafted parsers I have made in the past, this trick worked very well.)
  2. Not related to captures: this may be known to you already, but it appears that different grammar definitions that produce the exact same result can have different processing-time stats, sometimes substantially…

e.g.
P+ OR P{1,}
vs
P P*

The top two take about the same time to process, whereas the last form takes about twice as long to process.

Are there other known cases where different forms of grammar syntax (that do the same thing) have a processing-time pro/con?

Looking at some more generated code while trying to find the optimal solution, I notice you don't merge 'identical' action states... This would be a nice optimization for saving a lot of space: in some examples almost a third of the code is redundant, i.e. the setting of yynmatch and yypmatch, and the code for the actual action (which are all the same). Since I thought the system already did this optimization, 'payloads' were a way to avoid making the final action different. (And I can see that is why you thought it was just syntactic sugar, rather than a way to reduce the program size 'substantially'.) So, the penalty for enumerating all the rule differences is indeed pretty severe. Hope that gives some food for thought.

I have tried all the options and they all have some size/space penalty. HOWEVER, I think I have a way that, if you are going to fix the redundant-code issue, would 'not' require changes all the way through the code stack and 'doesn't' change the syntax. First let me show you a small example of the redundant code, so you know what I mean…

Prior to every user action, there can be a list of yynmatch and yypmatch assignments… oftentimes these are redundant (and with lots of captures they can be long)...

yy#:
yynmatch = 10;
yypmatch[0] = yyt1;
yypmatch[1] = YYCURSOR;
...
yypmatch[8] = yyt4;
yypmatch[9] = yyt5;
{ return 1; }

Note: this exact pattern may be repeated 'many' times in the resulting source file. Of course, for space saving, all the exact matches can be merged. Likewise, if all the yy settings are the same, even 'some' not-exact user actions can be optimized by 'lifting out' the starting difference, leaving a potentially 'common/complex' user action… e.g.

yy#:
yynmatch = 10;
yypmatch[0] = yyt1;
yypmatch[1] = YYCURSOR;
...
yypmatch[8] = yyt4;
yypmatch[9] = yyt5;
{ uniqueCode=1; return 1; }

BECOMES

uniqueCode=1;
goto yy###; // a new common state that contains all the redundant code

yy###: // this code is shared by potentially 'many' states
yynmatch = 10;
yypmatch[0] = yyt1;
yypmatch[1] = YYCURSOR;
...
yypmatch[8] = yyt4;
yypmatch[9] = yyt5;
{ return 1; }

This would give you a general-purpose optimization… If you are not comfortable with a totally general optimization, the 'unique code' could be limited by convention, e.g. only the first statements that match a specific pattern (e.g. PAYLOAD[^;];) are lifted out; OR better yet, maybe the list of initial assignment statements up until any complex or branching statements like if, switch, etc…

Let me know what you think

Do I need to create unique tag names for every breadcrumb/payload I want to leave, or only unique within a given parse path?

Tag names should be unique within a rule.

When/how do the stags get initialized? If there are more than just a few of these, OR if they are all initialized all the time, is that wasteful both at processing time and at runtime?

For a given rule, if it matches, all tags in it are guaranteed to be initialized (whether to an offset/pointer in the input, or to some default value). Tags of non-matching rules may not be initialized (but they also cannot be used in the semantic action of this rule).

Plus the time to evaluate all of the tag variables in the path. My gut says this is sub-optimal from a performance/space standpoint.

It would be exactly the same with payloads. There is no way around the overhead of nondeterminism. It depends on the grammar: in some cases it is possible to have just one tag assignment, in other cases multiple assignments are needed. Also, re2c performs quite a few tag optimizations on top of the TDFA model.

Since everything that would have a payload would already have captures anyway, I think the price has already been paid, so 'maybe' the only extra processing is in the rule to figure out what was actually filled, which would be required for the stag idea too, but all the items to compare would already be in yypmatch?

Yes, if you already have a tag there, you can just use the tag to set the payload. But this could also easily be done by the user in the code of the semantic action (and more generally, with any other logic based on tag inspection). However, for re2c to add new syntax for payloads and pull it through all the intermediate representations to the codegen would be a lot of changes in the code.

(Also, I think I remember seeing a mix of either NULLs or spans that had the same start/end tag indicating empty captures. Are both expected?)

I'm not sure I understand. A tag can be NULL or non-NULL (a pointer into the input). Both values may or may not be expected depending on the regexp and the location of the tag in it (some tags can never be NULL). For a pair of tags that represent a capture: if they are both NULL, the capture didn't match; if they are both non-NULL, the capture did match (and if they are equal, the capture matched an empty substring).
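In code, the check might look like this (a sketch, assuming the capture's tags land in yypmatch[2] and yypmatch[3]):

if (yypmatch[2] == NULL) {
    // the capture did not participate in the match (yypmatch[3] is NULL as well)
} else if (yypmatch[2] == yypmatch[3]) {
    // the capture matched an empty substring
} else {
    // the capture matched the substring [yypmatch[2], yypmatch[3])
}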

And the other option I mentioned is to make redundant rules that differ only by the specific capture type (e.g. one rule for hex, one for oct, one for …). Since this forces each to have a unique rule/action pair, what is the implication for processing time and runtime/space? Since I am anticipating a large grammar, I am a bit sensitive to the little things that add up when scaling, especially at runtime.

Having separate rules is completely fine; I expect that it will be more efficient than using tags within one rule to differentiate between alternatives in the semantic action of that rule. Separate rules don't add any work on transitions; they simply add separate final states to the DFA. Tags should be used for more subtle things, when some information is needed within one rule.

So, if my understanding is correct, then I think you can see the payload idea is not really syntactic sugar, but a way to actually get smaller and faster parsers. (But I understand that it is indeed effort, so I totally understand if you don't want to implement such a thing in the mainstream re2c.)

I don't think it would lead to smaller and faster parsers. If I did think this way, I would be prepared to go the extra mile, as re2c's main focus is on generating fast code. But payloads must use the same mathematical model as tags (or just piggyback on existing tags), and therefore they would be less efficient than multiple rules.

And to confirm, other than the effort/maintenance, the idea is indeed possible, correct?

Yes, it would be possible to implement on top of tags (or in the same way as tags).

since many (if not all) of the captures start at the start of the input string, there may be an optimization of initializing yyt1 once at the top of the processor, instead of at the start of each capture state that represents the start of the input…

re2c already does compiler-like optimizations on tags, which includes:

  • hoisting tags from transitions to states
  • minimizing the number of tag variables (e.g. representing multiple tags with the same variable)
  • removing dead tag assignments
  • eliminating tags that are within a fixed distance of another tag and can be computed based on the other tag by adding a fixed offset
  • allocation of tag variables similar to register allocation with copy coalescing
  • DFA minimization

and some more. You can compile with --no-optimize-tags to see how much worse it would be without those optimizations.
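For example, to compare the two (made-up file names):

re2c --posix-captures -o lexer.c lexer.re
re2c --posix-captures --no-optimize-tags -o lexer-noopt.c lexer.re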

Not related to captures: this may be known to you already, but it appears that different grammar definitions that produce the exact same result can have different processing-time stats, sometimes substantially…

e.g.
P+ OR P{1,}
vs
P P*

The top two take about the same time to process, whereas the last form takes about twice as long to process.

Are there other known cases where different forms of grammar syntax (that do the same thing) have a processing-time pro/con?

re2c constructs an NFA and then converts it to a DFA. It does some primitive simplifications on the regexp before constructing the NFA, but it does not attempt to bring a regexp to some normal form (in the presence of tags such transformations could ruin user-defined semantics of the regexp).

Looking at some more generated code while trying to find the optimal solution, I notice you don't merge 'identical' action states...

re2c does this since version 0.16 (DFA minimization). It also does this in the presence of tags. If you observe two states that are not merged, then it means that they are not equivalent (can lead to different outcomes).

in some examples almost a third of the code is redundant ...

Please share this example (create a new issue and attach your source grammar and the generated code there as files, not inline comments, and also explain which identical states are duplicated).

So, the penalty for enumerating all the rule differences is indeed pretty severe. Hope that gives some food for thought.

No, I don't understand from a generic description. I need an example to see what you mean, and a modification of this example that would demonstrate how payloads would solve the problem.

Also, note that when optimizing for code size, you should rely on the stripped binary size, not the generated source code which may sometimes be misleading.

Prior to every user action, there can be a list of yynmatch and yypmatch assignments… oftentimes these are redundant (and with lots of captures they can be long)...

re2c has no way of knowing which captures/tags the user is going to use in the semantic action, and which are not needed (it has no idea what happens in the semantic action, so it assumes that all tags and captures are needed). The user should only use capturing parentheses/tags if they need them. Also note that the compiler, contrary to re2c, is able to eliminate unnecessary assignments quite easily.

Of course, for space saving, all the exact matches can be merged.

No, they cannot, as the final states are all different (they have different semantic actions) and cannot be merged or have a common part. Even if you moved those assignments to a function, it would either get inlined, or it would introduce function call overhead, which is much worse for performance.

Likewise, if all the yy settings are the same, even 'some' not-exact user actions can be optimized by 'lifting out' the starting difference, leaving a potentially 'common/complex' user action… e.g.

As I explained above, re2c performs this optimization to the extent possible.

I think I see the source of confusion.

a new common state that contains all the redundant code

This is the problem. Once you jump to that common state, how would you transition out of it back to the correct states? You would either have to re-match the last character, or save the state some other way and dispatch on it, which is costly.

In certain cases merging similar states can indeed be done (with re-matching of the character when transitioning out of the common state). This is called "tunneling" or a "tunnel automaton", and re2c already does it where possible. In the presence of tags this is much harder to do, since the states also differ in tagged transitions, making them even less alike.

FYI, I wrote the postprocessor... I didn't do anything exotic to maximize the reduction, but just doing the straightforward thing as described, it reduced the lines of code by 561 and reduced the size of the executable by over 24K on the first file I ran it on (of course results will differ based on the grammar). It turned out to be a single pass, since the first instance of the code is simply jumped to when/if a redundancy is found. So, it never adds any overhead (other than a label) to the code.

P.S. And adding in the yyt1 optimization saved over 25K.

Can you attach an example of a .re file and the generated source code with / without the optimization, so that I can understand what your optimization does? Note that you can remove all the user-defined code and leave just the re2c section if you feel uncomfortable posting the original example here. I'm definitely interested to see what you've done, but it's hard to understand from a textual description without seeing the real-world code.

re.zip
In the zip file are before.re, before.c, after.c, and beforeAfter.dif
This is a small artificial example that shows 3 unique rules, two of which are reduced (i.e. they reuse the common code of the first)... There is nothing special about the word PAYLOAD (or the label prefix _com; I used it for commonCode, just so I could easily find it when writing the postprocessor)... Both can be anything, but since we have been using that term, I put it in the example. Hope that helps.

Thanks! I will have a look.

@dtp555-1212 On a side note, I noticed that you define tag variables manually (yyt1, yyt2 and others) --- don't do that, use the /*!stags:re2c ... */ directive to autogenerate them.
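For example (the format string here is just one possible choice):

/*!stags:re2c format = "const char *@@;"; */

re2c substitutes each tag variable name for @@, generating declarations like const char *yyt1; const char *yyt2; and so on.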

Looking at your example re.zip, I don't understand a few things:

  • In this example, you need neither tags/captures nor payloads to make that PAYLOAD assignment in the semantic action. Upon matching a rule, that information is readily available to you (e.g. the rule hex bin hex matches HEX, followed by BIN, followed by HEX).

  • If some parts of your rule were conditional, e.g. hex (bin | dec) hex, then you would need a tag to know whether bin or dec has matched. That could be easily achieved with a single stag, e.g. hex (@b bin | dec) hex { PAYLOAD[1] = b ? BIN : DEC; ... }.

  • If you use captures instead of stags, then a) it is inevitable that you will have two yypmatch entries per capture, and b) those assignments before semantic actions would be impossible for the compiler to eliminate (it is a write to the yypmatch memory, not to a local variable as in the case of stags).

  • Finally, in your example after.c you saved space by completely removing yypmatch assignments. re2c cannot do that, because it doesn't know whether user-defined code will use them (or only some of them, or none of them). re2c assumes that if the user specified a capture, then the user needs the corresponding yypmatch values.

So unless you have a more complex real-world example, I don't see why you need tags or payloads at all. And if you need them, I still think that you could do just as well with stags.

Please understand that I'm not trying to disprove your optimization, I'm trying to understand your real-world use case and to see if it can be generalized to the common case. You can help by providing a real-world example.

I would probably not use the 'global' computed-goto, since the optimal balance of space/speed is based on the density of the goto table, rather than being one-size-fits-all.

There is a configuration re2c:cgoto:threshold if you'd like to experiment with the balance (the default value is 9).
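For example (a sketch; the value here is arbitrary):

/*!re2c
    re2c:flags:computed-gotos = 1;  // same as the -g command-line option
    re2c:cgoto:threshold = 4;       // controls when nested ifs are replaced with jump tables (default 9)
*/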

It would be preferred if the 'accept case computed goto' had its own switch to enable it.

Or perhaps, a separate tunable threshold for accept states.

Again, the same request: please share your real-world example, or let me know why you cannot do that. Note that it doesn't have to compile as long as re2c can process it --- I only need the regexp grammar and semantic actions to see how you are using tags/captures or payloads, how many rules you have, etc.

I'm afraid a larger example would only make things more confusing for you, so hopefully the above will click this time.

No, what is really confusing me is a small contrived example that doesn't show a real-world use case. :)

So please attach the large example. If I get confused, it is not such a big problem.

FYI, I have tried as many combinations of switches and thresholds as I can think of to get the 'accept if-then-else chain' to change to computed gotos, with no success. (I can see the computed gotos working for other things in the file.) So, either there is a combination that I missed, or there is an omission for that specific case. (FYI, there are over 800 lines of code in that section, with a large binary search over more than 180 accept choices.) If this worked, it would be 'very' fast, since it would go directly rather than having to do the binary search.

I think I have a clue on the accept case described above... I think I have discerned a pattern: switch statements that contain 'no' extra code besides goto statements are converted. The ones where even only 1 or 2 cases have an additional assignment (e.g. yyt2 = NULL;) do not get converted; that assignment seems to be the thing that disables the use of computed gotos for the entire switch-statement block. I 'think' that is why the accept if/then/else case doesn't get optimized.

FYI, since I didn't hear back on whether you were interested in doing the optimizations, I went ahead and wrote a compiler that optimizes the output as described above (as well as generating better assembly than generic compilers for this particular use case). So far I am seeing about a 20% speed-up. Since I have a solution, I don't need re2c to optimize its output. Thanks again.

@dtp555-1212 There are a few changes I plan to make: add --invert-captures option and change syntax of non-capturing parentheses to (! ...), working on that now. Hopefully when it lands, it won't be too disruptive for you.

As for your optimization, I have asked you multiple times in this thread to provide a real-world example. It is essential that re2c development is driven by real-world use cases (you can find many of them in the test suite --- they usually make the most interesting and complex test cases). This development rule is not re2c-specific; e.g. the Linux kernel won't accept any major code changes without compelling use cases.

Anyway, I am glad you were able to optimize your code this way or that.

Option --invert-captures and the corresponding configuration re2c:invert-captures have been added, and the syntax for (non-)capturing parentheses is now (! ...).
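A sketch of the final syntax, based on the discussion above (not a tested example):

/*!re2c
    re2c:invert-captures = 1;
    num = ([0-9]+);                  // with inverted captures, plain ( ) does not capture
    (!num) ' ' (!num) { return 1; }  // (! ...) now marks the explicit captures
*/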

I have been doing some exploration, starting from first principles, of potential optimizations. I am creating minimized epsilon-NFAs from the regexps... I avoid going to a DFA, which avoids the time and 'space' of that conversion. Early signs are very promising. I realize that would be major surgery for your code, so I don't expect you would do that anytime soon (if ever), but I thought I would share the idea with you and the community anyway.

Yes, I think re2c will always stay DFA-based. It's intended for small or medium-sized lexers (e.g. for a programming-language grammar), where an optimized and compiled direct-code DFA is much faster than an NFA, and not for super-huge grammars that make DFAs impractical.

There is an experimental library, libre2c, with various NFA- and DFA-based algorithms, which shares the core codebase with re2c; it can be used as a regexp library.
