Add syntax files to simplify porting re2c to new languages.

Question

skvadrik opened this issue a year ago

Syntax files should be config files that describe a language backend via a set of configurations. When generating code, re2c would map various codegen concepts to the descriptions provided by the syntax file. This way a new language can be added easily by supplying a syntax file (by the user or by re2c developers --- existing backends should be described via syntax files, distributed with the re2c source code and as part of a re2c installation).

The main difficulty is to decide on a minimal set of configurations that are orthogonal and capable of describing different languages, so that we don't have to add new ad-hoc configurations for each new language. Once this is decided, the codegen subsystem should be modified to support syntax files, and existing backends should be rewritten using syntax files (before adding new ones).

Related bugs/commits:

JS: #420
OCaml: #449
Dlang: 742ff81
HLS: #294

Perry E. Metzger · Answer 1 · Sat Aug 26 2023 04:21:28 GMT+0800 (China Standard Time)

FWIW, this would be an amazingly cool feature.

Perry E. Metzger · Answer 2 · Tue Oct 10 2023 01:34:00 GMT+0800 (China Standard Time)

Just wanted to say, again, this would be an amazingly cool feature.

Ulya Trofimovich · Answer 3 · Tue Oct 10 2023 01:49:09 GMT+0800 (China Standard Time)

I started some experimental work on this. I'm constrained on time at the moment, so it's not moving fast, but it's my next most important goal for re2c.

Ulya Trofimovich · Answer 4 · Tue Dec 19 2023 19:07:51 GMT+0800 (China Standard Time)

An update. I've been doing some experimental work on the syntax-files branch, and I'm now at the point where all three existing language backends (C/C++, Go, Rust) can be expressed via config files: https://github.com/skvadrik/re2c/tree/syntax-files/include/syntax. The language option has been removed from re2c, meaning that the codebase is now free of any language-specific checks: 80bb7db (any language-specific behaviour is based on syntax configs). For the three existing backends, config files are built into re2c so that there's no need to pass any syntax configs (the --lang option works as before, as do the re2go and re2rust binaries), so this is all backwards compatible.

Ulya Trofimovich · Answer 5 · Tue Dec 19 2023 19:18:01 GMT+0800 (China Standard Time)

Next thing for me is to write a syntax config for D (very close to C/C++) and resurrect test cases from #431.

Feedback on the current DSL I used in syntax files is welcome. It's not documented yet, but it shouldn't be very hard to understand. There are different groups of configurations:

list configurations
single-word configurations
code configurations (those starting with code:)

The first two groups are simple, and the last group is basically templates for language constructs that are used in different parts of codegen. The DSL allows conditionals (condition ? if-yes-part : if-no-part), list generators that loop over a specific list variable [var: loop-body], strings and simple variables. All variables are in essence callbacks to re2c that get substituted with code fragments. Configurations in syntax files respect re2c options and configurations inside of the re2c blocks in the input file.
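
To make the description of code configurations more concrete, here is a minimal sketch of how such templates could be expanded. It is not re2c's implementation and does not use the real syntax-file notation: templates are modelled as nested Python lists and tuples, and the node kinds ("var", "if", "for") and the variable names (name, params, stmt, has_return, retval) are all hypothetical. The dictionary env stands in for the callbacks that re2c would substitute with code fragments.

```python
def render(node, env):
    """Expand a template node against env (a stand-in for re2c callbacks)."""
    if isinstance(node, str):            # literal string
        return node
    if isinstance(node, list):           # sequence of nodes
        return "".join(render(n, env) for n in node)
    kind = node[0]
    if kind == "var":                    # simple variable substitution
        return str(env[node[1]])
    if kind == "if":                     # models (cond ? yes : no)
        _, cond, yes, no = node
        return render(yes if env.get(cond) else no, env)
    if kind == "for":                    # models [var: loop-body]
        _, name, body = node
        # Inside the loop body, the list variable is bound to the current item.
        return "".join(render(body, {**env, name: item}) for item in env[name])
    raise ValueError(f"unknown node kind: {kind!r}")


# Hypothetical template for a Python function definition.
template = [
    "def ", ("var", "name"), "(", ("var", "params"), "):\n",
    ("for", "stmt", ["    ", ("var", "stmt"), "\n"]),
    ("if", "has_return", ["    return ", ("var", "retval"), "\n"], ""),
]

env = {
    "name": "yy1",
    "params": "str, cur",
    "stmt": ["yych = str[cur]", "cur += 1"],
    "has_return": True,
    "retval": "False",
}

generated = render(template, env)
print(generated)
```

The point of the sketch is only that a small set of constructs (literals, variables, a conditional and a list generator) is enough to describe a language construct without language-specific code in the generator.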

Ulya Trofimovich · Answer 6 · Tue Dec 19 2023 19:18:59 GMT+0800 (China Standard Time)

This is all experimental work, all configurations are subject to change while they are on syntax-files branch.

Perry E. Metzger · Answer 7 · Wed Dec 20 2023 02:16:01 GMT+0800 (China Standard Time)

Handling a couple of more diverse languages (OCaml, Python) might be an interesting test here. I'm not sure I entirely understand how the init file works btw.

Ulya Trofimovich · Answer 8 · Mon Dec 25 2023 15:37:06 GMT+0800 (China Standard Time)

fbb0975 adds support for D language (in the form of a syntax file).

Ulya Trofimovich · Answer 9 · Thu Mar 21 2024 21:16:33 GMT+0800 (China Standard Time)

Dlang support was added in d492026.

OCaml support was added in c1ccefa (see discussion in #449).

Now, as suggested by @pmetzger, I started looking at Python. A basic example (with a custom syntax file, not shared here):

/*!re2c
    re2c:define:YYFN = ["lex;", "str;", "cur;"];
    re2c:define:YYPEEK = "str[cur]";
    re2c:define:YYSKIP = "cur += 1";
    re2c:yyfill:enable = 0;

    number = [1-9][0-9]*;

    number { return True }
    *      { return False }
*/

def main():
    str = "1234\x00"
    if not lex(str, 0):
        raise RuntimeError("error")

if __name__ == "__main__":
    main()

The generated code looks like this:

# Generated by re2c

def yy0(str, cur):
        yych = str[cur]
        cur += 1
        if yych <= '0':
                return yy1(str, cur)
        elif yych <= '9':
                return yy2(str, cur)
        else:
                return yy1(str, cur)


def yy1(str, cur):
        return False

def yy2(str, cur):
        yych = str[cur]
        if yych <= '/':
                return yy3(str, cur)
        elif yych <= '9':
                cur += 1
                return yy2(str, cur)
        else:
                return yy3(str, cur)


def yy3(str, cur):
        return True

def lex(str, cur):
        return yy0(str, cur)



def main():
    str = "1234\x00"
    if not lex(str, 0):
        raise RuntimeError("error")

if __name__ == "__main__":
    main()

Does it look reasonable? I plan to use the recursive functions code model by default, but the loop/switch model should work as well. Which one is preferable? Do function calls add much overhead in Python? I'll do some benchmarks myself later, but I'm curious to hear what others think.
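
For comparison with the recursive functions output above, this is a hand-written sketch of what the same automaton might look like in a loop/switch style. It is not re2c output; the state numbering simply mirrors the yy0..yy3 functions, and the parameter is named s instead of str to avoid shadowing the built-in.

```python
def lex(s, cur):
    """Hand-written loop/switch equivalent of yy0..yy3 (not generated by re2c)."""
    state = 0
    while True:
        if state == 0:
            yych = s[cur]
            cur += 1
            if yych <= '0':
                state = 1
            elif yych <= '9':
                state = 2
            else:
                state = 1
        elif state == 1:
            return False
        elif state == 2:
            yych = s[cur]
            if yych <= '/':
                state = 3
            elif yych <= '9':
                cur += 1
                state = 2
            else:
                state = 3
        elif state == 3:
            return True


print(lex("1234\x00", 0))
print(lex("0\x00", 0))
```

As in the original example, the input is assumed to end with a "\x00" sentinel, so the loop always reaches a final state before running off the end of the string.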

Perry E. Metzger · Answer 10 · Fri Mar 22 2024 21:33:09 GMT+0800 (China Standard Time)

Python is interpreted, and doesn't have much of an optimizer. I suspect loop switch will be faster, but I don't know for sure. Benchmarks will be needed.

Another thing about python: it has a // operator, which potentially might be mistaken for a comment. Generally, I think that it might be good if the comment character for a particular language could be defined rather than using the default.

Oh, and lastly: python has optional type annotations. Those might be helpful in the generated code for those using mypy.
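
A rough sketch of the kind of micro-benchmark suggested here, using the standard timeit module. Both lexers are hand-written simplifications of the number example (not re2c output), and they carry type annotations of the sort that mypy could check. The function names, the input size and the repeat count are arbitrary choices for illustration, and no particular timing result is implied.

```python
import timeit


def yy0(s: str, cur: int) -> bool:
    yych = s[cur]
    if "1" <= yych <= "9":
        return yy2(s, cur + 1)
    return False


def yy2(s: str, cur: int) -> bool:
    yych = s[cur]
    if "0" <= yych <= "9":
        return yy2(s, cur + 1)
    return True


def lex_recursive(s: str, cur: int = 0) -> bool:
    """One function call per consumed character."""
    return yy0(s, cur)


def lex_loop(s: str, cur: int = 0) -> bool:
    """Same automaton as a single loop with a state variable."""
    state = 0
    while True:
        yych = s[cur]
        if state == 0:
            if not "1" <= yych <= "9":
                return False
            state = 2
        elif not "0" <= yych <= "9":
            return True
        cur += 1


# Short enough to stay well below the default recursion limit.
data = "1" * 200 + "\x00"
assert lex_recursive(data) == lex_loop(data)

for fn in (lex_recursive, lex_loop):
    seconds = timeit.timeit(lambda: fn(data), number=500)
    print(f"{fn.__name__}: {seconds:.4f} s for 500 runs")
```

Results from a toy input like this say little about real lexers, so any conclusion would need to be confirmed on generated code and realistic inputs.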

Ulya Trofimovich · Answer 11 · Sat Mar 23 2024 20:33:44 GMT+0800 (China Standard Time)

Huh, I got RecursionError: maximum recursion depth exceeded on one example (not even a big one). In compiled languages I enforced tail recursion (either in the form of annotation, or optimization level), but in Python I think nothing can be done, the only way is to go with loop/switch. Am I missing something?
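
The failure can be reproduced with a few lines. This is a simplified, hand-written state function (not the example that actually failed): every consumed character costs one stack frame, so an input longer than the interpreter's recursion limit overflows even though each call is in tail position.

```python
import sys


def yy2(s, cur):
    if "0" <= s[cur] <= "9":
        # A tail call, but Python still pushes a new frame for it.
        return yy2(s, cur + 1)
    return True


def lex(s):
    return yy2(s, 0)


short_input = "1" * 10 + "\x00"
long_input = "1" * (sys.getrecursionlimit() + 100) + "\x00"

print(lex(short_input))

try:
    lex(long_input)
    overflowed = False
except RecursionError:
    overflowed = True
print("overflowed:", overflowed)
```

Raising the limit with sys.setrecursionlimit only moves the threshold, so it does not help a lexer whose input length is unbounded.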

Perry E. Metzger · Answer 12 · Mon Mar 25 2024 09:17:39 GMT+0800 (China Standard Time)

@skvadrik Oh! Python does not optimize tail calls. I had not noticed how you were doing it. If you want to use recursion for this in Python, you need a trampoline function so that you don't exceed the recursion limit. I guess using match (the equivalent of switch) is pretty much what you would need if you don't use a trampoline.

Ulya Trofimovich · Answer 13 · Mon Mar 25 2024 15:26:45 GMT+0800 (China Standard Time)

For reference, https://github.com/0x65/trampoline describes how to do trampolines with python.
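
A minimal hand-rolled sketch of the trampoline idea, for readers who have not seen it. It does not use the API of the linked library: trampoline here is a hypothetical helper, and the state functions are rewritten by hand to return a zero-argument callable instead of calling the next state directly.

```python
def trampoline(step, *args):
    """Run step(*args), then keep calling the returned thunks until a value appears."""
    result = step(*args)
    while callable(result):
        result = result()
    return result


def yy0(s, cur):
    yych = s[cur]
    cur += 1
    if "1" <= yych <= "9":
        return lambda: yy2(s, cur)   # defer the call instead of recursing
    return lambda: yy1(s, cur)


def yy1(s, cur):
    return False


def yy2(s, cur):
    yych = s[cur]
    if "0" <= yych <= "9":
        return lambda: yy2(s, cur + 1)
    return lambda: yy3(s, cur)


def yy3(s, cur):
    return True


def lex(s, cur=0):
    return trampoline(yy0, s, cur)


# Far longer than the default recursion limit, but the stack stays flat.
print(lex("1" * 5000 + "\x00"))
print(lex("0\x00"))
```

The stack depth stays constant because each state returns to the trampoline loop before the next one runs. The cost is one extra closure allocation and call per transition, which is one reason a loop/switch model may still be the simpler choice.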

Ulya Trofimovich · Answer 14 · Thu Apr 04 2024 01:08:25 GMT+0800 (China Standard Time)

Python support was added in 95b916d (based on loop/switch mode).

Update: commit changed after force-push: 63c775a

Perry E. Metzger · Answer 15 · Thu Apr 04 2024 02:43:08 GMT+0800 (China Standard Time)

Just looked at the python example, it seems pretty reasonable.

Ulya Trofimovich · Answer 16 · Mon Apr 15 2024 19:36:39 GMT+0800 (China Standard Time)

Vlang support was added in 73853c5.

Perry E. Metzger · Answer 17 · Thu May 16 2024 21:45:21 GMT+0800 (China Standard Time)

So I find myself wanting to use the Python support. I'm an adult and understand that all the syntax etc. for syntax files may change in the future. Could a suitable version of re2c get tagged (perhaps not officially released) for people who want to experiment with real code?

Ulya Trofimovich · Answer 18 · Thu May 16 2024 22:03:58 GMT+0800 (China Standard Time)

> So I find myself wanting to use the Python support. I'm an adult and understand that all the syntax etc. for syntax files may change in the future. Could a suitable version of re2c get tagged (perhaps not officially released) for people who want to experiment with real code?

Use this: https://github.com/skvadrik/re2c/releases/tag/python-experimental. I previously rebased git history so that all python-specific work goes before it, and I shouldn't break git history up to this commit with my future changes.

Ulya Trofimovich · Answer 19 · Thu May 16 2024 22:06:59 GMT+0800 (China Standard Time)

@pmetzger It will be very helpful if you try it out and report any issues. :)

Ulya Trofimovich · Answer 20 · Mon May 27 2024 18:29:59 GMT+0800 (China Standard Time)

Haskell support was added in 4e78ef8. The configurations have to be a bit more verbose, as even simple operations have to update lexer state and propagate it further down the program (see https://github.com/skvadrik/re2c/tree/syntax-files/examples/haskell). I'm thinking that this can benefit from language-specific default API (so far it only exists for the C/C++ backend, but the definitions are now all in syntax files, so each syntax file may provide its own default API). There are monadic and pure styles for Haskell.

Ulya Trofimovich · Answer 21 · Wed Jul 03 2024 21:43:05 GMT+0800 (China Standard Time)

Java support was added in e2facbf. Update: rebased as 2dd0de3.

Unlike other languages, there is no good default implementation for YYPEEK in Java as it has very different syntax for strings and arrays. Therefore YYPEEK is left for the user to define even in default and record APIs.

Perry E. Metzger · Answer 22 · Fri Jul 05 2024 08:59:29 GMT+0800 (China Standard Time)

> Java support was added in e2facbf.

🔥

Ulya Trofimovich · Answer 23 · Mon Jul 08 2024 16:33:36 GMT+0800 (China Standard Time)

JS support was added in 74ace08.

Ulya Trofimovich · Answer 24 · Tue Jul 16 2024 05:20:40 GMT+0800 (China Standard Time)

Zig support was added in 5cd48a8.

Ulya Trofimovich · Answer 25 · Tue Jul 16 2024 05:26:43 GMT+0800 (China Standard Time)

My further plan is to focus on polishing syntax file API (and who knows - maybe eventually even releasing it :D). If you have other interesting languages in mind, please mention them in this thread - the API is not frozen yet and it's possible to change it. That said, for the last three languages (Java, JS, Zig) no changes were needed, which means it should be expressive enough (at least for C-like languages).

Perry E. Metzger · Answer 26 · Tue Jul 16 2024 06:30:56 GMT+0800 (China Standard Time)

My main issue remains the "comment syntax" for the re2c blocks, but I will confess I haven't dived in deeply enough to things like the API. Maybe I should.

One option is to do a release soon but make the support for languages using the syntax files "experimental" to get more widespread feedback.

Ulya Trofimovich · Answer 27 · Tue Jul 16 2024 14:14:50 GMT+0800 (China Standard Time)

@pmetzger I rebased syntax-files branch and pulled it into master. Sorry if I broke your workflow. From now on just use master - I will keep merging syntax-files into it.

Ulya Trofimovich · Answer 28 · Tue Jul 16 2024 14:27:10 GMT+0800 (China Standard Time)

> My main issue remains the "comment syntax" for the re2c blocks, but I will confess I haven't dived in deeply enough to things like the API. Maybe I should.

I will give it more thought.

> One option is to do a release soon but make the support for languages using the syntax files "experimental" to get more widespread feedback.

It's not the re2c way to break backward compatibility if it can be avoided, and I don't think we have a big enough community to get timely feedback.

Perry E. Metzger · Answer 29 · Tue Jul 16 2024 22:30:25 GMT+0800 (China Standard Time)

So on the comments: there are going to end up being languages where // or /* is valid syntax. (For example, in Python, // is the integer division operator.) It feels safer to be able to use comments that make sense in the context of a given language.

Ulya Trofimovich · Answer 30 · Tue Jul 16 2024 23:05:35 GMT+0800 (China Standard Time)

> So on the comments: there are going to end up being languages where // or /* is valid syntax. (For example, in Python, // is the integer division operator.) It feels safer to be able to use comments that make sense in the context of a given language.

That's a good point about syntax clash: I don't think it's a problem for the opening comment /*!re2c, as it is too specific, but the closing comment */ may be a problem.

A language-specific lexer will be hard to implement. At the moment the lexer is written in re2c, and I'd like to keep it this way both for dogfooding and performance reasons.

Also, not all languages have multiline comments.

Instead of trying to use language-specific syntax, we can do what lex and bison do: use syntax that fits equally badly into any language, namely %{ and %}. These are already partially supported by re2c, so it will be natural to extend them, it will be familiar to users (at least to some extent) and it shouldn't be that hard to implement. What do you think?

What I'm more worried about are single quotes (some languages allow them as parts of identifiers, labels, etc.). Syntax files already have some configurations that tell re2c whether to expect single quotes, backtick-quoted strings, etc.

Perry E. Metzger · Answer 31 · Tue Jul 16 2024 23:17:58 GMT+0800 (China Standard Time)

> Instead of trying to use language-specific syntax, we can do what lex and bison do: use syntax that fits equally badly into any language, namely %{ and %}. These are already partially supported by re2c, so it will be natural to extend them, it will be familiar to users (at least to some extent) and it shouldn't be that hard to implement. What do you think?

I think that's certainly an option, especially if that can be shifted to an alternative in the unlikely event that a specific language is using that specific bracket pair for real syntax.

> What I'm more worried about are single quotes (some languages allow them as parts of identifiers, labels, etc.).

ML-descended languages use them to identify type variables. Lisp uses them to identify unevaluated forms.

Perry E. Metzger · Answer 32 · Tue Jul 16 2024 23:20:33 GMT+0800 (China Standard Time)

It occurs to me that, with very high likelihood, nothing is ever going to use %RE2C{ and %RE2C}

Ulya Trofimovich · Answer 33 · Tue Jul 16 2024 23:29:21 GMT+0800 (China Standard Time)

> I think that's certainly an option, especially if that can be shifted to an alternative in the unlikely event that a specific language is using that specific bracket pair for real syntax.

Exactly, that's the way it already works. We just need to extend %{ and %} to cover directives like /*!stags:re2c and prefixes like /*!local:re2c. Also at the moment it requires -F/--flex-syntax option - we'll need to drop that requirement.

> ML-descended languages use them to identify type variables. Lisp uses them to identify unevaluated forms.

Good, let's keep a list of all such cases and gradually add support for them in the lexer (it already knows about some). So far there's one boolean-valued configuration standalone_single_quotes in syntax files that turns on a bit of lexer logic that tries to parse what comes after the single quote as a label (that can be extended to look for an identifier, etc.).

Perry E. Metzger · Answer 34 · Tue Jul 16 2024 23:39:39 GMT+0800 (China Standard Time)

Lisp will do both things like '(a b c) and 'a (both values that are treated as unevaluated constants), while ML will do both 'a (type variable) and sometimes 'a' (character constant).
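
To illustrate why these cases are awkward for a language-agnostic lexer, here is a small sketch that classifies what follows a single quote. It is not the logic behind standalone_single_quotes in re2c; the two regular expressions and the category names are assumptions chosen to cover only the examples mentioned in this thread.

```python
import re

# A quoted single character or escape sequence, e.g. 'a' or '\n'.
CHAR_LITERAL = re.compile(r"'(?:\\.|[^'\\])'")
# A quote followed by an identifier, e.g. the ML type variable 'a.
QUOTED_NAME = re.compile(r"'[A-Za-z_][A-Za-z0-9_]*")


def classify_quote(text, pos):
    """Classify the single quote at text[pos] as 'char', 'name' or 'other'."""
    if CHAR_LITERAL.match(text, pos):
        return "char"
    if QUOTED_NAME.match(text, pos):
        return "name"
    return "other"


print(classify_quote("'a'", 0))        # character constant
print(classify_quote("'a list", 0))    # ML type variable or Lisp quoted symbol
print(classify_quote("'(a b c)", 0))   # Lisp quoted form
```

The order of the checks matters: 'a' must be tried as a character constant before 'a is accepted as a name, otherwise the closing quote would be misread as the start of a new token.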

Ulya Trofimovich · Answer 35 · Thu Jul 18 2024 22:51:39 GMT+0800 (China Standard Time)

I added flex-style start/end markers %{, %{rules, %{stags, etc. in 2d37b92.

See python and haskell examples, and I will port OCaml next (maybe Zig as well, as it has no C-style multiline comments /* ... */ - the rest of the languages are fine).

Perry E. Metzger · Answer 36 · Fri Jul 19 2024 00:59:46 GMT+0800 (China Standard Time)

Nice! I'm curious why you're allowing arbitrary text before the %{?

Ulya Trofimovich · Answer 37 · Fri Jul 19 2024 02:29:38 GMT+0800 (China Standard Time)

> Nice! I'm curious why you're allowing arbitrary text before the %{?

I think it's useful (it saves space) to allow starting a block in the middle of a line, e.g.:

    while True: %{
        // ... re2c code
    %}
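
The effect of allowing text before the opening marker can be sketched with a naive splitter that separates host code from block contents. This is only an illustration, not how re2c's own lexer works: split_blocks is a hypothetical helper, it handles only the plain %{ and %} pair (not variants such as %{rules), and it makes no attempt to skip markers that appear inside string literals or comments of the host language.

```python
def split_blocks(source):
    """Return a list of ("host", text) and ("re2c", text) chunks in source order."""
    chunks = []
    pos = 0
    while True:
        start = source.find("%{", pos)
        if start == -1:
            # No more blocks: the rest is host-language code.
            chunks.append(("host", source[pos:]))
            return chunks
        end = source.find("%}", start + 2)
        if end == -1:
            raise ValueError("unterminated %{ block")
        chunks.append(("host", source[pos:start]))
        chunks.append(("re2c", source[start + 2:end]))
        pos = end + 2


source = "while True: %{\n    // ... re2c code\n%}\nprint('done')\n"
for kind, text in split_blocks(source):
    print(kind, repr(text))
```

Here the host text before the marker ("while True: ") is passed through unchanged, which is what makes the mid-line form possible.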

Perry E. Metzger · Answer 38 · Fri Jul 19 2024 04:16:15 GMT+0800 (China Standard Time)

Makes sense. I also see (given that this is using an re2c regex) why it would be hard to have several different flavors of braces etc. I almost wonder if adding one more character (something like %!{ or whatever) might be good to make a later conflict with a given programming language even more unlikely.

Ulya Trofimovich · Answer 39 · Sun Jul 21 2024 14:18:00 GMT+0800 (China Standard Time)

> I almost wonder if adding one more character (something like %!{ or whatever) might be good to make a later conflict with a given programming language even more unlikely.

I think %{ should be fine, given that there is another option (/*!re2c).