alecthomas / participle

A parser library for Go

Streamline long-term maintenance of lexer/parser code generators

alecthomas opened this issue

I have some concerns with the long-term maintenance of both the lexer and parser code generators, which we should solve (or at least not make worse) before getting these changes into master.

Problem 1: divergence

My first concern is that as the runtime code evolves, it will diverge from the code generators. This has already happened with the lexer, where the generated code does not handle lookahead. We will need tests that somehow detect this divergence (a sketch of one such cross-check follows the checklist below). For the parser, enumerating the grammar nodes and ensuring the generator handles each of them might work.

  • Conformance tests for lexer
  • Conformance tests for parser
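As a rough sketch of what the lexer half could look like: feed the same input to the runtime (reflection) lexer and to a generated lexer, and diff the token streams. Here runtimeDef and generatedDef are assumed to be two lexer.Definition values for the same grammar built elsewhere, and the import path assumes participle v2:

	// conformance_test.go - cross-checks a generated lexer against the
	// runtime (reflection) lexer on identical input.
	package conformance

	import (
		"reflect"
		"strings"
		"testing"

		"github.com/alecthomas/participle/v2/lexer"
	)

	// lexAll runs a lexer.Definition over input and collects all tokens.
	func lexAll(t *testing.T, def lexer.Definition, input string) []lexer.Token {
		t.Helper()
		lex, err := def.Lex("conformance", strings.NewReader(input))
		if err != nil {
			t.Fatal(err)
		}
		tokens, err := lexer.ConsumeAll(lex)
		if err != nil {
			t.Fatal(err)
		}
		return tokens
	}

	func TestLexerConformance(t *testing.T) {
		input := `"double" 'single' ident123` // should grow to exercise every feature
		want := lexAll(t, runtimeDef, input)  // runtimeDef: assumed reflection-based definition
		got := lexAll(t, generatedDef, input) // generatedDef: assumed generated definition
		if !reflect.DeepEqual(want, got) {
			t.Errorf("lexers diverge:\nruntime:   %#v\ngenerated: %#v", want, got)
		}
	}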

Problem 2: ergonomics

There are also some ergonomic issues with generating the code: specifically, having to keep an implementation of the runtime code lying around as the "source of truth".

Proposal: serialisable form for both lexers and parse trees

  • Serialisable lexer
  • Create participle gen lexer
  • Serialisable parser
  • Create participle gen parser

One solution is to decouple the code generators completely from the runtime code. The lexer and parser would be extended to be JSON-marshallable, and the code generators would become standalone binaries that ingest this serialised form and output code. This might be non-trivial for the parser because it is tightly coupled to reflection - TBD.
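For the lexer half, a minimal sketch of the export step, assuming the stateful definition is JSON-marshallable (which is what a standalone generator binary would ingest; the rules here are illustrative):

	// exportlexer.go - emits the serialised lexer definition to stdout, to
	// be piped into a standalone code generator.
	package main

	import (
		"encoding/json"
		"os"

		"github.com/alecthomas/participle/v2/lexer"
	)

	var def = lexer.MustStateful(lexer.Rules{
		"Root": {
			{Name: "Ident", Pattern: `\w+`},
			{Name: "Whitespace", Pattern: `\s+`},
		},
	})

	func main() {
		enc := json.NewEncoder(os.Stdout)
		enc.SetIndent("", "  ")
		if err := enc.Encode(def); err != nil {
			panic(err)
		}
	}

Something like `go run ./exportlexer > lexer.json` would then produce the serialised form for the generator to consume.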

Another option would be to make standalone binaries that parse the Go code directly, making the Go AST and compile-time type system the intermediate form. This would be much more complicated though.

This has been implemented for the lexer in bfe7c69.

@alecthomas - This is a nice separation to have!

I've tested the update in bfe7c69, and it looks like there is one more issue in the code generated from the serialized JSON definition of the lexer. Imagine that the lexer codegen was invoked using something like:

~/go/bin/participle gen lexer --name GeneratedLexer lexer < internal/lexer/lexer.json | gofmt > internal/lexer/lexer_generated.go

And imagine that the lexer definition had included some rules that looked like:

	"AllStrings": {
		{Name: `DoubleQuoteStringBegin`, Pattern: `"`, Action: lexer.Push(`DoubleQuoteString`)},
		{Name: `SingleQuoteStringBegin`, Pattern: `'`, Action: lexer.Push(`SingleQuoteString`)},
		{Name: `BacktickStringBegin`, Pattern: "`", Action: lexer.Push(`BacktickString`)},
	},

The codegen template that emits the push to a new lexer state is meant to reference the state struct whose name includes the lexer name given on the command line (e.g. GeneratedLexer in the example above):
bfe7c69#diff-5119a11cbfb79273b05b2942dfb351ab4bf736ca7cf91d773f360e75bcb07eb4R173

However, the template instead builds the struct name from the matched rule's name rather than from the top-level lexer name. For example, it generates:

	switch state.name {
	case "AllStrings":
		if match := matchDoubleQuoteStringBegin(l.s, l.p); match[1] != 0 {
			sym = -22
			groups = match[:]
			l.states = append(l.states, lexerDoubleQuoteStringBeginState{name: "DoubleQuoteString"})
		} else if match := matchSingleQuoteStringBegin(l.s, l.p); match[1] != 0 {
			sym = -23
			groups = match[:]
			l.states = append(l.states, lexerSingleQuoteStringBeginState{name: "SingleQuoteString"})
		} else if match := matchBacktickStringBegin(l.s, l.p); match[1] != 0 {
			sym = -24
			groups = match[:]
			l.states = append(l.states, lexerBacktickStringBeginState{name: "BacktickString"})
		}

In all cases in this example, the proper name of the struct would have been lexerGeneratedLexerState.
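That is, the first case above should have been generated as:

	l.states = append(l.states, lexerGeneratedLexerState{name: "DoubleQuoteString"})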

After changing this manually, the generated code compiles as before.

I've also been debugging why the generated lexer isn't producing results 100% identical to the reflection lexer's. It seems it may not handle the special word-boundary lookahead \b quite the same, and in another case it may not match as greedily in a regexp with a complex set of nested non-capturing groups. I hope to have some simple rules that highlight the gaps in the codegen lexer shortly.
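For illustration (these are not the actual failing rules), a word-boundary keyword rule is the kind of case where the two implementations can disagree, because \b is zero-width and the generated matcher has to reproduce the runtime regexp's lookaround semantics exactly:

	var def = lexer.MustStateful(lexer.Rules{
		"Root": {
			{Name: "Keyword", Pattern: `\bselect\b`}, // zero-width \b on both sides
			{Name: "Ident", Pattern: `\w+`},
			{Name: "Whitespace", Pattern: `\s+`},
		},
	})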

@alecthomas - here's a one-liner fix so the generated lexer code with multiple states will now compile: #268

I'm working on some tests to add that will highlight the issues I've found in the generated lexer, and possibly fixes for them if they're not too complicated.

Magic, thanks!

Re: your last point - yes, the generated lexer has already diverged in functionality. I need to come up with a way for tests to pick up this divergence in future.

@alecthomas - Check out #269, which is my attempt to have the current lexer test suite also cover generated lexers, and to make it easy to add new test cases that exercise both the reflection lexer and the generated lexer without much duplicated code. The generated lexer is so much faster than the reflection lexer that I'm hoping this will make it easier to maintain at least a subset of features in the generated lexer. Participle has unparalleled developer ergonomics, and if we can get the generated lexer & parser production-ready, it will have leading performance as well!

While I very much appreciate the effort, and I definitely don't want to discourage you from further contributions, I would prefer something a bit simpler to maintain, and a bit more targeted. I'm imagining something like a single "specification" lexer that checks against a spec, and cross-checks a generated lexer against the runtime lexer. I'm happy to flesh this out further if you are interested in contributing it.

If I understand what you're envisioning correctly, it would be a single set of lexer rules that tries to exercise all of the lexer's features, plus unit tests that use that one generated lexer to cover all parts of the lexer? That seems possible, but to me harder to maintain than the proposed approach, as it would require a different set of unit tests from what is currently there (the current unit tests each start from a different lexer with one specific element it's testing).

Perhaps I'm missing how a single generated test lexer might be easier to maintain? Right now, running the Makefile immediately regenerates all of the JSON files and all of the generated test lexers, and the generated lexers match the current unit tests, so everything stays in sync as the unit tests change. A single specification lexer that tried to cover all parts of the lexer seems like it would duplicate a lot of the current test suite's work, and would itself have to be maintained as the lexer evolves.

Aiming for easy maintenance makes a lot of sense, to keep the time burden of any change low. I'm interested in helping and open to ideas on how to help in that direction!

Right now the lexer tests are pretty ad hoc - it's basically arbitrary whether a particular feature is exercised or not. The new lexer would be a more systematic approach to this, and once complete, we could remove overlapping tests from the existing suite. I'd envisage a single test input that exercises every feature of the lexer, though that might be a bit hard to debug, so it might not be a great idea. It would need some experimentation.

I think it would also be good if the proposed test system generated the codegenned lexer on the fly for each test run, probably to a temporary directory, so that no manual steps are required when updating the lexer/tests. This would also function as a kind of integration test for the code generation itself.
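A sketch of that regeneration step, shelling out to the participle CLI with the JSON definition on stdin (the package name and paths are illustrative; the flags mirror the invocation quoted earlier in this thread):

	package conformance

	import (
		"bytes"
		"os"
		"os/exec"
		"path/filepath"
		"testing"
	)

	// generateLexer writes freshly generated lexer code into a temporary
	// directory for this test run, so no manual codegen step is required.
	func generateLexer(t *testing.T, jsonDef []byte) string {
		t.Helper()
		path := filepath.Join(t.TempDir(), "lexer_gen.go")
		out, err := os.Create(path)
		if err != nil {
			t.Fatal(err)
		}
		defer out.Close()
		cmd := exec.Command("participle", "gen", "lexer", "conformance", "--name", "Generated")
		cmd.Stdin = bytes.NewReader(jsonDef)
		cmd.Stdout = out
		cmd.Stderr = os.Stderr
		if err := cmd.Run(); err != nil {
			t.Fatalf("codegen failed: %v", err)
		}
		return path
	}

The generated file would still need to be compiled into the test binary somehow, e.g. by writing it into the package directory behind a build tag, which is where the build-tag approach below comes in.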

I'll have a bit of a poke around tonight.

This is basically what I had in mind. Interestingly, these tests shouldn't attempt to capture validation errors (i.e. malformed stateful lexers); those should remain in the existing test suite.

Actually, I think these would more correctly be called conformance tests.

@alecthomas - Your approach of regenerating the lexer as part of the test is certainly nice. And the use of build tags to keep the main package build from breaking if the generated lexer is missing or broken is also very nice.
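For reference, that kind of guard is just a build constraint on the generated file and its tests (the tag name here is illustrative):

	//go:build generated
	// +build generated

With the tag on the generated file, a normal `go build ./...` skips it entirely, so a missing or broken generated lexer can't break the main package.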

@alecthomas - Is there anything I might be able to do to help with the conformance tests and generated lexer? In the draft PR it looks like you were able to make some substantial progress on the gen lexer supporting backreferences. If that were finished up (it looks like fairly extensive modifications), I could add some tests for \b and try to fix that, and possibly some things I've found around grouping.

Hey, sorry for the delay in responding. I did make a bit of progress on backrefs; as it turns out, I had prepared for them in the generated lexer but for some reason never completed the work. I didn't have time to finish support completely, so I've disabled the test for now in order to get that PR in.

Now that it's in, any further improvements you'd like to make would be most welcome!

I would also really like to come up with some automated way to catch these kinds of issues in the future. That is, new features being added to the runtime lexer that aren't also added to the generated lexer. However, I can't think of a reliable way to do that, so for now we'll just have to try and remember to add new features to the conformance tests.

@alecthomas - see what you think of the approach in #274. It just adds a really basic test as a first example, to see whether you're OK with the direction.

Hopefully the community will add extensively to the conformance tests over time, and as the generated lexer (and the future gen parser) are incorporated into other projects, their unit tests will catch problems early too; any problems found can then come back to the conformance suite as specific tests, etc. My project has hundreds of test cases over its own lexer/parser, which are catching some specific issues I'll be adding to the conformance suite (and trying to fix if possible), and hopefully many more projects will do the same as the gen lexer becomes more ready.

So it feels a little bit like a chicken-and-egg problem, but the gen lexer is getting good ergonomics now, so hopefully it will see more adoption, better coverage, and so on!

I've updated #274 to fix 3 different bugs in the generated lexer and added conformance tests that cover each of these cases. See what you think.

@alecthomas can you give a more specific example of what adopting the idea of serializing the parser grammar nodes would look like? The reflection-related stuff would indeed be tricky to serialize. Trying to serialize every bit of information about the reflected types the generator might possibly need is probably not only too fragile, but may even be impossible in some cases (custom.parseFn).

I'm not sure such a large effort would actually improve usability and convenience anyway. Serializing the grammar to JSON would still be a very manual step requiring the user to code something. True, it would be something very common (JSON serialization and outputting it to a file), but I feel like adding some tooling to easily generate the code and store it to a file by calling some generic function (or a method on the built grammar) would be less work for the user.

> Another option would be to make standalone binaries that parse the Go code directly, making the Go AST and compile-time type system the intermediate form. This would be much more complicated though.

Having now experimented with this some more (go/types and golang.org/x/tools/go/packages specifically), I find this more and more appealing. I have not fully thought through how this would work in practice.

Related, my idealistic perfect solution would be that there wouldn't be any runtime dependencies on Participle at all. Whether that's feasible or not, I'm not sure, but with a generated lexer and generated parser, the runtime dependencies would be fairly minimal. Probably only the PeekingLexer, Token, and Error types, at that point.
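As a sketch of what that minimal surface could look like from the generated side (the package, AST type, and function here are all hypothetical):

	package generated

	import "github.com/alecthomas/participle/v2/lexer"

	// AST is the user's grammar root (hypothetical).
	type AST struct{}

	// Parse is what a fully code-generated parser's entry point might look
	// like: it touches only the PeekingLexer and Token/Error types, with no
	// dependency on the reflection machinery.
	func Parse(lex *lexer.PeekingLexer) (*AST, error) {
		ast := &AST{}
		// ... generated matching code over lex.Peek()/lex.Next() would go here ...
		return ast, nil
	}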

I've played with packages too, and I'm less optimistic that it can be done. Maybe it can, but I'm not sure the user experience for generating code would be significantly better.

Regarding the other point, I also think it would be cool to minimize the dependencies of generated code on the participle package. On the other hand, I also quite like the approach I took, where enabling the generated code is just a matter of passing an option to Build and then being able to use the existing Parse* methods, which already take care of a lot of stuff the generated code doesn't.
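In other words, roughly this wiring, with loudly hypothetical names (UseGeneratedParser is not the actual option from the PR):

	// Enabling the generated code is one extra option at Build time; the
	// existing Parse* methods keep handling lexing, options, and errors.
	parser, err := participle.Build(&AST{},
		participle.Lexer(def),
		participle.UseGeneratedParser(generatedParseFn), // hypothetical option
	)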

> Regarding the other point, I also think it would be cool to minimize the dependencies of generated code on the participle package. On the other hand, I also quite like the approach I took, where enabling the generated code is just a matter of passing an option to Build and then being able to use the existing Parse* methods, which already take care of a lot of stuff the generated code doesn't.

I think let's just go with this approach. We can always refine it or alter approaches later.