alecthomas / participle

A parser library for Go

Feature Request: Custom Rules

meln5674 opened this issue

Background: I am working on an experimental language where, for the most part, simple regular expression rules are capable of lexing the source, but after certain tokens, the next token requires complex logic to identify before returning to the regular expression rules.

Problem: As far as I can tell, this is impossible without defining my own lexer.Definition and lexer.Lexer instances from scratch, and it is not possible to re-use the existing functionality from lexer.StatefulDefinition, without simply copy-pasting it.

Proposed Solution: Extend StatefulDefinition's Rule (or make StatefulDefinition a subset of a more comprehensive API) to accept anything that is sufficiently "regex-like": that is, anything that can take the parent state's name and captured groups, as well as the input data, and then either terminate the lex, report no match, or report the end point of the next token and the action to take.
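To make the shape of a "regex-like" rule concrete, here is a minimal, self-contained sketch. It does not use participle at all: Matcher, RegexMatcher, FuncMatcher and balancedParens are hypothetical names invented for illustration, and the "action to take" part of the proposal is left out for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// Matcher is a hypothetical "regex-like" rule. It is NOT part of participle.
// It reports the end offset of the next token (ok=true), no match (ok=false),
// or a non-nil error to terminate the lex.
type Matcher interface {
	Match(state string, groups []string, input string) (end int, ok bool, err error)
}

// RegexMatcher adapts an ordinary regular expression to the Matcher interface.
type RegexMatcher struct{ re *regexp.Regexp }

// NewRegexMatcher anchors the pattern so it only matches at the current position.
func NewRegexMatcher(pattern string) RegexMatcher {
	return RegexMatcher{re: regexp.MustCompile(`^(?:` + pattern + `)`)}
}

func (m RegexMatcher) Match(_ string, _ []string, input string) (int, bool, error) {
	loc := m.re.FindStringIndex(input)
	if loc == nil || loc[1] == 0 { // treat empty matches as "no match"
		return 0, false, nil
	}
	return loc[1], true, nil
}

// FuncMatcher lets an arbitrary function act as a rule.
type FuncMatcher func(state string, groups []string, input string) (int, bool, error)

func (f FuncMatcher) Match(state string, groups []string, input string) (int, bool, error) {
	return f(state, groups, input)
}

// balancedParens matches a balanced parenthesised group, which a regular
// expression cannot describe in general.
func balancedParens(_ string, _ []string, input string) (int, bool, error) {
	if !strings.HasPrefix(input, "(") {
		return 0, false, nil
	}
	depth := 0
	for i, r := range input {
		switch r {
		case '(':
			depth++
		case ')':
			depth--
			if depth == 0 {
				return i + 1, true, nil
			}
		}
	}
	return 0, false, errors.New("unbalanced parentheses")
}

func main() {
	rules := []Matcher{NewRegexMatcher(`[a-z]+`), FuncMatcher(balancedParens)}
	input := "foo(a(b)c)bar"
	for len(input) > 0 {
		matched := false
		for _, rule := range rules {
			end, ok, err := rule.Match("Root", nil, input)
			if err != nil {
				fmt.Println("lex error:", err)
				return
			}
			if ok {
				fmt.Printf("%q\n", input[:end])
				input = input[end:]
				matched = true
				break
			}
		}
		if !matched {
			fmt.Println("no rule matched")
			return
		}
	}
}
```

Run as written, this splits the input into "foo", "(a(b)c)" and "bar". The point is only that regex rules and custom rules can sit behind one interface; how that would map onto the existing stateful lexer is a separate question.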

I have a very rudimentary proof-of-concept here, which is not backwards-compatible, breaks all of the tests, and isn't particularly well-written, but nonetheless works.

Would you be interested in working together to implement this in a way consistent with the current API, or would you prefer I maintain my own fork?

I'd like to see an example of some of the syntax you're referring to first.

Unfortunately, I can't give concrete examples, as the project isn't open source (yet). Without giving too much away, consider a heredoc-like syntax where A) the inner language is not expressible as a regular grammar, and B) the heredoc terminator is only accepted if it is located at certain points within the sub-language; otherwise, it is consumed as part of the sub-language, and there must be another terminator located elsewhere. As a result, once the heredoc starts, there has to be custom logic to figure out where it ends, and then to validate that what's in between is even allowable; if it is not, lexing (not parsing) terminates. If it weren't for point (B), a .* with a backreference could probably capture it, but not knowing whether that opaque string is valid means it can't be correctly classified as a token, and capturing too early may result in an invalid lex.
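As a stand-in for the kind of syntax described above (and not the actual language, which is not public), here is a small sketch of points A and B. It assumes a made-up inner language whose only structure is nested braces, and a terminator that only counts when the brace depth is zero. findHeredocEnd is a hypothetical helper, not part of participle.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// findHeredocEnd returns the offset at which the heredoc body ends, i.e. the
// offset of the first terminator that appears outside any braces. A terminator
// inside braces is consumed as ordinary content. An unmatched '}' or a missing
// terminator is a lex error.
func findHeredocEnd(input, terminator string) (int, error) {
	depth := 0
	for i := 0; i < len(input); i++ {
		if depth == 0 && strings.HasPrefix(input[i:], terminator) {
			return i, nil
		}
		switch input[i] {
		case '{':
			depth++
		case '}':
			if depth == 0 {
				return 0, fmt.Errorf("unexpected '}' at offset %d", i)
			}
			depth--
		}
	}
	return 0, errors.New("unterminated heredoc")
}

func main() {
	body := "a { END } b END rest"
	end, err := findHeredocEnd(body, "END")
	if err != nil {
		fmt.Println("lex error:", err)
		return
	}
	fmt.Printf("heredoc body: %q\n", body[:end])
}
```

The first END sits inside braces and is treated as content, so the body ends at the second one. A plain regular expression with a backreference would stop at the first END, which is the "capturing too early" problem.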

I'm not necessarily opposed to the stateful lexer being extensible, but I won't accept a backward-breaking change. From briefly looking at your code, I would suggest looking at extending Action to support your use case.

That said, without any concrete examples/tests showing use-cases, I won't accept it either.

Of course. Like I mentioned, this was a quick "What if?", and any actual PR I would submit would be backward compatible, with documentation, test coverage, and no regressions.

Given that none of the methods of Action are exported, I'm not sure I follow your suggestion, and even looking at the unexported method, I don't see a simple way to have it generate additional tokens, but perhaps I misunderstand. Are you suggesting exporting Action's method and modifying it to optionally return tokens as well as modify the state?

I'm proposing you extend the private Action interface to support your requirements, or add another optional interface similar to how RulesActions works. Then expose that functionality via a public function similar to the existing ones, such as Pop, etc.

Ah, and rules must also be serialisable to JSON.

Before you do anything, you should extract a representative example (obfuscated if necessary) and include it in this issue.

And perhaps an example of how you would use this proposed new functionality to lex it.

On the serialization note, is that just for diagnostic purposes, or does it need to be able to round-trip? My goal is to be able to inject an arbitrary function to execute, like in the linked fork, which obviously wouldn't be able to round-trip without having some sort of global lookup table to register functions to on initialization.

It needs to be able to round-trip, but I think for this case it could just return an error.
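For reference, the two options discussed here (a global lookup table versus simply returning an error) can be sketched together using only encoding/json. Everything below is hypothetical: Rule, MatchFunc, Register and roundTrip are illustrative names, not participle's types, and the JSON layout is invented.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// MatchFunc is a hypothetical custom matcher signature.
type MatchFunc func(input string) (end int, ok bool)

// registry is the global lookup table: functions are registered by name at
// initialization so that a serialized rule can refer to them.
var registry = map[string]MatchFunc{}

func Register(name string, fn MatchFunc) { registry[name] = fn }

// Rule is a stand-in for a lexer rule carrying an optional custom function.
type Rule struct {
	Name     string
	FuncName string // empty when fn was injected without registration
	fn       MatchFunc
}

type ruleJSON struct {
	Name string `json:"name"`
	Func string `json:"func,omitempty"`
}

// MarshalJSON returns an error for a function that has no registered name,
// since it could not be restored later.
func (r Rule) MarshalJSON() ([]byte, error) {
	if r.fn != nil && r.FuncName == "" {
		return nil, errors.New("rule " + r.Name + ": unregistered function cannot be serialized")
	}
	return json.Marshal(ruleJSON{Name: r.Name, Func: r.FuncName})
}

// UnmarshalJSON restores the function by looking its name up in the registry.
func (r *Rule) UnmarshalJSON(data []byte) error {
	var raw ruleJSON
	if err := json.Unmarshal(data, &raw); err != nil {
		return err
	}
	*r = Rule{Name: raw.Name, FuncName: raw.Func}
	if raw.Func == "" {
		return nil
	}
	fn, ok := registry[raw.Func]
	if !ok {
		return fmt.Errorf("rule %q: function %q is not registered", raw.Name, raw.Func)
	}
	r.fn = fn
	return nil
}

// roundTrip serializes a rule and reads it back.
func roundTrip(r Rule) (Rule, error) {
	var back Rule
	data, err := json.Marshal(r)
	if err != nil {
		return back, err
	}
	err = json.Unmarshal(data, &back)
	return back, err
}

// matchDigits is a sample custom matcher: it matches a run of leading digits.
func matchDigits(input string) (int, bool) {
	n := 0
	for n < len(input) && input[n] >= '0' && input[n] <= '9' {
		n++
	}
	return n, n > 0
}

func main() {
	Register("digits", matchDigits)
	if _, err := roundTrip(Rule{Name: "Number", FuncName: "digits", fn: matchDigits}); err != nil {
		fmt.Println("unexpected:", err)
	}
	_, err := roundTrip(Rule{Name: "Anon", fn: matchDigits})
	fmt.Println("anonymous function:", err)
}
```

A registered function survives the round trip by name, while an anonymous injected function fails at serialization time rather than silently producing a rule that cannot be restored.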