A simple fork of OCaml parser with support for error recovery
The recovery generator is implemented in menhir-recover
.
It is a binary that consumes a cmly
file and produces an OCaml source module that contains recovery heuristics suitable to use by the Merlin_recovery
module.
cmly
file are produced by menhir when you pass the --cmly
flag. It is a marshalled representation of the grammar and the automaton suitable for post-processing a grammar. It is a bit like a ppx
that post-processes an OCaml AST, although here no rewriting happens. Rather, information can be programmatically extracted from a parser.
The recovery runtime engine works with the --table
engine of Menhir with --inspection
extension. The recovery generator needs to postprocess the cmly
file.
Therefore, the menhir
rule in dune should be given the right flags:
(menhir
(flags --inspection --table --cmly)
(modules parser))
A table-based parser requires the menhirLib
library, so add (libraries menhirLib)
to your dune target.
After that, we need to derive a recover.ml
module associated to the parser:
(rule
(targets recover.ml)
(deps parser.cmly)
(action
(with-stdout-to
%{targets}
(run ../menhir-recover/main.exe parser.cmly))))
That's all on the build system side.
By default, Menhir provides a simple ocamlyacc
-style parsing function for each entry-point:
val entry_point : (Lexing.lexbuf -> token) -> Lexing.lexbuf -> sem_value
It takes a lexing function, a lexing buffer, and returns a semantic value (generally an AST) if successful, or raises an exception.
This is sufficient for simple parsing tasks but quite inflexible:
- either to whole task succeeds, and you get access to the AST, or it fails and all intermediate computations are lost
- parsing in synchronous: you cannot suspend parsing while waiting for input, parse inside Lwt, etc.
- ...
In other word, Menhir grabs control. To lift these limitations it also provides an incremental interface that gives back control to the programmer.
The checkpoint
type represents the state of the parser. When you have a token, you can offer
it to the parser and you will get a new checkpoint that represents success, failure or another intermediate parser state that needs more tokens. A checkpoint
is pure (up to user-procided semantic actions), so if you keep track of previous checkpoints you can backtrack and explore different parsing futures.
A simple parsing loop can be implemented like that:
module I = Parser.MenhirInterpreter
type 'a positioned = 'a * Lexing.position * Lexing.position
let rec loop tokens cp =
match cp with
| I.Accepted x -> Some x
| I.Rejected -> None
| I.Shifting (_, _, _) | I.AboutToReduce (_, _) | I.HandlingError _ ->
loop tokens (I.resume cp)
| I.InputNeeded _ ->
match tokens with
| [] -> assert false
| token :: tokens -> loop tokens (I.offer cp token)
val loop : Parser.token positioned list -> 'a I.checkpoint -> 'a option
It takes an explicit list of tokens, offer all of them to the parser, and returns Some ast
if parsing succeeds. This is a minimal introduction to the incremental interface of Menhir and does not enable us to do anything new yet.
For each entry point, Menhir generates a function that will give you an initial checkpoint:
val Parser.Incremental.entry_point :
Lexing.position -> sem_value I.checkpoint
To give a similar interface to the normal and recoverable parser, and to build the recovery on top of normal parsing, we will use the following abstraction:
module type PARSE_INTF = sig
type 'a parser
val initial :
(Lexing.position -> 'a I.checkpoint) ->
Lexing.position -> 'a parser
type 'a step =
| Intermediate of 'a parser
| Success of 'a
| Error
val step : 'a parser -> P.token positioned -> 'a step
end
This just gives a bit more structure to the code from previous section, and consumes token one at a time. When input is wrong, we just get Error
. Instead we would like get access to the internal state of the parser to see what went wrong and what can be done about it.
So we will extend the interface with two functions to break the abstraction: one to get access to the checkpoint, and one to produce a new parser from a custom checkpoint.
module type RECOVER_INTF = sig
include PARSE_INTF
val recover : 'a I.checkpoint -> 'a parser
val recovery_env : 'a parser -> 'a I.env
end
And here is an implementation of normal parsing that satisfy this interface:
module Parser : RECOVER_INTF =
struct
type 'a parser = 'a I.checkpoint
let initial entrypoint position = entrypoint position
type 'a step =
| Intermediate of 'a parser
| Success of 'a
| Error
let rec normalize = function
| I.InputNeeded _ as cp -> Intermediate cp
| I.Accepted x -> Success x
| I.HandlingError _ | I.Rejected -> Error
| I.Shifting (_, _, _) | I.AboutToReduce (_, _) as cp ->
normalize (I.resume cp)
let step cp token =
normalize (I.offer cp token)
end
On top of the previous abstraction we will build the recoverable parser.
First, we will make use of the meta-data produced by menhir-recover
executable (it uses the Recover
module produced by the dune rule above):
module R =
Merlin_recovery.Make(I)
(struct
include Recover
let default_value loc x =
Default.default_loc := loc;
default_value x
let guide _ = false
end)
Merlin_recovery
implements indentation guided recovery heuristic on top of the meta-data.
The two default_value
and guide
extra functions lets you customize the recovery. You can ignore them for now, we just provided reasonable defaults here.
And now the robust parser:
module RP : PARSE_INTF =
struct
type 'a parser =
| Correct of 'a P.parser
| Recovering of 'a R.candidates
let initial entry_point position =
Correct (P.initial entry_point position)
type 'a step =
| Intermediate of 'a parser
| Success of 'a
| Error
let step parser token =
match parser with
| Correct parser ->
begin match P.step parser token with
| P.Intermediate parser -> Intermediate (Correct parser)
| P.Success x -> Success x
| P.Error ->
let env = P.recovery_env parser in
Intermediate (Recovering (R.generate env))
end
| Recovering candidates ->
begin match R.attempt candidates token with
| `Ok (cp, _) -> Intermediate (Correct (P.recover cp))
| `Accept x -> Success x
| `Fail ->
begin match token with
| Raw_parser.EOF, _, _ ->
begin match candidates.final with
| None -> Error
| Some x -> Success x
end
| _ -> Intermediate parser
end
end
end
It keeps using the normal parser as long as it works, and switch to the recovery heuristic on failure, hiding the fact that the normal parser returned an Error
. If recovery fails, it propagates the Error
, otherwise it returns to normal parsing.
A recovery friendly parser should not raise exceptions and should provide recovery annotations for the a few critical grammatical constructions.
The recovery algorithm take over the normal parser to continue parsing. An error situation occurs when there are no transition on the next token.
However recovery is blind to exceptions that are raised by semantic actions. Failing actions could lead the recovery to a dead-end and make it fail just like the normal parser. Strictly speaking, it is possible to instruct the recovery to avoid raising semantic actions, however the rule of thumb is to avoid exceptions in semantic actions.
Although less important, it is good practice to avoid using the error
token. The error
mechanism of Menhir becomes a bit redundant in presence of parser instrumentation and there are better ways to achieve similar results (ask me if necessary :)).
To recover, the algorithm will synthesize artificial symbols (terminals
/tokens
and non-terminals
) to complete the current parsing context (for instance, provide a )
to finish a mis-bracketed expression, or an identifier after an unterminated foo.
).
Non-parameterized symbols are easy to synthesized (e.g. )
require no extra input), but parameterized terminals and non-terminals require more information to be produced.
After %token<t> FOO
declaration, it is possible to add an annotation [@recover.expr ocaml_expression]
where ocaml_expression
has type t
.
The FOO
token will now be considered by the recovery algorithm when it tries to find a way to complete an unterminated sentence. The expression will be evaluated every time the recovery algorithm needs to produce a FOO
token.
For instance, in the annotated OCaml parser, it is sometime necessary to produce identifiers:
%token <string> LIDENT
[@recover.expr "<invalid-lident>"]
recover.expr
annotations can also be put on non-terminals.
For instance, when an expression is missing, it is better to provide a placeholder expression than to let the recovery synthesize an arbitrary expression (even though ()
is an harmless one that is likely to be generated).
To this end, the following has been added to the prelude of the OCaml parser:
%[@recover.prelude
open Parsetree
open Ast_helper
...
let default_loc = ref Location.none
let default_expr () =
let id = Location.mkloc "merlin.hole" !default_loc in
Exp.mk ~loc:!default_loc (Pexp_extension (id, PStr []))
...
]
and finally in the main grammar:
expr [@recover.expr default_expr ()]:
...
The prelude is a set of definitions shared by all recovery expressions. Now, default_expr ()
will be evaluated whenever an expression is needed.
If the grammar does not provide enough annotations, recovery will not always be possible.
For instance if we remove the recover.expr
annotation on LIDENT
in the grammar, the post-processor will output a huge error log that look like:
Not enough annotation to recover from state 116:
optlabel ::= QUESTION . LIDENT COLON
Not enough annotation to recover from state 127:
signature_item ::= TYPE . (ext = ext) list(attribute) NONREC (params = type_parameters) type_longident PLUSEQ (priv = private_flag) (xs = reversed_bar_llist(extension_constructor_declaration)) list(post_item_attribute)
| TYPE . (ext = ext) list(attribute) (params = type_parameters) type_longident PLUSEQ (priv = private_flag) (xs = reversed_bar_llist(extension_constructor_declaration)) list(post_item_attribute)
generic_type_declaration(nonrec_flag,type_kind) ::= TYPE . (ext = ext) list(attribute) NONREC (params = type_parameters) LIDENT (kind_priv_manifest = type_kind) (xs = reversed_llist(preceded(CONSTRAINT,constrain))) list(post_item_attribute)
...
If we look at all the states that are stuck, we see that they always involve an identifier. In our case, it is quite easy to see that we should add annotations on LIDENT
, but in general you should look first for parameterized terminals to annotate.