BranchTaken / Hemlock

Programming language

Convert scanner from DAG+recursion to DFA

jasone opened this issue · comments

Enhance the interpolated string scanner to emit *istring tokens for strings with embedded format specifiers. The *istring tokens should provide all specifier options in decoded form, as well as the string constant prefixing the format specifier.

Tasks:

  • Scan \% within interpolated strings as protected % codepoints
  • Convert DAG to DFA
  • #223
  • #224
  • Subsume token scanning into DFA
    • Dentation
    • Paren comments
    • Operators
    • Identifiers
    • codepoint
      • Prohibit raw tabs
    • string
      • Interpolated strings
        • Prohibit raw tabs
      • Raw strings
      • Documentation strings (né bar-margin strings)
    • real
    • Integers
  • Implement formatted string scanning

#111 (comment) details the *istring (né *fstring) token design:

Regarding tokenization of formatted strings, we're going to have to add some statefulness to the scanner, because there may be more than two components to each string. Furthermore, because formatted strings can nest, we need a stack of states. The following example shows how the tokens combine position ({left, inner, right}) with the kind of nested code ({width, precision, value}).

"...%*(^width^).*(^precision^)r(^value^)...%6.*(^precision^)r(^value2^)...%u(^value3^)..."
~~~~~~~~     ~~~~~~         ~~~~~     ~~~~~~~~~~~         ~~~~~      ~~~~~~~~~      ~~~~~~
lw           p              v         ip                  v          iv             r

Format string token sequences conform to paths through a deterministic finite automaton (DFA), with start/accept states and transitions as indicated.

  • (start) lw → {p, v}
  • (start) lp → {v}
  • (start) lv → {iw, ip, iv, r}
  • iw → {p, v}
  • ip → {v}
  • iv → {iw, ip, iv, r}
  • p → {v}
  • v → {iw, ip, iv, r}
  • r (accept)
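
Purely as an illustration (the names below are invented for this sketch, not taken from the scanner), the transitions can be transcribed into a small OCaml checker for token sequences:

type pos_tok = Lw | Lp | Lv | Iw | Ip | Iv | P | V | R

(* Successor sets, transcribed from the transitions above. *)
let succs = function
  | Lw | Iw -> [P; V]
  | Lp | Ip | P -> [V]
  | Lv | Iv | V -> [Iw; Ip; Iv; R]
  | R -> []

(* A sequence conforms iff it begins with a start token, each successive token
 * is in its predecessor's successor set, and it ends in the r accept state. *)
let conforms = function
  | [] -> false
  | tok :: _ as toks ->
    let rec fini = function
      | [R] -> true
      | tok :: (tok' :: _ as rest) -> List.mem tok' (succs tok) && fini rest
      | _ -> false
    in
    List.mem tok [Lw; Lp; Lv] && fini toks

For the example above, conforms [Lw; P; V; Ip; V; Iv; R] evaluates to true.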

Per #159 (comment), we're going to make some minor changes to format specifiers to make it easier to format diagnostic strings. For example, where one might have written "x=%u(^x^)", it will be possible to get the same result by writing "%u=(^x^)". Summary of design changes:

  • A "separator" may follow the type abbreviation, in which case the formatted string is prefixed with a stringified rendition of the value expression and the separator, i.e. "<stringified><sep><rep>". The separator must match [ ]*=[ ]*. [December 3, 2021] We're actually going to use [ ]*<infix_op>[ ]* for the separator, which will also allow e.g. ->.
  • A (^...^)-delimited formatter-generating expression of type t -> (>e:effect) -> Fmt.Formatter e >e-> Fmt.Formatter e, where t is the type of the value expression, may be specified in place of the type abbreviation. The formatter is applied to the value in similar fashion to formatter application for types with built-in support.
  • The f type abbreviation is obsoleted by the above, and will be removed from the design.
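
As a throwaway sketch of the original [ ]*=[ ]* separator rule (the generalized rule would swap the = test for an infix-operator scan; is_sep is an invented name, not part of the codebase):

(* True iff s consists of optional spaces, '=', then optional spaces. *)
let is_sep s =
  let n = String.length s in
  let rec skip i = if i < n && s.[i] = ' ' then skip (succ i) else i in
  let i = skip 0 in
  i < n && s.[i] = '=' && skip (succ i) = n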

We motivated this design by considering existing pretty printers, especially the one for format specifiers in the scanner, shown below. We realized that %f=(^...^) wasn't going to work well. The above design changes resolve that issue, and they somewhat improve consistency, in that %(^fmt^)(^value^) specifies a separate formatter and value just as specifiers for types with built-in support do.

type istring_spec: istring_spec = {
    interp: option string
    pad: option codepoint
    just: option Fmt.just
    sign: option Fmt.sign
    alt: option bool
    zpad: option bool
    width: option uns
    pmode: option Fmt.pmode
    prec: option uns
    radix: option Radix.t
    notation: option Fmt.notation
    pretty: option bool
    abbr: option istring_abbr
  }

pp_istring_spec
  {interp; pad; just; sign; alt; zpad; width; pmode; prec; radix; notation; pretty; abbr} formatter =
    formatter |> Fmt.fmt
      "{%(^Option.pp String.pp^)=(^interp
      ^); %(^Option.pp Codepoint.pp^)=(^pad
      ^); %(^Option.pp Fmt.pp_just^)=(^just
      ^); %(^Option.pp Fmt.pp_sign^)=(^sign
      ^); %(^Option.pp Bool.pp^)=(^alt
      ^); %(^Option.pp Bool.pp^)=(^zpad
      ^); %(^Option.pp Uns.pp^)=(^width
      ^); %(^Option.pp Fmt.pp_pmode^)=(^pmode
      ^); %(^Option.pp Uns.pp^)=(^prec
      ^); %(^Option.pp Radix.pp^)=(^radix
      ^); %(^Option.pp Fmt.pp_notation^)=(^notation
      ^); %(^Option.pp Bool.pp^)=(^pretty
      ^); %(^Option.pp pp_istring_abbr^)=(^abbr^)}"

The scanner DFA complexity was already borderline unacceptable before yesterday's format specifier design tweaks, but the addition of yet another (^...^) pushed me to find a simpler approach. What I settled on is to break interpolated strings into many more tokens, each of which represents a more limited portion of a format specifier. This will push detection of various syntax errors to the parsing and desugaring phases, but error reporting will, if anything, be more specific. As for the scanner, it still needs a stack of states to support nesting (sketched after the grammar below), but at each nesting level it only needs to track whether it is scanning a) interpolated string data, b) a format specifier, or c) an expression embedded within (^...^). Following is a rough approximation of what the parser productions will look like.

IstringSpecParam ::=
| Tok_lparen_carat Expr Tok_carat_rparen

IstringSpecPad ::=
| Tok_codepoint
| ε

IstringSpecJust ::=
| Tok_lt | Tok_carat | Tok_gt
| ε

IstringSpecSign ::=
| Tok_plus | Tok_uscore
| ε

IstringSpecAlt ::=
| Tok_hash
| ε

IstringSpecZpad ::=
| Tok_0
| ε

IstringSpecWidth ::=
| Tok_istring_uns # [1-9][0-9]*
| Tok_star IstringSpecParam
| ε

IstringSpecPrecision ::=
| Tok_dot Tok_istring_uns
| Tok_dot Tok_eq Tok_istring_uns
| Tok_dot Tok_star IstringSpecParam
| Tok_dot Tok_eq Tok_star IstringSpecParam
| ε

IstringSpecBase ::=
| Tok_b | Tok_o | Tok_d | Tok_x
| ε

IstringSpecNotation ::=
| Tok_m | Tok_a | Tok_c
| ε

IstringSpecPretty ::=
| Tok_p
| ε

IstringSpecBitwidth ::=
| Tok_istring_uns # Desugaring error if not (8|16|32|64|128|256|512)
| ε

IstringSpecFmt ::=
| Tok_b
| Tok_abbr_u IstringSpecBitwidth
| Tok_abbr_i IstringSpecBitwidth
| Tok_n | Tok_z
| Tok_c | Tok_s
| IstringSpecParam
| ε

IstringSpecSep ::=
| Tok_isubstring # Desugaring error if not [ ]*=[ ]*
| ε

IstringSpec ::=
| Tok_pct IstringSpecPad IstringSpecJust IstringSpecSign IstringSpecAlt IstringSpecZpad
  IstringSpecWidth IstringSpecPrecision IstringSpecBase IstringSpecNotation IstringSpecPretty
  IstringSpecFmt IstringSpecSep IstringSpecParam

IstringSubstring ::=
| Tok_isubstring
| ε

IstringBodyList ::=
| IstringBodyList IstringSpec IstringSubstring
| ε

IstringBody ::=
| IstringSubstring IstringBodyList

Istring ::= Tok_ditto IstringBody Tok_ditto
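
To make the stack-of-states idea above concrete, here's a minimal sketch of the per-nesting-level scanner modes (constructor names invented for illustration):

type istring_frame =
  | Isubstring  (* a) Interpolated string data. *)
  | Spec        (* b) Format specifier. *)
  | Expr        (* c) Expression embedded within (^...^). *)

While scanning value in "...%u(^value^)..." the stack holds a single frame in Expr mode; if value itself contains an interpolated string, a new Isubstring frame is pushed on top.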

The scanner is currently implemented as a mishmash of a DAG for the initial part of the DFA and a nest of mutually recursive functions thereafter. This ended up being a real pain to maintain, with lots of boilerplate, ad hoc intermediate state passing, etc. I've been poking at this for the past few days, trying to figure out a way to refactor it such that the scanner really is a DFA (actually a set of DFAs, with the addition of format specifier scanning). But I kept bumping up against the limitations on cyclical data structures (about which Hemlock is even stricter than OCaml).

Finally, today I backed up and implemented a DFA engine that separates the control flow from the DFA states and incremental state. At a high level this can be thought of as converting to continuation-passing style (CPS). It looks like this approach is going to work, but it's going to take quite a bit of effort to complete the refactor. Instead of passing various pieces of intermediate state as function parameters, we will have per-state data.

(* Before. *)
val fn_bslash: Text.Cursor.t -> accum -> Text.Cursor.t -> t

(* After. *)
type state =
  ...
  | Bslash of Text.Cursor.t * accum
  ...
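
To show the overall shape of the change, here's a heavily simplified, self-contained OCaml sketch of such a driver; every name is hypothetical, and the real scanner has many more states:

(* Stand-ins for the real token and input types. *)
type token = Tok_placeholder
type input = char list

type state =
  | Start
  | Bslash of input  (* Per-state data replaces extra function parameters. *)

type step_result =
  | Advance of state  (* Consume a codepoint and transition. *)
  | Accept of token   (* Emit a token. *)

(* All control flow lives here; states never call each other directly. *)
let step state input =
  match state, input with
  | Start, '\\' :: _ -> Advance (Bslash input)
  | _ -> Accept Tok_placeholder

let rec drive state input =
  match step state input with
  | Advance state' -> drive state' (List.tl input)
  | Accept token -> token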

I also disentangled lookahead from the DFA engine; the DFA operates on a view type which the scanner is free to manage any way it wants (in our current use case, keeping three cursors in a tuple, (ppcursor, pcursor, cursor)). We may decide to just use Text.Cursor.prev instead of holding on to the previous cursors, but regardless, view reduces coupling and makes the code easier to maintain.
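
Concretely, advancing the tuple view might look like this (assuming, purely as a sketch, that Text.Cursor.next has type t -> codepoint * t):

type view = Text.Cursor.t * Text.Cursor.t * Text.Cursor.t  (* (ppcursor, pcursor, cursor) *)

(* Shift the window right by one codepoint; the oldest cursor falls off. *)
let advance (_ppcursor, pcursor, cursor) =
  let _cp, cursor' = Text.Cursor.next cursor in
  pcursor, cursor, cursor'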

The DFA infrastructure is stable now, and today I subsumed the top-level DAG into the DFA. The DAG code is totally gone now, but there's a lot of work to do in pulling the various token scanning logic into the DFA. It looks like the next logical step will be to subsume the Dentation module, since it is at the foundation of the scanner. After that's done I'll figure out a game plan for the rest of the refactor. It's likely to be a "simple" matter of working straight through the submodules until none are left to subsume.

Line directives turned out to be more fundamental to the internal scanner dependencies than dentation, so I took care of that first (#223). Now I'm working on dentation, and I hit a brick wall due to the need for arbitrary lookahead when determining the indentation level for (*...*) comments, which depends on whether non-whitespace/comment tokens follow on the same line. Fortunately I discovered a way to preserve the current syntax without arbitrary lookahead, but it's going to require significant changes to how dentation is scanned.

Dentation scanning is now handled by the DFA. I ended up fixing a bunch of flaws as part of this rewrite that, had they remained, would have had me pulling my hair out in the parsing phase.

One of the details that used to be wrong in dentation scanning was the absence of line separator tokens after partial dedents.

a
    b
        c
        d
    e
    f

# Equivalent:
a (b (c; d); e; f)

The scanner wasn't emitting the separator between b (c; d) and e.
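
With hypothetical token names (INDENT/DEDENT/LINE_DELIM; the scanner's actual spellings may differ), the corrected stream is roughly:

a INDENT b INDENT c LINE_DELIM d DEDENT LINE_DELIM e LINE_DELIM f DEDENT

where the LINE_DELIM following the first DEDENT is the previously missing separator.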

This got me looking closely at our proposed syntax, and I found two problems. The biggie is that match expressions should be using continuation indentation for patterns.

match expr with
| A -> ...
| B -> ...

# Equivalent (given sufficiently sophisticated parsing):
match expr with ( | A -> ... | B -> ... )

# Should be:
match expr with
  | A -> ...
  | B -> ...

# Equivalent:
match expr with | A -> ... | B -> ...

I won't go into details here, but without continuation indentation for the patterns there are situations in which a single match expression confusingly looks like multiple expressions. Despite my love of aligned pattern matches, I think we should do away with this syntax inconsistency.

The other problem has to do with misleading | tokens in variant type declarations.

type t: t =
    | A
    | B

# Equivalent:
type t: t = ( | A; | B)

Given continuation indentation rules and match expression syntax, the block-indented variants look over-indented. What we really want is:

type t: t =
    A
    B

# Equivalent:
type t: t = (A; B)

[December 31, 2021] Alternatively, we could do without block indentation for variant types.

type t: t =
  | A
  | B

# Equivalent:
type t: t = A | B

This probably fits better with other changes we're likely to make, e.g. the re-introduction of {...} around record type bodies (regardless of whether we unify records and modules).

There were several issues I needed to sort out before reimplementing real scanning, in particular #166 and #102. I significantly improved test coverage and fixed various bugs in the existing scanner, refactoring as I went to reduce incidental complexity relative to what the DFA-based scanner will do. I'm finally able to directly tackle the real scanning rewrite.