jzimmerman / langcc

langcc: A Next-Generation Compiler Compiler

Home Page: https://langcc.io



How to handle significant whitespace?

joshuawarner32 opened this issue · comments

Hi! I'm interested in exploring how to use this parser system for a language that uses indentation to indicate scoping - which, notably, is not something a raw LR parser (or any strict context-free parser) can handle.

Normally, this would be handled in the lexer by consuming all the space/tab indentation in the lexer and emitting INDENT/DEDENT tokens for when the indentation is increased/decreased, respectively.
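The classic indent-stack approach described above can be sketched in a few lines of Python (this is just an illustration of the technique, not langcc's or CPython's actual implementation): keep a stack of open indentation levels, emit INDENT when a line is deeper than the top of the stack, and emit one DEDENT per popped level when it is shallower.

```python
# Minimal sketch of indentation-based tokenization: compare each line's
# leading whitespace against a stack of open indentation levels and emit
# INDENT/DEDENT tokens accordingly. Illustration only.
def tokenize_indentation(source):
    stack = [0]           # open indentation levels; 0 is always the base
    tokens = []
    for line in source.splitlines():
        if not line.strip():
            continue      # blank lines carry no indentation information
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            tokens.append("INDENT")
        while width < stack[-1]:
            stack.pop()
            tokens.append("DEDENT")
        tokens.append(("LINE", line.strip()))
        tokens.append("NEWLINE")
    while stack[-1] > 0:  # close any blocks still open at end of input
        stack.pop()
        tokens.append("DEDENT")
    return tokens
```

On `"if x:\n    y = 1\nz = 2\n"` this yields one INDENT before `y = 1` and one DEDENT before `z = 2`, which is exactly the shape a context-free grammar can then parse with ordinary bracketing rules.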

I'm puzzled as to how this works in the example py.lang grammar. The only mention of whitespace there seems to be a rule that immediately discards it. I also don't see any mention in the technical paper on arXiv of how this is handled.

The relevant directive is ws_sig, as in:

    mode body ws_sig {

This will tell langcc that, even though the lexer may pass over whitespace, it should still have special machinery for converting the passed-over whitespace into newline/indent/dedent tokens. Please note that the whitespace machinery will have generally "Python-like" characteristics: e.g., backslashes at the end of lines will cause line continuation; and whitespace characters that occur between matching pairs of brackets ([], {}, ()) will be exempted.
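The two exemptions mentioned above can be illustrated as a pre-pass that joins physical lines into logical lines (a sketch of the described behavior, not langcc's actual machinery; it also naively ignores brackets inside string literals):

```python
# Sketch of the two "Python-like" exemptions: a trailing backslash joins
# the next physical line, and newlines inside unclosed [], {}, () are
# absorbed, so neither produces NEWLINE/INDENT/DEDENT tokens downstream.
def logical_lines(source):
    out, buf, depth = [], "", 0
    for line in source.splitlines():
        if line.endswith("\\"):          # explicit line continuation
            buf += line[:-1]
            continue
        buf += line
        depth += sum(line.count(c) for c in "([{")
        depth -= sum(line.count(c) for c in ")]}")
        if depth > 0:                    # still inside open brackets
            buf += " "
            continue
        out.append(buf)
        buf = ""
    if buf:
        out.append(buf)
    return out
```

For example, `"a = [1,\n2]"` becomes the single logical line `"a = [1, 2]"`, so the newline inside the brackets never reaches the indentation machinery.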

Also, I have just pushed some changes that should significantly improve whitespace handling, so if you are working with significant whitespace, be sure to pull a fresh copy.

Aha! That explains it! Do you have any plans to allow customizing this behavior in the future?

No plans as of yet, but it's not out of the question. How specifically would you want to customize it?

The Haskell family of languages uses a somewhat different form of whitespace-significant indentation than Python: the column of the first character of an "inner" line must be greater than the column of the "head" token on the "parent" line, and the "inner" line is considered a child of the innermost parent for which this rule holds.

This has two practical effects on the look/feel of the code:

  1. indentation doesn't follow "tab stops" - it is specified precisely, down to the column
  2. different amounts of indentation on the next line can lead to very different meanings, by parenting that line to different parts of the previous line
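A heavily simplified sketch of this column-based layout rule (hypothetical illustration, not GHC's algorithm; the real rule in the Haskell Report is considerably more involved): a layout keyword opens a block whose reference column is the column of the next line's first token; subsequent lines at that exact column start new items, deeper lines continue the current item, and shallower lines close the block.

```python
# Simplified Haskell-style layout: translate (column, text) lines into a
# token stream with explicit "{", ";", "}" markers. Illustration only.
def layout(lines):
    contexts, out, open_next = [], [], False
    for col, text in lines:
        if open_next:
            contexts.append(col)     # this line's column defines the block
            out.append("{")
            open_next = False
        else:
            while contexts and col < contexts[-1]:
                out.append("}")      # shallower line closes the block
                contexts.pop()
            if contexts and col == contexts[-1]:
                out.append(";")      # same column starts a new item
        out.append(text)
        if text.split()[-1] in ("where", "do", "let", "of"):
            open_next = True         # next line opens a new context
    out.extend("}" for _ in contexts)
    return out
```

Note that a line at a *greater* column than the context emits nothing and simply continues the previous item, which is exactly the "indent one more column and the meaning changes" effect described above.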

I'm afraid this isn't quite enough detail for me to pinpoint how your desired behavior differs from the "Python-like" interpretation. Could you specify precisely when Newline, Indent, and Dedent tokens should be emitted (and how many), in your desired behavior? Also, would \ still serve as a line continuation, and would the interior of matching braces ([], {}, ()) still be exempt from whitespace processing?

I guess what I'm hoping for is not to expand the configurability of whitespace handling to cover my particular case, but rather to make it much more flexible overall: tending more towards a Turing-complete lexer than the (apparently) configurable lexer that's currently implemented.

I will think about this, though it may not be something I can get to in the near future. One problem is that we still want to have the declarative, regex-based tokens spec in order to compile to an efficient DFA implementation for the lexer. So whatever Turing-complete language the lexer uses, that language should be restricted to the "action" part of the lexer mode definition, with the tokens still matched by regex.
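The split being proposed here can be sketched as follows (a hypothetical design illustration, not langcc's API): token *shapes* stay declarative regexes, compiled once into a single matcher, while each rule carries an arbitrary "action" function that runs on a match and decides what token, if any, to emit. The Turing-complete part is confined to the actions; matching itself remains a plain regex/DFA scan.

```python
import re

# Regex-matched tokens with Turing-complete per-rule actions.
# rules: list of (name, regex, action); action(text, state) -> token or None.
def make_lexer(rules):
    pattern = re.compile("|".join(f"(?P<{n}>{r})" for n, r, _ in rules))
    actions = {n: a for n, _, a in rules}
    def lex(source):
        state, tokens = {}, []
        for m in pattern.finditer(source):
            tok = actions[m.lastgroup](m.group(), state)
            if tok is not None:
                tokens.append(tok)
        return tokens
    return lex

# Example: whitespace is matched declaratively but its action tracks it in
# shared state and emits nothing; other actions emit tokens as usual.
lexer = make_lexer([
    ("WS",  r"[ \t\n]+", lambda s, st: st.update(ws=st.get("ws", 0) + len(s)) or None),
    ("ID",  r"[a-z]+",   lambda s, st: ("ID", s)),
    ("NUM", r"[0-9]+",   lambda s, st: ("NUM", int(s))),
])
```

The `state` dict is where arbitrary logic (e.g. an indentation stack) could live without giving up the efficient declarative match step.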

> Please note that the whitespace machinery will have generally "Python-like" characteristics: e.g., backslashes at the end of lines will cause line continuation; and whitespace characters that occur between matching pairs of brackets ([], {}, ()) will be exempted.

How would we override this for grammars which have non pythonic but still significant whitespace?

> How would we override this for grammars which have non pythonic but still significant whitespace?

It depends on the nature of the significant whitespace. If, like Python, it is still based on inserting Newline/Indent/Dedent tokens at appropriate points, then it may not be too difficult to modify the code to support it. If it is not of that form, it would probably be much more difficult. Fundamentally, the DFA-based lexer spec language of langcc is not designed for arbitrary significant whitespace, and so any solution is going to be somewhat hacky.

In my case I was hoping to have YAML-ish map declarations, but delimited by {}, so newline/indent/dedent tokens would definitely be sufficient for that.

Can you say a bit more about the desired behavior in this case? How does it differ from the current behavior with ws_sig?

My goal was to parse something like:

    obj = {
      # fields obj.a, obj.b
      a: 0
      b: 1
      c:
        # nested fields obj.c.regular, obj.c.plusplus
        regular: 2
        plusplus: 9000
    }

but if "whitespace characters that occur between matching pairs of brackets ([], {}, ()) will be exempted", then this would not be possible, since the indentation I'm relying on lies between a matching pair of {}.
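What this use case amounts to is making the exempt bracket set configurable. A sketch (hypothetical illustration, not langcc's implementation): suppress indentation processing only inside brackets listed in an `exempt` parameter, so that dropping `{}` from the set keeps indentation significant inside braces, as in the example above.

```python
# Indent/dedent tokenization with a configurable set of exempt bracket
# pairs; indentation is ignored only while inside exempt brackets.
# Illustration only (naive about brackets inside strings/comments).
def indent_tokens(source, exempt="()[]"):
    opens = {exempt[i]: exempt[i + 1] for i in range(0, len(exempt), 2)}
    closes = set(opens.values())
    stack, tokens, depth = [0], [], 0
    for line in source.splitlines():
        if not line.strip():
            continue
        if depth == 0:                    # outside exempt brackets only
            width = len(line) - len(line.lstrip(" "))
            if width > stack[-1]:
                stack.append(width)
                tokens.append("INDENT")
            while width < stack[-1]:
                stack.pop()
                tokens.append("DEDENT")
        for ch in line:
            if ch in opens:
                depth += 1
            elif ch in closes:
                depth -= 1
        tokens.append(("LINE", line.strip()))
    while stack[-1] > 0:
        stack.pop()
        tokens.append("DEDENT")
    return tokens
```

With the default `exempt="()[]"`, the nested fields in the braces-delimited example produce INDENT/DEDENT tokens; passing `exempt="()[]{}"` restores the Python-like behavior where they don't.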

I am not quite sure I understand your example language. If the inner object is also a dictionary, then, to be consistent, shouldn't it also have braces?

    obj = {
      a: 0
      b: 1
      c: {
        regular: 2
        plusplus: 9000
      }
    }

In any event, I have just pushed some changes that should allow customization of the delimiters in ws_sig mode (see grammars/py.lang, grammars/test/ws_sig_02_brackets.lang for examples) -- let me know if that works for your use case.