jzimmerman / langcc

langcc: A Next-Generation Compiler Compiler

Home Page: https://langcc.io



How to handle significant whitespace?

joshuawarner32 opened this issue · comments

Hi! I'm interested in exploring how to use this parser system for a language that uses indentation to indicate scoping - which, notably, is not something a raw LR parser (or any strict context-free parser) can handle.

Normally, this would be handled in the lexer by consuming all the space/tab indentation in the lexer and emitting INDENT/DEDENT tokens for when the indentation is increased/decreased, respectively.
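The classic indent-stack approach described above can be sketched in a few lines of Python (this is just an illustration of the technique, not langcc's or CPython's actual implementation): keep a stack of open indentation levels, emit INDENT when a line is deeper than the top of the stack, and emit one DEDENT per popped level when it is shallower.

```python
# Minimal sketch of indentation-based tokenization: compare each line's
# leading whitespace against a stack of open indentation levels and emit
# INDENT/DEDENT tokens accordingly. Illustration only.
def tokenize_indentation(source):
    stack = [0]           # open indentation levels; 0 is always the base
    tokens = []
    for line in source.splitlines():
        if not line.strip():
            continue      # blank lines carry no indentation information
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            tokens.append("INDENT")
        while width < stack[-1]:
            stack.pop()
            tokens.append("DEDENT")
        tokens.append(("LINE", line.strip()))
        tokens.append("NEWLINE")
    while stack[-1] > 0:  # close any blocks still open at end of input
        stack.pop()
        tokens.append("DEDENT")
    return tokens
```

On `"if x:\n    y = 1\nz = 2\n"` this yields one INDENT before `y = 1` and one DEDENT before `z = 2`, which is exactly the shape a context-free grammar can then parse with ordinary bracketing rules.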

I'm puzzled as to how this works in the example py.lang grammar. The only mention of whitespace there seems to be a rule that immediately discards it. I also don't see any mention in the technical paper on arXiv of how this is handled.

The relevant directive is ws_sig, as in:

    mode body ws_sig {

This will tell langcc that, even though the lexer may pass over whitespace, it should still have special machinery for converting the passed-over whitespace into newline/indent/dedent tokens. Please note that the whitespace machinery will have generally "Python-like" characteristics: e.g., backslashes at the end of lines will cause line continuation; and whitespace characters that occur between matching pairs of brackets ([], {}, ()) will be exempted.
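The two exemptions mentioned above can be illustrated as a pre-pass that joins physical lines into logical lines (a sketch of the described behavior, not langcc's actual machinery; it also naively ignores brackets inside string literals):

```python
# Sketch of the two "Python-like" exemptions: a trailing backslash joins
# the next physical line, and newlines inside unclosed [], {}, () are
# absorbed, so neither produces NEWLINE/INDENT/DEDENT tokens downstream.
def logical_lines(source):
    out, buf, depth = [], "", 0
    for line in source.splitlines():
        if line.endswith("\\"):          # explicit line continuation
            buf += line[:-1]
            continue
        buf += line
        depth += sum(line.count(c) for c in "([{")
        depth -= sum(line.count(c) for c in ")]}")
        if depth > 0:                    # still inside open brackets
            buf += " "
            continue
        out.append(buf)
        buf = ""
    if buf:
        out.append(buf)
    return out
```

For example, `"a = [1,\n2]"` becomes the single logical line `"a = [1, 2]"`, so the newline inside the brackets never reaches the indentation machinery.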

Also, I have just pushed some changes that should significantly improve whitespace handling, so if you are working with significant whitespace, be sure to pull a fresh copy.

Aha! That explains it! Do you have any plans to allow customizing this behavior in the future?

No plans as of yet, but it's not out of the question. How specifically would you want to customize it?

The Haskell family of languages uses a somewhat different form of whitespace-significant indentation than Python: the column of the first character of an "inner" line must be greater than the column of the "head" token on the "parent" line, and the "inner" line is considered a child of the innermost parent for which this rule holds.

This has two practical effects on the look/feel of the code:

  1. indentation doesn't follow "tab stops" - it is specified precisely, down to the column
  2. different amounts of indentation on the next line can lead to very different meanings, by parenting that line to different parts of the previous line
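A heavily simplified sketch of this column-based layout rule (hypothetical illustration, not GHC's algorithm; the real rule in the Haskell Report is considerably more involved): a layout keyword opens a block whose reference column is the column of the next line's first token; subsequent lines at that exact column start new items, deeper lines continue the current item, and shallower lines close the block.

```python
# Simplified Haskell-style layout: translate (column, text) lines into a
# token stream with explicit "{", ";", "}" markers. Illustration only.
def layout(lines):
    contexts, out, open_next = [], [], False
    for col, text in lines:
        if open_next:
            contexts.append(col)     # this line's column defines the block
            out.append("{")
            open_next = False
        else:
            while contexts and col < contexts[-1]:
                out.append("}")      # shallower line closes the block
                contexts.pop()
            if contexts and col == contexts[-1]:
                out.append(";")      # same column starts a new item
        out.append(text)
        if text.split()[-1] in ("where", "do", "let", "of"):
            open_next = True         # next line opens a new context
    out.extend("}" for _ in contexts)
    return out
```

Note that a line at a *greater* column than the context emits nothing and simply continues the previous item, which is exactly the "indent one more column and the meaning changes" effect described above.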

I'm afraid this isn't quite enough detail for me to pinpoint how your desired behavior differs from the "Python-like" interpretation. Could you specify precisely when Newline, Indent, and Dedent tokens should be emitted (and how many), in your desired behavior? Also, would \ still serve as a line continuation, and would the interior of matching braces ([], {}, ()) still be exempt from whitespace processing?

I guess what I'm hoping for is not to expand the configurability of whitespace handling to cover my particular case, but rather to make it much more flexible overall: tending more towards a Turing-complete lexer than the (apparently) configurable lexer that's currently implemented.

I will think about this, though it may not be something I can get to in the near future. One problem is that we still want to have the declarative, regex-based tokens spec in order to compile to an efficient DFA implementation for the lexer. So whatever Turing-complete language the lexer uses, that language should be restricted to the "action" part of the lexer mode definition, with the tokens still matched by regex.
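The split being proposed here can be sketched as follows (a hypothetical design illustration, not langcc's API): token *shapes* stay declarative regexes, compiled once into a single matcher, while each rule carries an arbitrary "action" function that runs on a match and decides what token, if any, to emit. The Turing-complete part is confined to the actions; matching itself remains a plain regex/DFA scan.

```python
import re

# Regex-matched tokens with Turing-complete per-rule actions.
# rules: list of (name, regex, action); action(text, state) -> token or None.
def make_lexer(rules):
    pattern = re.compile("|".join(f"(?P<{n}>{r})" for n, r, _ in rules))
    actions = {n: a for n, _, a in rules}
    def lex(source):
        state, tokens = {}, []
        for m in pattern.finditer(source):
            tok = actions[m.lastgroup](m.group(), state)
            if tok is not None:
                tokens.append(tok)
        return tokens
    return lex

# Example: whitespace is matched declaratively but its action tracks it in
# shared state and emits nothing; other actions emit tokens as usual.
lexer = make_lexer([
    ("WS",  r"[ \t\n]+", lambda s, st: st.update(ws=st.get("ws", 0) + len(s)) or None),
    ("ID",  r"[a-z]+",   lambda s, st: ("ID", s)),
    ("NUM", r"[0-9]+",   lambda s, st: ("NUM", int(s))),
])
```

The `state` dict is where arbitrary logic (e.g. an indentation stack) could live without giving up the efficient declarative match step.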

> Please note that the whitespace machinery will have generally "Python-like" characteristics: e.g., backslashes at the end of lines will cause line continuation; and whitespace characters that occur between matching pairs of brackets ([], {}, ()) will be exempted.

How would we override this for grammars which have non pythonic but still significant whitespace?

> How would we override this for grammars which have non pythonic but still significant whitespace?

It depends on the nature of the significant whitespace. If, like Python, it is still based on inserting Newline/Indent/Dedent tokens at appropriate points, then it may not be too difficult to modify the code to support it. If it is not of that form, it would probably be much more difficult. Fundamentally, the DFA-based lexer spec language of langcc is not designed for arbitrary significant whitespace, and so any solution is going to be somewhat hacky.

In my case I was hoping to have YAML-ish map declarations, but delimited by {}, so newline/indent/dedent tokens would definitely be sufficient for that.

Can you say a bit more about the desired behavior in this case? How does it differ from the current behavior with ws_sig?

My goal was to parse something like:

    obj = {
      # fields obj.a, obj.b
      a: 0
      b: 1
      c:
        # nested fields obj.c.regular, obj.c.plusplus
        regular: 2
        plusplus: 9000
    }

but if "whitespace characters that occur between matching pairs of brackets ([], {}, ()) will be exempted", then this would not be possible, since the indentation I'm relying on lies between a matching pair of {}.
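What this use case amounts to is making the exempt bracket set configurable. A sketch (hypothetical illustration, not langcc's implementation): suppress indentation processing only inside brackets listed in an `exempt` parameter, so that dropping `{}` from the set keeps indentation significant inside braces, as in the example above.

```python
# Indent/dedent tokenization with a configurable set of exempt bracket
# pairs; indentation is ignored only while inside exempt brackets.
# Illustration only (naive about brackets inside strings/comments).
def indent_tokens(source, exempt="()[]"):
    opens = {exempt[i]: exempt[i + 1] for i in range(0, len(exempt), 2)}
    closes = set(opens.values())
    stack, tokens, depth = [0], [], 0
    for line in source.splitlines():
        if not line.strip():
            continue
        if depth == 0:                    # outside exempt brackets only
            width = len(line) - len(line.lstrip(" "))
            if width > stack[-1]:
                stack.append(width)
                tokens.append("INDENT")
            while width < stack[-1]:
                stack.pop()
                tokens.append("DEDENT")
        for ch in line:
            if ch in opens:
                depth += 1
            elif ch in closes:
                depth -= 1
        tokens.append(("LINE", line.strip()))
    while stack[-1] > 0:
        stack.pop()
        tokens.append("DEDENT")
    return tokens
```

With the default `exempt="()[]"`, the nested fields in the braces-delimited example produce INDENT/DEDENT tokens; passing `exempt="()[]{}"` restores the Python-like behavior where they don't.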

I am not quite sure I understand your example language. If the inner object is also a dictionary, then, to be consistent, shouldn't it also have braces?

    obj = {
      a: 0
      b: 1
      c: {
        regular: 2
        plusplus: 9000
      }
    }

In any event, I have just pushed some changes that should allow customization of the delimiters in ws_sig mode (see grammars/py.lang, grammars/test/ws_sig_02_brackets.lang for examples) -- let me know if that works for your use case.