Improper parsing of symbols
Hirevo opened this issue · comments
The way symbols are currently parsed by both interpreters in `som-rs` is somewhat non-standard compared to other SOMs.
This issue exists to track the cases where `som-rs` behaves differently from other SOMs, in order to get them all fixed.
Here are the problematic cases that I am currently aware of:
- Spaces between `#` and identifier (ex: `# foo`, accepted by most SOMs, rejected by `som-rs`)
- Spaces between `#` and operator (ex: `# +`, accepted by most SOMs, rejected by `som-rs`)
- Spaces between `#` and string literal (ex: `# 'foo'`, accepted by most SOMs, rejected by `som-rs`)
- Non-leading successive colons in selector (ex: `#foo::`, rejected by most SOMs, accepted by `som-rs`)
- Leading digits after colons (ex: `#foo:2:`, rejected by most SOMs, accepted by `som-rs`)
Somewhat related to this issue is the situation with array literals, which suffer from a similar problem due to also using the `#` in the syntax:
- Spaces between `#` and `(` (ex: `# (1 2 3)`, accepted by most SOMs, rejected by `som-rs`)
Most of these issues are due to the fact that the lexer is currently tokenizing the whole symbol at once (as: `Token::Symbol(String)`) instead of simply outputting its fragments (something like: `[Token::Pound, Token::Selector(String)]`).
Delegating the construction of the symbol to the parser would likely be the way forward to address these problems.
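To make the idea concrete, here is a minimal sketch of a lexer that only emits fragments and a parser step that assembles the symbol from them. It is not the actual `som-rs` code: apart from `Token::Pound`, the token variants and the functions `lex`, `parse_symbol` and `symbol_from_source` are hypothetical names chosen for illustration, and the selector fragment is split into identifier, keyword, operator and string variants as an assumption.

```rust
/// Hypothetical token set: the lexer emits fragments, never a whole symbol.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Pound,              // `#`
    Identifier(String), // `foo`
    Keyword(String),    // `foo:` (identifier immediately followed by a colon)
    Operator(String),   // `+`, `>=`, ...
    Str(String),        // `'foo'` (stored without the quotes)
    Integer(String),    // `2`
    Colon,              // a colon that is not part of a keyword
}

// Assumed operator character set, for illustration only.
const OPERATOR_CHARS: &str = "~&|*/\\+-=><,@%";

pub fn lex(src: &str) -> Result<Vec<Token>, String> {
    let chars: Vec<char> = src.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_whitespace() {
            i += 1; // whitespace only separates tokens and is then discarded
        } else if c == '#' {
            tokens.push(Token::Pound);
            i += 1;
        } else if c == ':' {
            tokens.push(Token::Colon);
            i += 1;
        } else if c == '\'' {
            let start = i + 1;
            let mut end = start;
            while end < chars.len() && chars[end] != '\'' {
                end += 1;
            }
            if end == chars.len() {
                return Err("unterminated string literal".to_string());
            }
            tokens.push(Token::Str(chars[start..end].iter().collect()));
            i = end + 1;
        } else if c.is_ascii_digit() {
            let start = i;
            while i < chars.len() && chars[i].is_ascii_digit() {
                i += 1;
            }
            tokens.push(Token::Integer(chars[start..i].iter().collect()));
        } else if c.is_alphabetic() {
            let start = i;
            while i < chars.len() && (chars[i].is_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            let mut text: String = chars[start..i].iter().collect();
            if i < chars.len() && chars[i] == ':' {
                text.push(':');
                i += 1;
                tokens.push(Token::Keyword(text));
            } else {
                tokens.push(Token::Identifier(text));
            }
        } else if OPERATOR_CHARS.contains(c) {
            let start = i;
            while i < chars.len() && OPERATOR_CHARS.contains(chars[i]) {
                i += 1;
            }
            tokens.push(Token::Operator(chars[start..i].iter().collect()));
        } else {
            return Err(format!("unexpected character: {:?}", c));
        }
    }
    Ok(tokens)
}

/// Builds the symbol in the parser from `#` plus the fragments that follow it.
pub fn parse_symbol(tokens: &[Token]) -> Result<String, String> {
    match tokens {
        [Token::Pound, Token::Identifier(name)] => Ok(name.clone()),
        [Token::Pound, Token::Operator(op)] => Ok(op.clone()),
        [Token::Pound, Token::Str(text)] => Ok(text.clone()),
        [Token::Pound, rest @ ..] if !rest.is_empty() => {
            // Keyword selector: one or more `name:` parts and nothing else.
            let mut selector = String::new();
            for token in rest {
                match token {
                    Token::Keyword(part) => selector.push_str(part),
                    other => return Err(format!("unexpected token in symbol: {:?}", other)),
                }
            }
            Ok(selector)
        }
        _ => Err("expected `#` followed by a symbol body".to_string()),
    }
}

pub fn symbol_from_source(src: &str) -> Result<String, String> {
    parse_symbol(&lex(src)?)
}
```

With this split, `symbol_from_source("# foo")`, `symbol_from_source("# +")` and `symbol_from_source("# 'foo'")` all succeed because the whitespace is gone before the parser runs, while `#foo::` and `#foo:2:` fail because a stray `Token::Colon` or `Token::Integer` is not a valid selector part. One caveat of this sketch is that it would also accept `#foo: bar:`, since whitespace between keyword parts is discarded as well.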
Hmmm. Interesting. I am not sure how I feel about these things.
I think we need more tests :)
Especially the situation around spaces is a little odd and an artifact of having a separate lexer in most SOM implementations. The lexer simply discards the space.
But the Smalltalk grammar (ANSI Smalltalk) doesn't explicitly mention spaces; instead, it says that a quoted string is to be immediately preceded by a pound sign.
Squeak has the same behavior as SOM, allowing spaces, but it really looks odd to me, and Pharo seems to have fixed it, disallowing spaces between `#` and the rest of the symbol.
`# foo` just doesn't look right. The `#` could be misread as an operator here, for instance in something like `54 # bar`, which should be a parse error, because `54 #bar` is not a valid expression.
Hmm. I think the biggest problem with this at the moment is that we don't have a cross-SOM mechanism to test for parser errors.