Hirevo / som-rs

An alternative implementation of the Simple Object Machine, written in Rust

Home Page: https://hirevo.github.io/som-rs/som_interpreter_bc/index.html

Improper parsing of symbols

Hirevo opened this issue

The current state of how symbols are parsed in both interpreters in som-rs is somewhat non-standard, compared to other SOMs.

This issue exists to track the cases where som-rs behaves differently from other SOMs, so that they can all be fixed.

Here are the problematic cases that I am currently aware of:

  • Spaces between # and identifier (ex: # foo, accepted by most SOMs, rejected by som-rs)
  • Spaces between # and operator (ex: # +, accepted by most SOMs, rejected by som-rs)
  • Spaces between # and string literal (ex: # 'foo', accepted by most SOMs, rejected by som-rs)
  • Non-leading successive colons in selector (ex: #foo::, rejected by most SOMs, accepted by som-rs)
  • Leading digits after colons (ex: #foo:2:, rejected by most SOMs, accepted by som-rs)

Somewhat related to this issue is the situation with array literals, which suffer from a similar problem because they also use # in their syntax:

  • Spaces between # and ( (ex: # (1 2 3), accepted by most SOMs, rejected by som-rs)

Most of these issues are due to the fact that the lexer is currently tokenizing the whole symbol at once (as: Token::Symbol(String)) instead of simply outputting its fragments (something like: [Token::Pound, Token::Selector(String)]).
Delegating the construction of the symbol to the parser would likely be the way forward to address these problems.
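
To make the idea concrete, here is a minimal, self-contained sketch of what "lexer emits fragments, parser builds the symbol" could look like. It is not the som-rs code: the `Token` variants, `lex`, `parse_symbol` and `symbol_literal` are all hypothetical names made up for illustration, and the operator character set is an assumption. The lexer discards whitespace and emits a `Pound` token on its own; the parser then decides what forms a valid symbol.

```rust
// Hypothetical token set: the lexer no longer produces a whole symbol,
// only its fragments.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Pound,
    Identifier(String),
    Keyword(String), // an identifier immediately followed by ':', e.g. "foo:"
    Operator(String),
    Str(String),
    Other(char), // anything this sketch does not model (digits, parens, lone ':')
}

// Assumed set of operator characters, for illustration only.
const OPERATOR_CHARS: &str = "+-*/<>=~&|,@%\\";

fn lex(src: &str) -> Vec<Token> {
    let chars: Vec<char> = src.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_whitespace() {
            i += 1; // whitespace between tokens is simply discarded
        } else if c == '#' {
            tokens.push(Token::Pound);
            i += 1;
        } else if c.is_alphabetic() || c == '_' {
            let start = i;
            while i < chars.len() && (chars[i].is_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            let mut text: String = chars[start..i].iter().collect();
            // "foo:" is a keyword part, but "foo :=" style assignment is not.
            if i < chars.len() && chars[i] == ':' && chars.get(i + 1) != Some(&'=') {
                text.push(':');
                i += 1;
                tokens.push(Token::Keyword(text));
            } else {
                tokens.push(Token::Identifier(text));
            }
        } else if c == '\'' {
            let start = i + 1;
            i = start;
            while i < chars.len() && chars[i] != '\'' {
                i += 1;
            }
            tokens.push(Token::Str(chars[start..i].iter().collect()));
            i += 1; // skip the closing quote, if there was one
        } else if OPERATOR_CHARS.contains(c) {
            let start = i;
            while i < chars.len() && OPERATOR_CHARS.contains(chars[i]) {
                i += 1;
            }
            tokens.push(Token::Operator(chars[start..i].iter().collect()));
        } else {
            tokens.push(Token::Other(c));
            i += 1;
        }
    }
    tokens
}

/// Parses a symbol literal at the start of `tokens`.
/// Returns the symbol's name and the number of tokens consumed.
fn parse_symbol(tokens: &[Token]) -> Option<(String, usize)> {
    if tokens.first() != Some(&Token::Pound) {
        return None;
    }
    match tokens.get(1)? {
        Token::Identifier(s) | Token::Operator(s) | Token::Str(s) => Some((s.clone(), 2)),
        Token::Keyword(_) => {
            // A keyword selector is one or more consecutive keyword parts.
            let mut name = String::new();
            let mut used = 1;
            while let Some(Token::Keyword(part)) = tokens.get(used) {
                name.push_str(part);
                used += 1;
            }
            Some((name, used))
        }
        _ => None,
    }
}

/// Helper for experimenting: accepts `src` only if it is exactly one symbol.
fn symbol_literal(src: &str) -> Option<String> {
    let tokens = lex(src);
    let (name, used) = parse_symbol(&tokens)?;
    if used == tokens.len() { Some(name) } else { None }
}
```

With this split, `symbol_literal("# foo")`, `symbol_literal("# +")` and `symbol_literal("# 'foo'")` are all accepted, while `#foo::` and `#foo:2:` leave unconsumed tokens behind and are rejected. Note that, because whitespace is discarded everywhere, this sketch would also accept something like `#foo: bar:`; whether that is desirable is a separate question.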

Hmmm. Interesting. I am not sure how I feel about these things.

I think we need more tests :)
Especially the situation around spaces is a little odd and an artifact of having a separate lexer in most SOM implementations. The lexer simply discards the space.
But the Smalltalk grammar (ANSI Smalltalk) doesn't explicitly mention spaces; instead, it says that a quoted string is to be immediately preceded by a pound sign.

Squeak has the same behavior as SOM, allowing spaces, but it really looks odd to me, and Pharo seems to have fixed it, disallowing spaces between # and the rest of the symbol.

# foo just doesn't look right. The # could be misread as an operator here, for instance in something like 54 # bar, which should be a parse error, because 54 #bar is not a valid expression.

Hmm. I think the biggest problem with this at the moment is that we don't have a cross-SOM mechanism to test for parser errors.