tree-sitter / tree-sitter-haskell

Haskell grammar for tree-sitter.

Help with creating a new parser!

FoamScience opened this issue

Hi guys,

First of all, thanks for the amazingly organized and commented code; couldn't find anything like this for a month!

So, I'd like to use your external scanner as a base for my own, but I can't figure out what's needed to compose a simple parser.
My parser is straightforward: it keeps consuming characters and advancing the lexer until it encounters "{", ";", "}" or whitespace. I thought the following would work, but no 😢 :

bool non_identifier_char(const uint32_t c) { return iswspace(c) || eq(';')(c) || eq('{')(c) || eq('}')(c) || eq('$')(c); };
const bool non_identifier_chars(State & state) { return non_identifier_char(state::next_char(state)); };

// If identifier symbol is active, fail if not an identifier char
Parser identifier = sym(Sym::identifier)(iff(cond::non_identifier_chars)(fail));
// Do nothing else, just check for identifiers
all = identifier;

Can anyone help? Thanks in advance!

hey there, happy to hear that the scanner is useful as a library!

Your parser only consumes one character. To parse repeatedly, you'll have to use something like read_while.
Additionally, if you want the scanner to produce a successful result with the characters that have been accepted, you'll need to call the finish combinator.

Your example could be expressed roughly like this:

Parser identifier = sym(Sym::identifier)(
  iff(non_identifier_chars)(fail) + 
  parser::read_while(!non_identifier_char) + 
  parser::finish(Sym::identifier, "some description")
);
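To illustrate the difference between checking a single character and consuming repeatedly, here is a standalone sketch that does not use the scanner library at all. `is_non_identifier_char` and `read_while` below are hypothetical stand-ins that work on a plain `std::string`; they only mirror the idea, not the library's actual combinators or signatures.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for the predicate in the question:
// true for characters that terminate an identifier.
bool is_non_identifier_char(char c) {
  return std::isspace(static_cast<unsigned char>(c)) || c == ';' || c == '{' || c == '}';
}

// Hypothetical stand-in for a read_while-style loop: advances `pos` for as
// long as `pred` holds and returns how many characters were consumed.
std::size_t read_while(const std::string &input, std::size_t &pos,
                       const std::function<bool(char)> &pred) {
  assert(pos <= input.size());
  const std::size_t start = pos;
  while (pos < input.size() && pred(input[pos])) ++pos;
  return pos - start;
}
```

In this sketch the negation is written as a lambda, for example `read_while(input, pos, [](char c) { return !is_non_identifier_char(c); })`. A single `is_non_identifier_char(input[pos])` check on its own would inspect one character and never move `pos`.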

Thanks for the help! That got me half of the way, but I still get a weird error.

I want to parse:

one line;

as ("one" and "line" are identifiers)

(identifier identifier)

However, with:

function<Result(State &)> read_while_parser(Condition pred) {
  return [=](State & state) {
    while (true) {
      if (state::eof(state)) break;
      uint32_t c = state::next_char(state);
      if (!pred(state)) {
            mark("identifier");
            break;
      }
      state::advance(state);
    }
    return Result(Sym::identifier, false);
  };
}
Parser identifier = sym(Sym::identifier)(
  //iff(cond::non_identifier_chars)(fail) + 
  read_while_parser(cond::identifier_chars) + 
  parser::finish(Sym::identifier, "Identifier")
);
Parser all = identifier;

I get ("l" gets eaten somehow!):

State { syms = "identifier", indents = empty }
finish: Identifier
result: identifier, 3
State { syms = "identifier", indents = empty }
finish: Identifier
result: identifier, 3
State { syms = "identifier", indents = empty }
finish: Identifier
result: identifier, 3
State { syms = "identifier", indents = empty }
finish: Identifier
next: ;
result: identifier, 9
State { syms = "identifier", indents = empty }
finish: Identifier
result: identifier, 3
State { syms = "identifier", indents = empty }
finish: Identifier
next: ;
result: identifier, 9
State { syms = "identifier", indents = empty }
finish: Identifier
next: ;
result: identifier, 9
State { syms = "identifier", indents = empty }
finish: Identifier
result: identifier, 9
State { syms = "identifier", indents = empty }
finish: Identifier
result: identifier, 9
State { syms = "identifier", indents = empty }
finish: Identifier
result: identifier, 9
State { syms = "identifier", indents = empty }
finish: Identifier
result: identifier, 9

(ERROR [0, 0] - [1, 0]
  (identifier [0, 0] - [0, 3])
  (ERROR [0, 3] - [0, 5]
    (identifier [0, 3] - [0, 3])
    (ERROR [0, 4] - [0, 5]))
  (identifier [0, 5] - [0, 9]))

<ERROR>
  <identifier>one</identifier>

  <ERROR>
    <identifier></identifier>

    <ERROR>l</ERROR>
</ERROR>

  <identifier>ine;</identifier>
</ERROR>

which does not make much sense; any ideas?

You can view my code here in case you want to take a look at the grammar file.

Any help is much appreciated; Thanks in advance!

it looks to me like it's parsing the empty string as a successful identifier because you commented out the check for the non-identifier character and aren't verifying that you've seen at least one identifier character in your custom parser!
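A minimal standalone sketch of that check, again independent of the scanner library: `ends_identifier` and `scan_identifier` are hypothetical names, and the input is a plain `std::string` rather than the lexer state. The point is only that a scan which consumed zero characters should be reported as a failure instead of an empty identifier.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical predicate: characters that terminate an identifier.
bool ends_identifier(char c) {
  return std::isspace(static_cast<unsigned char>(c)) || c == ';' || c == '{' || c == '}';
}

// Tries to scan one identifier starting at `pos`.
// On success, moves `pos` past the identifier and returns true.
// If there is no identifier character at `pos`, leaves `pos` unchanged
// and returns false, so an empty match is never accepted.
bool scan_identifier(const std::string &input, std::size_t &pos) {
  assert(pos <= input.size());
  std::size_t end = pos;
  while (end < input.size() && !ends_identifier(input[end])) ++end;
  if (end == pos) return false;  // zero characters consumed: not an identifier
  pos = end;
  return true;
}
```

With the input `one line;`, a scan starting at the space after `one` fails instead of yielding a zero-width identifier, which is the kind of node that shows up as `(identifier [0, 3] - [0, 3])` in the tree above.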

Oh, man, it was right in front of my eyes. Thanks a lot!

😅 my pleasure!

@FoamScience I'd be interested in your experience with the scanner library, performance-wise. See #41

I'm actually using only one parser from the scanner library, and I never noticed any slowness compared to what I was using before (I've only tested files up to 4MB in size with my grammar, though).
To me, it seems like chaining parsers (or maybe just one of the ones the library provides) may be the cause of this issue.