tree-sitter / tree-sitter

An incremental parsing system for programming tools

Home Page: https://tree-sitter.github.io

Difficulties implementing structural editing and refactoring with tree-sitter

ethan-leba opened this issue

I think tree-sitter (TS) presents an amazing opportunity to implement language-agnostic refactoring and structural editing in the editor. That is, given the syntax tree (ST) provided by TS and the description of the grammar (via the grammar JSON), we could perform an edit only if we can validate that the result will be a correct ST, or use the grammar to infer where syntax nodes need to be placed without requiring the user to specify them manually (WIP impl. here).

However, in its current state TS has some issues with this use case. I'd like to discuss them with the tree-sitter community to see whether there are feasible solutions that fit the goals of TS, or workarounds I'm not aware of!

Provide full state of the ST

The main inconvenience for this use case is that tree-sitter's API does not expose the full state of the syntax tree: aliased nodes report their alias rather than their underlying rule, and hidden nodes are skipped. While this is great for the querying and highlighting use cases, it forces a refactoring tool to either hack around the difficulties or fork the grammars, neither of which is ideal.

Aliased node types

For example, in the C grammar the function_field_declarator node is aliased to function_declarator in struct nodes. So if we'd like to perform a modification on the node aliased to function_declarator, we'll look up the grammar rules for function_declarator and discover that the current node is invalid before we even attempt to change anything!

We could look at the parent of the aliased function_field_declarator and infer from that node's rules what the underlying type is (maybe), but that seems like an unnecessary complication.
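To make that workaround concrete, here is a rough sketch that searches the parent's rule in the grammar JSON for alias() entries matching the visible node type. The grammar fragment is hand-written and heavily simplified (it is not the real C grammar), and aliasSources is a hypothetical helper, not part of tree-sitter. It only handles the alias(type A, type B) case and ignores aliases whose content is not a plain symbol.

```javascript
// Hand-written, heavily simplified fragment in the shape of a tree-sitter
// grammar.json file. It is NOT the real C grammar.
const grammar = {
  rules: {
    field_declaration: {
      type: "SEQ",
      members: [
        { type: "SYMBOL", name: "_type_specifier" },
        { type: "SYMBOL", name: "_field_declarator" },
        { type: "STRING", value: ";" },
      ],
    },
    _field_declarator: {
      type: "CHOICE",
      members: [
        {
          type: "ALIAS",
          content: { type: "SYMBOL", name: "function_field_declarator" },
          named: true,
          value: "function_declarator",
        },
        { type: "SYMBOL", name: "field_identifier" },
      ],
    },
  },
};

// Hypothetical helper: collect the names of rules that appear under
// `visibleType` via alias() somewhere inside `ruleName`. Hidden
// (underscore-prefixed) rules are followed, because they do not show up
// as nodes of their own in the tree.
function aliasSources(grammar, ruleName, visibleType, seen = new Set()) {
  if (seen.has(ruleName)) return [];
  seen.add(ruleName);
  const found = [];
  const visit = (rule) => {
    if (!rule) return; // rule not defined in this fragment
    if (rule.type === "ALIAS") {
      if (rule.value === visibleType && rule.content.type === "SYMBOL") {
        found.push(rule.content.name);
      }
      return; // alias content is not visible under its own name
    }
    if (rule.type === "SYMBOL") {
      if (rule.name.startsWith("_")) {
        found.push(...aliasSources(grammar, rule.name, visibleType, seen));
      }
      return;
    }
    if (rule.members) rule.members.forEach(visit); // SEQ, CHOICE
    if (rule.content) visit(rule.content); // REPEAT, FIELD, PREC, TOKEN
  };
  visit(grammar.rules[ruleName]);
  return found;
}

console.log(aliasSources(grammar, "field_declaration", "function_declarator"));
// prints [ 'function_field_declarator' ]
```

If the search returns more than one candidate, the parent alone is not enough to decide, which is part of why this feels like a hack.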

Potential solution

Add a ts_unaliased_node_type function that returns the 'true' type of a node. This seems simple for the alias(type A, type B) case, but I'm not sure what the behavior should be for alias(type, ... a bunch of rules ...).

Hidden leaf nodes

For example, the Python grammar hides the _newline, _dedent, and _indent leaf tokens, again leading to a situation where a node that TS parses correctly appears to not be correct according to the grammar.
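As a rough illustration of what a validator has to account for, the sketch below lists the hidden leaf symbols in a grammar JSON, so that a tool could treat them as "present but invisible" when matching a node's children against its rule. The fragment is hand-written and simplified (not the real Python grammar), and hiddenLeaves and referencesSymbols are hypothetical helpers. It assumes hidden leaves are underscore-prefixed names that are either external tokens or rules that reference no other rule.

```javascript
// Hand-written fragment in the shape of grammar.json.
// It is NOT the real Python grammar.
const grammar = {
  externals: [
    { type: "SYMBOL", name: "_newline" },
    { type: "SYMBOL", name: "_indent" },
    { type: "SYMBOL", name: "_dedent" },
  ],
  rules: {
    block: {
      type: "SEQ",
      members: [
        { type: "REPEAT", content: { type: "SYMBOL", name: "_statement" } },
        { type: "SYMBOL", name: "_dedent" },
      ],
    },
    _statement: {
      type: "CHOICE",
      members: [
        { type: "SYMBOL", name: "pass_statement" },
        { type: "SYMBOL", name: "expression_statement" },
      ],
    },
    pass_statement: { type: "STRING", value: "pass" },
  },
};

// True if `rule` references any other rule by name.
function referencesSymbols(rule) {
  if (!rule) return false;
  if (rule.type === "SYMBOL") return true;
  const children = [
    ...(rule.members || []),
    ...(rule.content ? [rule.content] : []),
  ];
  return children.some(referencesSymbols);
}

// Hypothetical helper: underscore-prefixed names that are either external
// tokens or rules built only from strings and patterns.
function hiddenLeaves(grammar) {
  const names = new Set();
  for (const ext of grammar.externals || []) {
    if (ext.type === "SYMBOL" && ext.name.startsWith("_")) names.add(ext.name);
  }
  for (const [name, rule] of Object.entries(grammar.rules)) {
    if (name.startsWith("_") && !referencesSymbols(rule)) names.add(name);
  }
  return [...names].sort();
}

console.log(hiddenLeaves(grammar));
// prints [ '_dedent', '_indent', '_newline' ]
```

Note that _statement is excluded: it is hidden too, but its children are still visible in the tree, so it does not cause the same mismatch.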

Potential solution

Provide a movement API that doesn't skip hidden nodes. Mentioned previously here:

#1156 (comment)

Provide a strict mode for parsers

"As is often the case, I think the most practical fix is to allow a superset of what Python really allows, and treat empty blocks as valid blocks."

tree-sitter/tree-sitter-python#65

This solution works perfectly fine when the code is being written by hand. However, it is problematic for a refactoring use case, where the expectation is that the grammar JSON precisely describes the valid STs for the language.

Potential solution

One option is to mark these grammatical concessions, so that tools consuming the TS grammar JSON know that the marked rule is not actually valid in the language:

_suite: $ => choice(
  alias($._simple_statements, $.block),
  seq($._indent, $.block),
  nonstrict(alias($._newline, $.block))
)

A flag could also be provided for those who don't want to allow a superset of the grammar. Alternatively, falling back to the nonstrict choice after the other choices fail to parse could be treated as an error?
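On the consumer side, a tool reading the grammar JSON could then pick the variant it wants. The sketch below is purely hypothetical: nonstrict does not exist in tree-sitter, and the { type: "NONSTRICT", content } serialization and the applyStrictness helper are assumptions made only for illustration.

```javascript
// Hypothetical JSON form of the `_suite` rule, ASSUMING nonstrict(...) were
// serialized as { type: "NONSTRICT", content }. tree-sitter has no such type.
const suite = {
  type: "CHOICE",
  members: [
    {
      type: "ALIAS",
      content: { type: "SYMBOL", name: "_simple_statements" },
      named: true,
      value: "block",
    },
    {
      type: "SEQ",
      members: [
        { type: "SYMBOL", name: "_indent" },
        { type: "SYMBOL", name: "block" },
      ],
    },
    {
      type: "NONSTRICT",
      content: {
        type: "ALIAS",
        content: { type: "SYMBOL", name: "_newline" },
        named: true,
        value: "block",
      },
    },
  ],
};

// Hypothetical helper: return a copy of `rule` for the requested mode.
// strict=true drops NONSTRICT branches, strict=false unwraps them.
// Returns null when nothing of the rule is left in strict mode.
function applyStrictness(rule, strict) {
  if (rule.type === "NONSTRICT") {
    return strict ? null : applyStrictness(rule.content, strict);
  }
  const copy = { ...rule };
  if (rule.members) {
    const members = rule.members.map((m) => applyStrictness(m, strict));
    if (rule.type === "CHOICE") {
      // A choice survives as long as one alternative survives.
      copy.members = members.filter((m) => m !== null);
      if (copy.members.length === 0) return null;
    } else {
      // SEQ: one impossible member makes the whole sequence impossible.
      if (members.includes(null)) return null;
      copy.members = members;
    }
  }
  if (rule.content) {
    const content = applyStrictness(rule.content, strict);
    if (content === null) {
      // Zero-or-more repetitions of nothing still match the empty input.
      return rule.type === "REPEAT" ? { type: "BLANK" } : null;
    }
    copy.content = content;
  }
  return copy;
}

console.log(applyStrictness(suite, true).members.length); // prints 2
console.log(applyStrictness(suite, false).members.length); // prints 3
```

A refactoring tool would validate against the strict variant, while the parser itself could keep using the permissive one.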

Please let me know what you all think of the issues above, and whether the proposed solutions are viable!