tree-sitter / tree-sitter-javascript

Javascript grammar for tree-sitter

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

`jsx_element` nodes on different line being parsed as start on the same line.

lucario387 opened this issue · comments

From v0.21.4, I experienced the jsx elements to be captured too widely. Here's an example of it

export default function Home() {
  return (
    <>
      <Button
        style={{ 
          color: 'blue', 
        }}
        disabled
      >
      </Button>
    </>
  )
}

The output of tree-sitter parse is the following:

(program [0, 0] - [13, 0]
  (export_statement [0, 0] - [12, 1]
    declaration: (function_declaration [0, 15] - [12, 1]
      name: (identifier [0, 24] - [0, 28])
      parameters: (formal_parameters [0, 28] - [0, 30])
      body: (statement_block [0, 31] - [12, 1]
        (return_statement [1, 2] - [11, 3]
          (parenthesized_expression [1, 9] - [11, 3]
            (jsx_element [2, 4] - [10, 7]
              open_tag: (jsx_opening_element [2, 4] - [2, 6])
              (jsx_element [2, 6] - [9, 15]
                open_tag: (jsx_opening_element [2, 6] - [8, 7]
                  name: (identifier [3, 7] - [3, 13])
                  attribute: (jsx_attribute [4, 8] - [6, 10]
                    (property_identifier [4, 8] - [4, 13])
                    (jsx_expression [4, 14] - [6, 10]
                      (object [4, 15] - [6, 9]
                        (pair [5, 10] - [5, 23]
                          key: (property_identifier [5, 10] - [5, 15])
                          value: (string [5, 17] - [5, 23]
                            (string_fragment [5, 18] - [5, 22]))))))
                  attribute: (jsx_attribute [7, 8] - [7, 16]
                    (property_identifier [7, 8] - [7, 16])))
                close_tag: (jsx_closing_element [8, 7] - [9, 15]
                  name: (identifier [9, 8] - [9, 14])))
              close_tag: (jsx_closing_element [9, 15] - [10, 7]))))))))

While this is the parsed result from commit c4739fe

(program [0, 0] - [13, 0]
  (export_statement [0, 0] - [12, 1]
    declaration: (function_declaration [0, 15] - [12, 1]
      name: (identifier [0, 24] - [0, 28])
      parameters: (formal_parameters [0, 28] - [0, 30])
      body: (statement_block [0, 31] - [12, 1]
        (return_statement [1, 2] - [11, 3]
          (parenthesized_expression [1, 9] - [11, 3]
            (jsx_element [2, 4] - [10, 7]
              open_tag: (jsx_opening_element [2, 4] - [2, 6])
              (jsx_element [3, 6] - [9, 15]
                open_tag: (jsx_opening_element [3, 6] - [8, 7]
                  name: (identifier [3, 7] - [3, 13])
                  attribute: (jsx_attribute [4, 8] - [6, 10]
                    (property_identifier [4, 8] - [4, 13])
                    (jsx_expression [4, 14] - [6, 10]
                      (object [4, 15] - [6, 9]
                        (pair [5, 10] - [5, 23]
                          key: (property_identifier [5, 10] - [5, 15])
                          value: (string [5, 17] - [5, 23]
                            (string_fragment [5, 18] - [5, 22]))))))
                  attribute: (jsx_attribute [7, 8] - [7, 16]
                    (property_identifier [7, 8] - [7, 16])))
                close_tag: (jsx_closing_element [9, 6] - [9, 15]
                  name: (identifier [9, 8] - [9, 14])))
              close_tag: (jsx_closing_element [10, 4] - [10, 7]))))))))

And the diff of the two parsed trees:

diff --git a/old_tree.txt b/new_tree.txt
index 31e1bb606db0..f0bb1eb3ac42 100644
--- a/old_tree.txt
+++ b/new_tree.txt
@@ -8,8 +8,8 @@
           (parenthesized_expression [1, 9] - [11, 3]
             (jsx_element [2, 4] - [10, 7]
               open_tag: (jsx_opening_element [2, 4] - [2, 6])
-              (jsx_element [3, 6] - [9, 15]
-                open_tag: (jsx_opening_element [3, 6] - [8, 7]
+              (jsx_element [2, 6] - [9, 15]
+                open_tag: (jsx_opening_element [2, 6] - [8, 7]
                   name: (identifier [3, 7] - [3, 13])
                   attribute: (jsx_attribute [4, 8] - [6, 10]
                     (property_identifier [4, 8] - [4, 13])
@@ -21,6 +21,6 @@
                             (string_fragment [5, 18] - [5, 22]))))))
                   attribute: (jsx_attribute [7, 8] - [7, 16]
                     (property_identifier [7, 8] - [7, 16])))
-                close_tag: (jsx_closing_element [9, 6] - [9, 15]
+                close_tag: (jsx_closing_element [8, 7] - [9, 15]
                   name: (identifier [9, 8] - [9, 14])))
-              close_tag: (jsx_closing_element [10, 4] - [10, 7]))))))))
+              close_tag: (jsx_closing_element [9, 15] - [10, 7]))))))))

As can be seen from the file above. despite <></> and <Button></Button> be on different lines, the <Button> element starts at [2, 6] instead of [3, 6] like in v0.21.3

I think the changes in 99be62f may have lead to this being affected.

It seems like adding token.immediate to had an effect on the outer _jsx_child: choice?

We could explicitly allow whitespace with something like

    _jsx_child: $ => choice(
      $.jsx_text,
      /[\s\p{Zs}\uFEFF\u2028\u2029\u2060\u200B]+/,
      $.html_character_reference,
      $._jsx_element,
      $.jsx_expression,
    ),

But TBH, this feels downstream of #328 upholding #221 . This is starting to feel like the wrong call, its not what estree-compatible parsers do but theres no spec requiring it either way.

what i mean is having

<div>
    <div/>
</div>

result in

(program [0, 0] - [3, 0]
  (expression_statement [0, 0] - [2, 6]
    (jsx_element [0, 0] - [2, 6]
      open_tag: (jsx_opening_element [0, 0] - [0, 5]
        name: (identifier [0, 1] - [0, 4]))
      (jsx_text [0, 5] - [1, 4])
      (jsx_self_closing_element [1, 4] - [1, 10]
        name: (identifier [1, 5] - [1, 8]))
      (jsx_text [1, 10] - [2, 0])
      close_tag: (jsx_closing_element [2, 0] - [2, 6]
        name: (identifier [2, 2] - [2, 5])))))

would also solve this problem (and be helpful for babel/acorn compatibility) but would go back on #221

what do you think @amaanq

What if we just forbid newlines in the first regex group? The problem is (and this is a somewhat known issue) the spaces get consumed when parsing the author's original example, and then get lumped in with the parsing of the </ token because no prior token was produced (ideally a jsx_text one).

Log output showing what I mean:

image

The </ token shouldn't include spaces before it, so I think an easy stopgap here to ensure that we get a jsx_text node is to disallow newlines in the first group like I mentioned, so we get jsx_text nodes when there's no newline, and if there is, it's skipped as long as it's all whitespace only.

@jackschu a question I have is, are these spaces after a newline rendered in React as literal spaces? I'm no React/JSX expert, but if they are, we should consider going back on 221 then imo, since real syntactic info about the code is lost. If it is not rendered as spaces, then we can either go with what I proposed.

are these spaces after a newline rendered in React as literal spaces?

Kind of. If newlines are separating text nodes then React will render these as literal spaces. My best understanding is that this is the current ruleset for react facebook/react#480 (comment)

All lines having leading or trailing whitespace are trimmed, all newlines are
removed, adjacent text separated by newlines become separated by a single
space. Any whitespace tabs are replaced with spaces. Strings inside
expressions are unaffected.

This means that these examples will include whitespace in what react emits

let x = (<div> <Button/> </div>)
// OR
let y = (<div>
    hi
    bye
</div>)

But this example will not

<div> 
    <Button/> 
</div>

(for more detail see the JS emitted by this playground link)

So babel, acorn, etc decide to just always emit jsx_text nodes containing newlines and let downstream tooling figure it out. This is the direction id lean especially given that the ruleset that react seems to use is not really codified by any spec. But I'm not sure if this would be bad for @lucario387 's use-case.

From what I can see of both of the examples, and from the playground link

let x = (<div> <Button/> </div>)
// OR
let y = (<div>
    hi
    bye
</div>)

They both do contain whitespaces, but they do not consider it part of the child element, but rather as separate, actual " " string nodes.
The latter let y = ... follows normal HTML spec, as

<p>
    Foo
    Bar
</p>

creates foo bar on browsers. This means \s* inbetween text nodes are stripped and trimmed to a singular whitespace, while whitespaces around an element are not captured at all. I believe the correct behavior should be mirroring that of HTML, considering jsx element are glorified HTML elements.

Agreed, I think this is what @amaanq and I are talking about w.r.t. 'going back on 221'