Some `text` node has not `position` property

Question

Some `text` node has not `position` property

azu opened this issue 3 years ago · comments

azu commented 3 years ago

Subject of the issue

When parsing some text, the text node has not position property.

Your environment

OS: macOs 10.15.7
Packages:

    "remark-gfm": "^1.0.0",
    "remark-parse": "^9.0.0",
    "unified": "^9.2.1"

Env:

Node: 14.16.1 - ~/.volta/tools/image/node/14.16.1/bin/node
Yarn: 1.22.10 - ~/.volta/tools/image/yarn/1.22.10/bin/yarn
npm: 6.14.12 - ~/.volta/tools/image/node/14.16.1/bin/npm

Steps to reproduce

Reproducing repo: https://github.com/azu/remark-gfm-no-position-bug

import unified from "unified";
import remarkGfm from "remark-gfm";
import remarkParse from "remark-parse";

const remark = unified().use(remarkParse).use(remarkGfm);
const ast = remark.parse(`http&#x3A;//user:password@host:port/path?key=value#fragment`);
console.log(JSON.stringify(ast, null, 4));

output:

{
    "type": "root",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {
                    "type": "text",
                    "value": "http://user:password@host:port/path?key=value#fragment"
                }
            ],
            "position": {
                "start": {
                    "line": 1,
                    "column": 1,
                    "offset": 0
                },
                "end": {
                    "line": 1,
                    "column": 60,
                    "offset": 59
                }
            }
        }
    ],
    "position": {
        "start": {
            "line": 1,
            "column": 1,
            "offset": 0
        },
        "end": {
            "line": 1,
            "column": 60,
            "offset": 59
        }
    }
}

Expected behavior

The text node should have position property.

{
    "type": "root",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {
                    "type": "text",
                    "value": "http://user:password@host:port/path?key=value#fragment",
                    "position": {
                        "start": {
                            "line": 1,
                            "column": 1,
                            "offset": 0
                        },
                        "end": {
                            "line": 1,
                            "column": 60,
                            "offset": 59
                        }
                    }
                }
            ],
            "position": {
                "start": {
                    "line": 1,
                    "column": 1,
                    "offset": 0
                },
                "end": {
                    "line": 1,
                    "column": 60,
                    "offset": 59
                }
            }
        }
    ],
    "position": {
        "start": {
            "line": 1,
            "column": 1,
            "offset": 0
        },
        "end": {
            "line": 1,
            "column": 60,
            "offset": 59
        }
    }
}

Actual behavior

The text node has not position property.

{
    "type": "root",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {
                    "type": "text",
                    "value": "http://user:password@host:port/path?key=value#fragment"
                }
            ],
            "position": {
                "start": {
                    "line": 1,
                    "column": 1,
                    "offset": 0
                },
                "end": {
                    "line": 1,
                    "column": 60,
                    "offset": 59
                }
            }
        }
    ],
    "position": {
        "start": {
            "line": 1,
            "column": 1,
            "offset": 0
        },
        "end": {
            "line": 1,
            "column": 60,
            "offset": 59
        }
    }
}

Escaping + http:// is something wrong:
http://a

Titus · Answer 1 · Fri May 14 2021 20:02:29 GMT+0800 (China Standard Time)

Unfortunately we can’t add them.

GH’s uses two algorithms to add autolinks (see this readme and some of the other issues there). One at parse time, another at “AST” time.

The one at parse time nicely adds positional info. The one at AST-time can’t infer the original position of certain URLs from text nodes, because the AST only has the starting and ending point, but no knowledge in this case:

- > a block quote in a list
    with https://example.com a URL

But, the fact that http://user:password@host:port/path?key=value#fragment is seen as a URL is incorrect 🤔

http&#x3A;//user:password@host:port/path?key=value#fragment
http://user:password@host:port/path?key=value#fragment

http://user:password@host:port/path?key=value#fragment
http://user:password@host:port/path?key=value#fragment

^-- it shouldn’t link

azu · Answer 2 · Sat May 15 2021 06:05:30 GMT+0800 (China Standard Time)

However, It seems that gfm plugin can infer the position when parsing valid http://~ URL string.

import unified from "unified";
import remarkGfm from "remark-gfm";
import remarkParse from "remark-parse";

const remark = unified().use(remarkParse).use(remarkGfm);
const ast = remark.parse(`This is https://example.com`);
console.log(JSON.stringify(ast, null, 4));

to

{
    "type": "root",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {
                    "type": "text",
                    "value": "This is ",
                    "position": {
                        "start": {
                            "line": 1,
                            "column": 1,
                            "offset": 0
                        },
                        "end": {
                            "line": 1,
                            "column": 9,
                            "offset": 8
                        }
                    }
                },
                {
                    "type": "link",
                    "title": null,
                    "url": "https://example.com",
                    "children": [
                        {
                            "type": "text",
                            "value": "https://example.com",
                            "position": {
                                "start": {
                                    "line": 1,
                                    "column": 9,
                                    "offset": 8
                                },
                                "end": {
                                    "line": 1,
                                    "column": 28,
                                    "offset": 27
                                }
                            }
                        }
                    ],
                    "position": {
                        "start": {
                            "line": 1,
                            "column": 9,
                            "offset": 8
                        },
                        "end": {
                            "line": 1,
                            "column": 28,
                            "offset": 27
                        }
                    }
                }
            ],
            "position": {
                "start": {
                    "line": 1,
                    "column": 1,
                    "offset": 0
                },
                "end": {
                    "line": 1,
                    "column": 28,
                    "offset": 27
                }
            }
        }
    ],
    "position": {
        "start": {
            "line": 1,
            "column": 1,
            "offset": 0
        },
        "end": {
            "line": 1,
            "column": 28,
            "offset": 27
        }
    }
}

I think that it is a bug related to http://.

result of list + blockquote + autolink

import unified from "unified";
import remarkGfm from "remark-gfm";
import remarkParse from "remark-parse";

const remark = unified().use(remarkParse).use(remarkGfm);
const ast = remark.parse(`- > a block quote in a list
    with https://example.com a URL`);
console.log(JSON.stringify(ast, null, 4));

to

{
    "type": "root",
    "children": [
        {
            "type": "list",
            "ordered": false,
            "start": null,
            "spread": false,
            "children": [
                {
                    "type": "listItem",
                    "spread": false,
                    "checked": null,
                    "children": [
                        {
                            "type": "blockquote",
                            "children": [
                                {
                                    "type": "paragraph",
                                    "children": [
                                        {
                                            "type": "text",
                                            "value": "a block quote in a list",
                                            "position": {
                                                "start": {
                                                    "line": 1,
                                                    "column": 5,
                                                    "offset": 4
                                                },
                                                "end": {
                                                    "line": 1,
                                                    "column": 28,
                                                    "offset": 27
                                                }
                                            }
                                        }
                                    ],
                                    "position": {
                                        "start": {
                                            "line": 1,
                                            "column": 5,
                                            "offset": 4
                                        },
                                        "end": {
                                            "line": 1,
                                            "column": 28,
                                            "offset": 27
                                        }
                                    }
                                }
                            ],
                            "position": {
                                "start": {
                                    "line": 1,
                                    "column": 3,
                                    "offset": 2
                                },
                                "end": {
                                    "line": 1,
                                    "column": 28,
                                    "offset": 27
                                }
                            }
                        },
                        {
                            "type": "paragraph",
                            "children": [
                                {
                                    "type": "text",
                                    "value": "with ",
                                    "position": {
                                        "start": {
                                            "line": 2,
                                            "column": 5,
                                            "offset": 32
                                        },
                                        "end": {
                                            "line": 2,
                                            "column": 10,
                                            "offset": 37
                                        }
                                    }
                                },
                                {
                                    "type": "link",
                                    "title": null,
                                    "url": "https://example.com",
                                    "children": [
                                        {
                                            "type": "text",
                                            "value": "https://example.com",
                                            "position": {
                                                "start": {
                                                    "line": 2,
                                                    "column": 10,
                                                    "offset": 37
                                                },
                                                "end": {
                                                    "line": 2,
                                                    "column": 29,
                                                    "offset": 56
                                                }
                                            }
                                        }
                                    ],
                                    "position": {
                                        "start": {
                                            "line": 2,
                                            "column": 10,
                                            "offset": 37
                                        },
                                        "end": {
                                            "line": 2,
                                            "column": 29,
                                            "offset": 56
                                        }
                                    }
                                },
                                {
                                    "type": "text",
                                    "value": " a URL",
                                    "position": {
                                        "start": {
                                            "line": 2,
                                            "column": 29,
                                            "offset": 56
                                        },
                                        "end": {
                                            "line": 2,
                                            "column": 35,
                                            "offset": 62
                                        }
                                    }
                                }
                            ],
                            "position": {
                                "start": {
                                    "line": 2,
                                    "column": 5,
                                    "offset": 32
                                },
                                "end": {
                                    "line": 2,
                                    "column": 35,
                                    "offset": 62
                                }
                            }
                        }
                    ],
                    "position": {
                        "start": {
                            "line": 1,
                            "column": 1,
                            "offset": 0
                        },
                        "end": {
                            "line": 2,
                            "column": 35,
                            "offset": 62
                        }
                    }
                }
            ],
            "position": {
                "start": {
                    "line": 1,
                    "column": 1,
                    "offset": 0
                },
                "end": {
                    "line": 2,
                    "column": 35,
                    "offset": 62
                }
            }
        }
    ],
    "position": {
        "start": {
            "line": 1,
            "column": 1,
            "offset": 0
        },
        "end": {
            "line": 2,
            "column": 35,
            "offset": 62
        }
    }
}

Titus · Answer 3 · Wed May 19 2021 00:13:49 GMT+0800 (China Standard Time)

However, It seems that gfm plugin can infer the position when parsing valid http://~ URL string.

This is exactly what I described in my previous comment!

I think that it is a bug related to http://.

I think it’s acceptable, because it’s impossible to solve.
GH algorithm 2 works on the AST, and in that phase it’s impossible to know the original positional info

The list + blockquote example was to illustrate, here is an actual working example: https://runkit.com/embed/tyohiws7jhem. There is no positional info on the link, because it has to use algorithm 2, and it would be hard to figure out from the AST information where that link was originally.

There is a bug: http://... should not turn into a link.

azu · Answer 4 · Thu May 20 2021 09:26:00 GMT+0800 (China Standard Time)

Thanks for the details.

The one at AST-time can’t infer the original position of certain URLs from text nodes, because the AST only has the starting and ending point, but no knowledge in this case

Does it mean that transforming AST to AST?
Previously, I've written a similar AST to AST parser.
This parser walks AST and parses the node's value for getting text offset. The offset is based on the node's starting point.

https://github.com/azu/sentence-splitter/blob/1663ab26ba4ad29d9ab36616a572231a5e9d637c/src/sentence-splitter.ts#L182-L224

I expected that all node has position property.
If some node has not position, we need to always add if(!node.position) { return; /* can not handle */ } statement.
I feel it is an invalid node because it is the same text type but some nodes have not position property.
So I want to remove the invalid node. However, removing the node and will break another one.

Actually, remark@12 has position property on this example.
https://runkit.com/azu/60a5b5b63d9183001a1c5670

Is there a workaround for avoiding creating non-postion node?
It is blocker to update remark-parser, and I need to investigate it.

azu · Answer 5 · Sat May 22 2021 12:21:33 GMT+0800 (China Standard Time)

I have looked into this.
I think that it is possible, but the current missing some parts for implementation.

We may need following changes to transformGfmAutolinkLiterals.
This is pseudo-code diff for https://github.com/syntax-tree/mdast-util-gfm-autolink-literal/blob/515f945feda9ec57d53ab09dfd458a86e9ff8eb8/from-markdown.js#L81-L86

diff --git a/from-markdown.js b/from-markdown.js
index 8a24bcc..cc02eab 100644
--- a/from-markdown.js
+++ b/from-markdown.js
@@ -82,7 +82,22 @@ function findUrl($0, protocol, domain, path, match) {
     type: 'link',
     title: null,
     url: prefix + protocol + parts[0],
-    children: [{type: 'text', value: protocol + parts[0]}]
+    children: [{
+      type: 'text',
+      value: protocol + parts[0],
+      position: {
+        start: {
+          line: calcLineFromOffset(node.position.start.offset + match.index),
+          column: calcColumnFromOffset(node.position.start.offset + match.index),
+          offset: node.position.start.offset + match.index
+        },
+        end: {
+          line: calcLineFromOffset(node.position.start.offset + match.index + $0.length),
+          column: calcColumnFromOffset(node.position.start.offset + match.index + $0.length),
+          offset: node.position.start.offset + match.index + $0.length
+        }
+      }
+    }]
   }
 
   if (parts[1]) {

Lack of implementation:

mast-util-find-and-replace does not pass the matched node to replace function
- It will needed to infer start offset correctly.
- This is node in pseudo-code
mdast-util-from-markdown does not pass raw text(parsing text) to MdastExtension
- It will needed to calculate line and column from offset
  - structured-source is example of this logics implementation
- ~~Or, get raw text from the node.~~
- This is calcLineFromOffset and calcColumnFromOffset in pseudo-code

azu · Answer 6 · Sat May 22 2021 14:05:26 GMT+0800 (China Standard Time)

📝 Workaround: disable mdast-util-gfm-autolink-literal transform.

import unified from "unified";
// @ts-ignore
import autolinkLiteral from "mdast-util-gfm-autolink-literal/from-markdown";
// disable autolink transforms
autolinkLiteral.transforms = [];
import remarkGfm from "remark-gfm";
import remarkParse from "remark-parse";
const remark = unified().use(remarkParse).use(remarkGfm);
const ast = remark.parse(`http&#x3A;//user:password@host:port/path?key=value#fragment`);
console.log(JSON.stringify(ast, null, 4));

Titus · Answer 7 · Sat May 22 2021 23:18:45 GMT+0800 (China Standard Time)

Does it mean that transforming AST to AST?
Previously, I've written a similar AST to AST parser.
This parser walks AST and parses the node's value for getting text offset. The offset is based on the > node's starting point.

Given this markdown:

- > a block quote in a list
    with https://example.com a URL

The text node has this value: a block quote in a list\nwith https://example.com a URL.

It starts at 1:5, and ends at 2:35.

The markdown could look like this:

- > a block quote in a list
with https://example.com a URL

This:

- > a block quote in a list
                                  with https://example.com a URL

This:

:::some custom markdown structure
    a block quote in a list
    with https://example.com a URL
:::

How do you know where the link starts and ends?

I expected that all node has position property.
If some node has not position, we need to always add if(!node.position) { return; /* can not handle > */ } statement.
I feel it is an invalid node because it is the same text type but some nodes have not position property.
So I want to remove the invalid node. However, removing the node and will break another one.

The position field on nodes is optional: https://github.com/syntax-tree/unist#node.
It is valid, and it is heavily used when doing more complex tasks. For example, when processing MDX, the input is markdown-like (remark), which goes through to HTML (rehype), and then compiles to JavaScript (recma).
Several nodes in this process are added (one example), without positional information, because they were not in the original document.

Actually, remark@12 has position property on this example.

Yes it did, but it also didn’t match GFM. To properly match GFM, the two algorithms need to be used: one at parse time, one at “AST time”.

github-actions · Answer 8 · Sun Aug 15 2021 19:18:30 GMT+0800 (China Standard Time)

Hi! This was closed. Team: If this was fixed, please add phase/solved. Otherwise, please add one of the no/* labels.

Titus · Answer 9 · Sun Aug 15 2021 19:18:54 GMT+0800 (China Standard Time)

I fixed what seemed like the bug.
However, unfortunately I don’t see a way to “solve” this issue (text nodes w/o position) :(

azu · Answer 10 · Sun Aug 15 2021 19:49:58 GMT+0800 (China Standard Time)

Thanks

However, unfortunately I don’t see a way to “solve” this issue (text nodes w/o position) :(

Would it be possible to provide a option that disable mdast-util-gfm-autolink-literal?
https://github.com/remarkjs/remark-gfm#unifieduseremarkgfm-options
If remark-gfm has the option, we can avoid to use a workaround.
(📝 Probably, this workaround will not work with Node.js exports filed)

Titus · Answer 11 · Sun Aug 15 2021 20:55:40 GMT+0800 (China Standard Time)

I don’t see a reason for an option to turn all of mdast-util-gfm-autolink-literal off in mdast-util-gfm or remark-gfm. The whole point of these is that they match GFM. And the separate packages can be used and matched however you please (see § When to use this).

I somewhat see an option to turn algorithm2 off, which is what you do in the workaround, in mdast-util-gfm-autolink-literal, which would be a major (because the API has to change to accept options), but bubble through in mdast-util-gfm and remark-gfm as a minor release.

The reason is that algorithm2 does not work in comments, and thus that option would somewhat match some flavor of GFM. But not 100%, because email links do work in comments: https://gist.github.com/wooorm/076fd173c31ba6837f17591d5932476e.
so this option would not match comments.

I’m not convinced yet of such an option. @ChristianMurphy?
And if it would be added: a question would be how to name it.

(and I think your workaround will work fine with ESM!)

Christian Murphy · Answer 12 · Mon Aug 16 2021 03:28:04 GMT+0800 (China Standard Time)

I don't really see the option as making much sense. 🤔
*-gfm plugins should, to the best of their ability, match GFM, having an option for "be less like GFM" seems to go against its goal. 🤔

Taking a step back, position on Node is optional. This is far from the only case where it can happen. I'd be curious to hear how it's being used, and discussing what alternative(s) exist for handling these cases.