remarkjs / remark-gfm

remark plugin to support GFM (autolink literals, footnotes, strikethrough, tables, tasklists)

Home Page:https://remark.js.org

Some `text` node has not `position` property

azu opened this issue · comments

commented

Subject of the issue

When parsing certain text, the resulting text node has no position property.

Your environment

  • OS: macOS 10.15.7
  • Packages:
    "remark-gfm": "^1.0.0",
    "remark-parse": "^9.0.0",
    "unified": "^9.2.1"

Env:

Node: 14.16.1 - ~/.volta/tools/image/node/14.16.1/bin/node
Yarn: 1.22.10 - ~/.volta/tools/image/yarn/1.22.10/bin/yarn
npm: 6.14.12 - ~/.volta/tools/image/node/14.16.1/bin/npm

Steps to reproduce

Reproducing repo: https://github.com/azu/remark-gfm-no-position-bug

import unified from "unified";
import remarkGfm from "remark-gfm";
import remarkParse from "remark-parse";

const remark = unified().use(remarkParse).use(remarkGfm);
const ast = remark.parse(`http://user:password@host:port/path?key=value#fragment`);
console.log(JSON.stringify(ast, null, 4));

output:

{
    "type": "root",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {
                    "type": "text",
                    "value": "http://user:password@host:port/path?key=value#fragment"
                }
            ],
            "position": {
                "start": {
                    "line": 1,
                    "column": 1,
                    "offset": 0
                },
                "end": {
                    "line": 1,
                    "column": 60,
                    "offset": 59
                }
            }
        }
    ],
    "position": {
        "start": {
            "line": 1,
            "column": 1,
            "offset": 0
        },
        "end": {
            "line": 1,
            "column": 60,
            "offset": 59
        }
    }
}

Expected behavior

The text node should have a position property.

{
    "type": "root",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {
                    "type": "text",
                    "value": "http://user:password@host:port/path?key=value#fragment",
                    "position": {
                        "start": {
                            "line": 1,
                            "column": 1,
                            "offset": 0
                        },
                        "end": {
                            "line": 1,
                            "column": 60,
                            "offset": 59
                        }
                    }
                }
            ],
            "position": {
                "start": {
                    "line": 1,
                    "column": 1,
                    "offset": 0
                },
                "end": {
                    "line": 1,
                    "column": 60,
                    "offset": 59
                }
            }
        }
    ],
    "position": {
        "start": {
            "line": 1,
            "column": 1,
            "offset": 0
        },
        "end": {
            "line": 1,
            "column": 60,
            "offset": 59
        }
    }
}

Actual behavior

The text node has no position property.

{
    "type": "root",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {
                    "type": "text",
                    "value": "http://user:password@host:port/path?key=value#fragment"
                }
            ],
            "position": {
                "start": {
                    "line": 1,
                    "column": 1,
                    "offset": 0
                },
                "end": {
                    "line": 1,
                    "column": 60,
                    "offset": 59
                }
            }
        }
    ],
    "position": {
        "start": {
            "line": 1,
            "column": 1,
            "offset": 0
        },
        "end": {
            "line": 1,
            "column": 60,
            "offset": 59
        }
    }
}

Escaping + http:// is something wrong:
http://a

commented

Unfortunately we can’t add them.

GH uses two algorithms to add autolinks (see this readme and some of the other issues there). One runs at parse time, the other at “AST” time.

The one at parse time nicely adds positional info. The one at AST time can’t infer the original position of certain URLs from text nodes, because the AST only has the starting and ending point of the text node, and no knowledge of how the source was laid out in between, as in this case:

- > a block quote in a list
    with https://example.com a URL

But, the fact that http://user:password@host:port/path?key=value#fragment is seen as a URL is incorrect 🤔

http://user:password@host:port/path?key=value#fragment
http://user:password@host:port/path?key=value#fragment

http://user:password@host:port/path?key=value#fragment
http://user:password@host:port/path?key=value#fragment

^-- it shouldn’t link

commented

However, it seems that the gfm plugin can infer the position when parsing a valid http://~ URL string.

import unified from "unified";
import remarkGfm from "remark-gfm";
import remarkParse from "remark-parse";

const remark = unified().use(remarkParse).use(remarkGfm);
const ast = remark.parse(`This is https://example.com`);
console.log(JSON.stringify(ast, null, 4));

to

{
    "type": "root",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {
                    "type": "text",
                    "value": "This is ",
                    "position": {
                        "start": {
                            "line": 1,
                            "column": 1,
                            "offset": 0
                        },
                        "end": {
                            "line": 1,
                            "column": 9,
                            "offset": 8
                        }
                    }
                },
                {
                    "type": "link",
                    "title": null,
                    "url": "https://example.com",
                    "children": [
                        {
                            "type": "text",
                            "value": "https://example.com",
                            "position": {
                                "start": {
                                    "line": 1,
                                    "column": 9,
                                    "offset": 8
                                },
                                "end": {
                                    "line": 1,
                                    "column": 28,
                                    "offset": 27
                                }
                            }
                        }
                    ],
                    "position": {
                        "start": {
                            "line": 1,
                            "column": 9,
                            "offset": 8
                        },
                        "end": {
                            "line": 1,
                            "column": 28,
                            "offset": 27
                        }
                    }
                }
            ],
            "position": {
                "start": {
                    "line": 1,
                    "column": 1,
                    "offset": 0
                },
                "end": {
                    "line": 1,
                    "column": 28,
                    "offset": 27
                }
            }
        }
    ],
    "position": {
        "start": {
            "line": 1,
            "column": 1,
            "offset": 0
        },
        "end": {
            "line": 1,
            "column": 28,
            "offset": 27
        }
    }
}

I think that it is a bug related to http://.

Result of list + blockquote + autolink:
import unified from "unified";
import remarkGfm from "remark-gfm";
import remarkParse from "remark-parse";

const remark = unified().use(remarkParse).use(remarkGfm);
const ast = remark.parse(`- > a block quote in a list
    with https://example.com a URL`);
console.log(JSON.stringify(ast, null, 4));

to

{
    "type": "root",
    "children": [
        {
            "type": "list",
            "ordered": false,
            "start": null,
            "spread": false,
            "children": [
                {
                    "type": "listItem",
                    "spread": false,
                    "checked": null,
                    "children": [
                        {
                            "type": "blockquote",
                            "children": [
                                {
                                    "type": "paragraph",
                                    "children": [
                                        {
                                            "type": "text",
                                            "value": "a block quote in a list",
                                            "position": {
                                                "start": {
                                                    "line": 1,
                                                    "column": 5,
                                                    "offset": 4
                                                },
                                                "end": {
                                                    "line": 1,
                                                    "column": 28,
                                                    "offset": 27
                                                }
                                            }
                                        }
                                    ],
                                    "position": {
                                        "start": {
                                            "line": 1,
                                            "column": 5,
                                            "offset": 4
                                        },
                                        "end": {
                                            "line": 1,
                                            "column": 28,
                                            "offset": 27
                                        }
                                    }
                                }
                            ],
                            "position": {
                                "start": {
                                    "line": 1,
                                    "column": 3,
                                    "offset": 2
                                },
                                "end": {
                                    "line": 1,
                                    "column": 28,
                                    "offset": 27
                                }
                            }
                        },
                        {
                            "type": "paragraph",
                            "children": [
                                {
                                    "type": "text",
                                    "value": "with ",
                                    "position": {
                                        "start": {
                                            "line": 2,
                                            "column": 5,
                                            "offset": 32
                                        },
                                        "end": {
                                            "line": 2,
                                            "column": 10,
                                            "offset": 37
                                        }
                                    }
                                },
                                {
                                    "type": "link",
                                    "title": null,
                                    "url": "https://example.com",
                                    "children": [
                                        {
                                            "type": "text",
                                            "value": "https://example.com",
                                            "position": {
                                                "start": {
                                                    "line": 2,
                                                    "column": 10,
                                                    "offset": 37
                                                },
                                                "end": {
                                                    "line": 2,
                                                    "column": 29,
                                                    "offset": 56
                                                }
                                            }
                                        }
                                    ],
                                    "position": {
                                        "start": {
                                            "line": 2,
                                            "column": 10,
                                            "offset": 37
                                        },
                                        "end": {
                                            "line": 2,
                                            "column": 29,
                                            "offset": 56
                                        }
                                    }
                                },
                                {
                                    "type": "text",
                                    "value": " a URL",
                                    "position": {
                                        "start": {
                                            "line": 2,
                                            "column": 29,
                                            "offset": 56
                                        },
                                        "end": {
                                            "line": 2,
                                            "column": 35,
                                            "offset": 62
                                        }
                                    }
                                }
                            ],
                            "position": {
                                "start": {
                                    "line": 2,
                                    "column": 5,
                                    "offset": 32
                                },
                                "end": {
                                    "line": 2,
                                    "column": 35,
                                    "offset": 62
                                }
                            }
                        }
                    ],
                    "position": {
                        "start": {
                            "line": 1,
                            "column": 1,
                            "offset": 0
                        },
                        "end": {
                            "line": 2,
                            "column": 35,
                            "offset": 62
                        }
                    }
                }
            ],
            "position": {
                "start": {
                    "line": 1,
                    "column": 1,
                    "offset": 0
                },
                "end": {
                    "line": 2,
                    "column": 35,
                    "offset": 62
                }
            }
        }
    ],
    "position": {
        "start": {
            "line": 1,
            "column": 1,
            "offset": 0
        },
        "end": {
            "line": 2,
            "column": 35,
            "offset": 62
        }
    }
}
commented

However, it seems that the gfm plugin can infer the position when parsing a valid http://~ URL string.

This is exactly what I described in my previous comment!

I think that it is a bug related to http://.

I think it’s acceptable, because it’s impossible to solve.
GH algorithm 2 works on the AST, and in that phase it’s impossible to know the original positional info.


The list + blockquote example was to illustrate, here is an actual working example: https://runkit.com/embed/tyohiws7jhem. There is no positional info on the link, because it has to use algorithm 2, and it would be hard to figure out from the AST information where that link was originally.
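To illustrate why nodes created at AST time carry no position, here is a minimal sketch of an AST-time pass. It is a simplified stand-in, not the actual mdast-util-gfm-autolink-literal code: the function name splitTextNode and the URL pattern are assumptions made for this example. The point is that the replacement nodes are built from the string value alone, so there is no source location to attach to them.

```javascript
// Simplified stand-in for an AST-time autolink pass (NOT the real
// mdast-util-gfm-autolink-literal implementation). It splits one text
// node's value on URL-like matches and returns the replacement nodes.
function splitTextNode(node) {
  const urlPattern = /https?:\/\/[^\s<]+/g // deliberately naive pattern
  const result = []
  let last = 0
  let match

  while ((match = urlPattern.exec(node.value)) !== null) {
    if (match.index > last) {
      // Text before the match: built from the value only, so no position.
      result.push({type: 'text', value: node.value.slice(last, match.index)})
    }
    result.push({
      type: 'link',
      title: null,
      url: match[0],
      children: [{type: 'text', value: match[0]}]
    })
    last = match.index + match[0].length
  }

  if (last < node.value.length) {
    result.push({type: 'text', value: node.value.slice(last)})
  }
  return result
}

// Hand-written input node; the position values are illustrative.
const original = {
  type: 'text',
  value: 'see https://example.com now',
  position: {
    start: {line: 1, column: 1, offset: 0},
    end: {line: 1, column: 28, offset: 27}
  }
}

const replaced = splitTextNode(original)
// Log each node type and whether it carries a position field.
console.log(replaced.map((n) => [n.type, 'position' in n]))
```

The original node's position covers the whole value, but nothing in the pass can say where each piece sits in the source document.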

There is a bug: http://... should not turn into a link.

commented

Thanks for the details.

The one at AST-time can’t infer the original position of certain URLs from text nodes, because the AST only has the starting and ending point, but no knowledge in this case

Does that mean it transforms AST to AST?
Previously, I've written a similar AST to AST parser.
That parser walks the AST and parses each node's value to get text offsets. The offset is based on the node's starting point.

I expected all nodes to have a position property.
If some nodes have no position, we always need to add an if(!node.position) { return; /* can not handle */ } statement.
It feels like an invalid node to me, because nodes of the same text type sometimes have a position property and sometimes do not.
So I would like to remove the invalid node. However, removing the node will break something else.

Actually, remark@12 has a position property in this example.
https://runkit.com/azu/60a5b5b63d9183001a1c5670

Is there a workaround to avoid creating nodes without a position?
This is a blocker for updating remark-parse, and I need to investigate it.

commented

I have looked into this.
I think it is possible, but some parts needed for the implementation are currently missing.

We may need the following changes to transformGfmAutolinkLiterals.
This is a pseudo-code diff for https://github.com/syntax-tree/mdast-util-gfm-autolink-literal/blob/515f945feda9ec57d53ab09dfd458a86e9ff8eb8/from-markdown.js#L81-L86

diff --git a/from-markdown.js b/from-markdown.js
index 8a24bcc..cc02eab 100644
--- a/from-markdown.js
+++ b/from-markdown.js
@@ -82,7 +82,22 @@ function findUrl($0, protocol, domain, path, match) {
     type: 'link',
     title: null,
     url: prefix + protocol + parts[0],
-    children: [{type: 'text', value: protocol + parts[0]}]
+    children: [{
+      type: 'text',
+      value: protocol + parts[0],
+      position: {
+        start: {
+          line: calcLineFromOffset(node.position.start.offset + match.index),
+          column: calcColumnFromOffset(node.position.start.offset + match.index),
+          offset: node.position.start.offset + match.index
+        },
+        end: {
+          line: calcLineFromOffset(node.position.start.offset + match.index + $0.length),
+          column: calcColumnFromOffset(node.position.start.offset + match.index + $0.length),
+          offset: node.position.start.offset + match.index + $0.length
+        }
+      }
+    }]
   }
 
   if (parts[1]) {

Missing pieces:

  • mdast-util-find-and-replace does not pass the matched node to the replace function
    • This is needed to infer the start offset correctly.
    • This is node in the pseudo-code
  • mdast-util-from-markdown does not pass the raw text (the text being parsed) to MdastExtension
    • This is needed to calculate line and column from an offset
    • Or, get raw text from the node.
    • This is calcLineFromOffset and calcColumnFromOffset in pseudo-code
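
As a rough sketch of what calcLineFromOffset and calcColumnFromOffset would have to do, assuming the raw source text were available: the helper name createOffsetToPoint below is hypothetical, and it assumes "\n" line endings. It only converts an offset that is already known to be correct; it does not solve the problem of finding that offset from a text node's value.

```javascript
// Hypothetical helper: map a 0-based offset into `source` to a unist-style
// point (1-based line and column). Assumes "\n" line endings.
function createOffsetToPoint(source) {
  // Record the offset at which every line starts.
  const lineStarts = [0]
  for (let i = 0; i < source.length; i++) {
    if (source[i] === '\n') lineStarts.push(i + 1)
  }

  return function offsetToPoint(offset) {
    if (offset < 0 || offset > source.length) {
      throw new RangeError('offset out of range: ' + offset)
    }
    // Binary search for the last line start that is <= offset.
    let low = 0
    let high = lineStarts.length - 1
    while (low < high) {
      const mid = Math.ceil((low + high) / 2)
      if (lineStarts[mid] <= offset) low = mid
      else high = mid - 1
    }
    return {line: low + 1, column: offset - lineStarts[low] + 1, offset}
  }
}

// The list + blockquote source used earlier in this thread.
const source = '- > a block quote in a list\n    with https://example.com a URL'
const toPoint = createOffsetToPoint(source)
const linkStart = toPoint(source.indexOf('https://'))
console.log(linkStart)
```

For this source the computed start of the link agrees with the position reported in the parser output above (line 2, column 10, offset 37).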
commented

📝 Workaround: disable mdast-util-gfm-autolink-literal transform.

import unified from "unified";
// @ts-ignore
import autolinkLiteral from "mdast-util-gfm-autolink-literal/from-markdown";
// disable autolink transforms
autolinkLiteral.transforms = [];
import remarkGfm from "remark-gfm";
import remarkParse from "remark-parse";
const remark = unified().use(remarkParse).use(remarkGfm);
const ast = remark.parse(`http://user:password@host:port/path?key=value#fragment`);
console.log(JSON.stringify(ast, null, 4));
commented

Does that mean it transforms AST to AST?
Previously, I've written a similar AST to AST parser.
That parser walks the AST and parses each node's value to get text offsets. The offset is based on the node's starting point.

Given this markdown:

- > a block quote in a list
    with https://example.com a URL

The text node has this value: a block quote in a list\nwith https://example.com a URL.

It starts at 1:5, and ends at 2:35.

The markdown could look like this:

- > a block quote in a list
with https://example.com a URL

This:

- > a block quote in a list
                                  with https://example.com a URL

This:

:::some custom markdown structure
    a block quote in a list
    with https://example.com a URL
:::

How do you know where the link starts and ends?
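
The ambiguity can be shown with a small sketch. It takes the text value and start point from this comment as given; the node objects are hand-written rather than parser output, and naiveOffset is a name made up for this example. The naive rule "node start offset plus index in the value" is right for one source layout and wrong for the other, even though the node looks identical in both.

```javascript
// Naive guess: assume the node's value is a verbatim slice of the source,
// so a match's offset is the node's start offset plus its index in the value.
function naiveOffset(textNode, indexInValue) {
  return textNode.position.start.offset + indexInValue
}

// The text value described in the comment, starting at 1:5 (offset 4).
const value = 'a block quote in a list\nwith https://example.com a URL'
const indexInValue = value.indexOf('https://')

// Two source layouts that are said to yield the same text value.
const sources = [
  '- > a block quote in a list\nwith https://example.com a URL',
  '- > a block quote in a list\n    with https://example.com a URL'
]

const results = sources.map((source) => {
  // Hand-written node: same value and same start point for both layouts.
  const node = {
    type: 'text',
    value,
    position: {start: {line: 1, column: 5, offset: 4}}
  }
  const guess = naiveOffset(node, indexInValue)
  const actual = source.indexOf('https://')
  return {guess, actual, correct: guess === actual}
})

console.log(results)
```

The indentation on the second line is dropped from the value, so the guess drifts by exactly that many characters in the second layout.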

I expected all nodes to have a position property.
If some nodes have no position, we always need to add an if(!node.position) { return; /* can not handle */ } statement.
It feels like an invalid node to me, because nodes of the same text type sometimes have a position property and sometimes do not.
So I would like to remove the invalid node. However, removing the node will break something else.

The position field on nodes is optional: https://github.com/syntax-tree/unist#node.
It is valid, and it is heavily used when doing more complex tasks. For example, when processing MDX, the input is markdown-like (remark), which goes through to HTML (rehype), and then compiles to JavaScript (recma).
Several nodes in this process are added (one example), without positional information, because they were not in the original document.
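
Since position is optional, one possible way for a consumer to cope is to fall back to the nearest ancestor that does have a position, instead of skipping the node. This is only a sketch of that idea, not something proposed in this thread: the function name collectTextPositions and the exact flag are made up for illustration, and the example tree is hand-written.

```javascript
// Walk a unist-like tree. For each text node, report its own position,
// or the nearest ancestor's position when it has none.
function collectTextPositions(node, inherited = null, out = []) {
  const position = node.position || inherited
  if (node.type === 'text') {
    // `exact` tells the caller whether the position belongs to the node itself.
    out.push({value: node.value, position, exact: Boolean(node.position)})
  }
  for (const child of node.children || []) {
    collectTextPositions(child, position, out)
  }
  return out
}

// Hand-written tree: a paragraph with a position that contains a link
// and a text node without one.
const tree = {
  type: 'paragraph',
  position: {
    start: {line: 1, column: 1, offset: 0},
    end: {line: 1, column: 20, offset: 19}
  },
  children: [
    {
      type: 'link',
      title: null,
      url: 'https://example.com',
      children: [{type: 'text', value: 'https://example.com'}]
    }
  ]
}

const found = collectTextPositions(tree)
console.log(JSON.stringify(found, null, 2))
```

The fallback range is coarser than the node itself, so a caller that needs exact columns still has to treat the exact: false case differently.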

Actually, remark@12 has a position property in this example.

Yes it did, but it also didn’t match GFM. To properly match GFM, the two algorithms need to be used: one at parse time, one at “AST time”.

Hi! This was closed. Team: If this was fixed, please add phase/solved. Otherwise, please add one of the no/* labels.

commented

I fixed what seemed like the bug.
However, unfortunately I don’t see a way to “solve” this issue (text nodes w/o position) :(

commented

Thanks

However, unfortunately I don’t see a way to “solve” this issue (text nodes w/o position) :(

Would it be possible to provide an option that disables mdast-util-gfm-autolink-literal?
https://github.com/remarkjs/remark-gfm#unifieduseremarkgfm-options
If remark-gfm had such an option, we could avoid using a workaround.
(📝 Probably, this workaround will not work with the Node.js exports field)

commented

I don’t see a reason for an option to turn all of mdast-util-gfm-autolink-literal off in mdast-util-gfm or remark-gfm. The whole point of these is that they match GFM. And the separate packages can be used and matched however you please (see § When to use this).

I somewhat see an option to turn algorithm2 off, which is what you do in the workaround, in mdast-util-gfm-autolink-literal, which would be a major (because the API has to change to accept options), but bubble through in mdast-util-gfm and remark-gfm as a minor release.

The reason is that algorithm2 does not work in comments, and thus that option would somewhat match some flavor of GFM. But not 100%, because email links do work in comments: https://gist.github.com/wooorm/076fd173c31ba6837f17591d5932476e.
So this option would not exactly match comments either.

I’m not convinced yet of such an option. @ChristianMurphy?
And if it would be added: a question would be how to name it.

(and I think your workaround will work fine with ESM!)

I don't really see the option as making much sense. 🤔
*-gfm plugins should, to the best of their ability, match GFM, having an option for "be less like GFM" seems to go against its goal. 🤔

Taking a step back, position on Node is optional. This is far from the only case where it can happen. I'd be curious to hear how it's being used, and discussing what alternative(s) exist for handling these cases.