Some `text` node has not `position` property
azu opened this issue · comments
Subject of the issue
When parsing some text, the text
node has not position
property.
Your environment
- OS: macOs 10.15.7
- Packages:
"remark-gfm": "^1.0.0",
"remark-parse": "^9.0.0",
"unified": "^9.2.1"
Env:
Node: 14.16.1 - ~/.volta/tools/image/node/14.16.1/bin/node
Yarn: 1.22.10 - ~/.volta/tools/image/yarn/1.22.10/bin/yarn
npm: 6.14.12 - ~/.volta/tools/image/node/14.16.1/bin/npm
Steps to reproduce
Reproducing repo: https://github.com/azu/remark-gfm-no-position-bug
import unified from "unified";
import remarkGfm from "remark-gfm";
import remarkParse from "remark-parse";
const remark = unified().use(remarkParse).use(remarkGfm);
const ast = remark.parse(`http://user:password@host:port/path?key=value#fragment`);
console.log(JSON.stringify(ast, null, 4));
output:
{
"type": "root",
"children": [
{
"type": "paragraph",
"children": [
{
"type": "text",
"value": "http://user:password@host:port/path?key=value#fragment"
}
],
"position": {
"start": {
"line": 1,
"column": 1,
"offset": 0
},
"end": {
"line": 1,
"column": 60,
"offset": 59
}
}
}
],
"position": {
"start": {
"line": 1,
"column": 1,
"offset": 0
},
"end": {
"line": 1,
"column": 60,
"offset": 59
}
}
}
Expected behavior
The text
node should have position
property.
{
"type": "root",
"children": [
{
"type": "paragraph",
"children": [
{
"type": "text",
"value": "http://user:password@host:port/path?key=value#fragment",
"position": {
"start": {
"line": 1,
"column": 1,
"offset": 0
},
"end": {
"line": 1,
"column": 60,
"offset": 59
}
}
}
],
"position": {
"start": {
"line": 1,
"column": 1,
"offset": 0
},
"end": {
"line": 1,
"column": 60,
"offset": 59
}
}
}
],
"position": {
"start": {
"line": 1,
"column": 1,
"offset": 0
},
"end": {
"line": 1,
"column": 60,
"offset": 59
}
}
}
Actual behavior
The text
node has not position
property.
{
"type": "root",
"children": [
{
"type": "paragraph",
"children": [
{
"type": "text",
"value": "http://user:password@host:port/path?key=value#fragment"
}
],
"position": {
"start": {
"line": 1,
"column": 1,
"offset": 0
},
"end": {
"line": 1,
"column": 60,
"offset": 59
}
}
}
],
"position": {
"start": {
"line": 1,
"column": 1,
"offset": 0
},
"end": {
"line": 1,
"column": 60,
"offset": 59
}
}
}
Escaping + http://
is something wrong:
http://a
Unfortunately we can’t add them.
GH’s uses two algorithms to add autolinks (see this readme and some of the other issues there). One at parse time, another at “AST” time.
The one at parse time nicely adds positional info. The one at AST-time can’t infer the original position of certain URLs from text nodes, because the AST only has the starting and ending point, but no knowledge in this case:
- > a block quote in a list
with https://example.com a URL
But, the fact that http://user:password@host:port/path?key=value#fragment
is seen as a URL is incorrect 🤔
http://user:password@host:port/path?key=value#fragment
http://user:password@host:port/path?key=value#fragment
http://user:password@host:port/path?key=value#fragment
http://user:password@host:port/path?key=value#fragment
^-- it shouldn’t link
However, It seems that gfm plugin can infer the position
when parsing valid http://~
URL string.
import unified from "unified";
import remarkGfm from "remark-gfm";
import remarkParse from "remark-parse";
const remark = unified().use(remarkParse).use(remarkGfm);
const ast = remark.parse(`This is https://example.com`);
console.log(JSON.stringify(ast, null, 4));
to
{
"type": "root",
"children": [
{
"type": "paragraph",
"children": [
{
"type": "text",
"value": "This is ",
"position": {
"start": {
"line": 1,
"column": 1,
"offset": 0
},
"end": {
"line": 1,
"column": 9,
"offset": 8
}
}
},
{
"type": "link",
"title": null,
"url": "https://example.com",
"children": [
{
"type": "text",
"value": "https://example.com",
"position": {
"start": {
"line": 1,
"column": 9,
"offset": 8
},
"end": {
"line": 1,
"column": 28,
"offset": 27
}
}
}
],
"position": {
"start": {
"line": 1,
"column": 9,
"offset": 8
},
"end": {
"line": 1,
"column": 28,
"offset": 27
}
}
}
],
"position": {
"start": {
"line": 1,
"column": 1,
"offset": 0
},
"end": {
"line": 1,
"column": 28,
"offset": 27
}
}
}
],
"position": {
"start": {
"line": 1,
"column": 1,
"offset": 0
},
"end": {
"line": 1,
"column": 28,
"offset": 27
}
}
}
I think that it is a bug related to http://
.
result of list + blockquote + autolink
import unified from "unified";
import remarkGfm from "remark-gfm";
import remarkParse from "remark-parse";
const remark = unified().use(remarkParse).use(remarkGfm);
const ast = remark.parse(`- > a block quote in a list
with https://example.com a URL`);
console.log(JSON.stringify(ast, null, 4));
to
{
"type": "root",
"children": [
{
"type": "list",
"ordered": false,
"start": null,
"spread": false,
"children": [
{
"type": "listItem",
"spread": false,
"checked": null,
"children": [
{
"type": "blockquote",
"children": [
{
"type": "paragraph",
"children": [
{
"type": "text",
"value": "a block quote in a list",
"position": {
"start": {
"line": 1,
"column": 5,
"offset": 4
},
"end": {
"line": 1,
"column": 28,
"offset": 27
}
}
}
],
"position": {
"start": {
"line": 1,
"column": 5,
"offset": 4
},
"end": {
"line": 1,
"column": 28,
"offset": 27
}
}
}
],
"position": {
"start": {
"line": 1,
"column": 3,
"offset": 2
},
"end": {
"line": 1,
"column": 28,
"offset": 27
}
}
},
{
"type": "paragraph",
"children": [
{
"type": "text",
"value": "with ",
"position": {
"start": {
"line": 2,
"column": 5,
"offset": 32
},
"end": {
"line": 2,
"column": 10,
"offset": 37
}
}
},
{
"type": "link",
"title": null,
"url": "https://example.com",
"children": [
{
"type": "text",
"value": "https://example.com",
"position": {
"start": {
"line": 2,
"column": 10,
"offset": 37
},
"end": {
"line": 2,
"column": 29,
"offset": 56
}
}
}
],
"position": {
"start": {
"line": 2,
"column": 10,
"offset": 37
},
"end": {
"line": 2,
"column": 29,
"offset": 56
}
}
},
{
"type": "text",
"value": " a URL",
"position": {
"start": {
"line": 2,
"column": 29,
"offset": 56
},
"end": {
"line": 2,
"column": 35,
"offset": 62
}
}
}
],
"position": {
"start": {
"line": 2,
"column": 5,
"offset": 32
},
"end": {
"line": 2,
"column": 35,
"offset": 62
}
}
}
],
"position": {
"start": {
"line": 1,
"column": 1,
"offset": 0
},
"end": {
"line": 2,
"column": 35,
"offset": 62
}
}
}
],
"position": {
"start": {
"line": 1,
"column": 1,
"offset": 0
},
"end": {
"line": 2,
"column": 35,
"offset": 62
}
}
}
],
"position": {
"start": {
"line": 1,
"column": 1,
"offset": 0
},
"end": {
"line": 2,
"column": 35,
"offset": 62
}
}
}
However, It seems that gfm plugin can infer the position when parsing valid http://~ URL string.
This is exactly what I described in my previous comment!
I think that it is a bug related to
http://
.
I think it’s acceptable, because it’s impossible to solve.
GH algorithm 2 works on the AST, and in that phase it’s impossible to know the original positional info
The list + blockquote example was to illustrate, here is an actual working example: https://runkit.com/embed/tyohiws7jhem. There is no positional info on the link, because it has to use algorithm 2, and it would be hard to figure out from the AST information where that link was originally.
There is a bug: http://...
should not turn into a link.
Thanks for the details.
The one at AST-time can’t infer the original position of certain URLs from text nodes, because the AST only has the starting and ending point, but no knowledge in this case
Does it mean that transforming AST to AST?
Previously, I've written a similar AST to AST parser.
This parser walks AST and parses the node's value for getting text offset. The offset is based on the node's starting point.
I expected that all node has position
property.
If some node has not position
, we need to always add if(!node.position) { return; /* can not handle */ }
statement.
I feel it is an invalid node because it is the same text
type but some nodes have not position
property.
So I want to remove the invalid node. However, removing the node and will break another one.
Actually, remark@12 has position
property on this example.
https://runkit.com/azu/60a5b5b63d9183001a1c5670
Is there a workaround for avoiding creating non-postion
node?
It is blocker to update remark-parser, and I need to investigate it.
I have looked into this.
I think that it is possible, but the current missing some parts for implementation.
We may need following changes to transformGfmAutolinkLiterals.
This is pseudo-code diff for https://github.com/syntax-tree/mdast-util-gfm-autolink-literal/blob/515f945feda9ec57d53ab09dfd458a86e9ff8eb8/from-markdown.js#L81-L86
diff --git a/from-markdown.js b/from-markdown.js
index 8a24bcc..cc02eab 100644
--- a/from-markdown.js
+++ b/from-markdown.js
@@ -82,7 +82,22 @@ function findUrl($0, protocol, domain, path, match) {
type: 'link',
title: null,
url: prefix + protocol + parts[0],
- children: [{type: 'text', value: protocol + parts[0]}]
+ children: [{
+ type: 'text',
+ value: protocol + parts[0],
+ position: {
+ start: {
+ line: calcLineFromOffset(node.position.start.offset + match.index),
+ column: calcColumnFromOffset(node.position.start.offset + match.index),
+ offset: node.position.start.offset + match.index
+ },
+ end: {
+ line: calcLineFromOffset(node.position.start.offset + match.index + $0.length),
+ column: calcColumnFromOffset(node.position.start.offset + match.index + $0.length),
+ offset: node.position.start.offset + match.index + $0.length
+ }
+ }
+ }]
}
if (parts[1]) {
Lack of implementation:
- mast-util-find-and-replace does not pass the matched
node
toreplace
function- It will needed to infer start offset correctly.
- This is
node
in pseudo-code
- mdast-util-from-markdown does not pass raw text(parsing text) to MdastExtension
- It will needed to calculate
line
andcolumn
fromoffset
- structured-source is example of this logics implementation
Or, get raw text from thenode
.- This is
calcLineFromOffset
andcalcColumnFromOffset
in pseudo-code
- It will needed to calculate
📝 Workaround: disable mdast-util-gfm-autolink-literal
transform.
import unified from "unified";
// @ts-ignore
import autolinkLiteral from "mdast-util-gfm-autolink-literal/from-markdown";
// disable autolink transforms
autolinkLiteral.transforms = [];
import remarkGfm from "remark-gfm";
import remarkParse from "remark-parse";
const remark = unified().use(remarkParse).use(remarkGfm);
const ast = remark.parse(`http://user:password@host:port/path?key=value#fragment`);
console.log(JSON.stringify(ast, null, 4));
Does it mean that transforming AST to AST?
Previously, I've written a similar AST to AST parser.
This parser walks AST and parses the node's value for getting text offset. The offset is based on the > node's starting point.
Given this markdown:
- > a block quote in a list
with https://example.com a URL
The text node has this value: a block quote in a list\nwith https://example.com a URL
.
It starts at 1:5, and ends at 2:35.
The markdown could look like this:
- > a block quote in a list
with https://example.com a URL
This:
- > a block quote in a list
with https://example.com a URL
This:
:::some custom markdown structure
a block quote in a list
with https://example.com a URL
:::
How do you know where the link starts and ends?
I expected that all node has position property.
If some node has not position, we need to always addif(!node.position) { return; /* can not handle > */ }
statement.
I feel it is an invalid node because it is the same text type but some nodes have not position property.
So I want to remove the invalid node. However, removing the node and will break another one.
The position
field on nodes is optional: https://github.com/syntax-tree/unist#node.
It is valid, and it is heavily used when doing more complex tasks. For example, when processing MDX, the input is markdown-like (remark), which goes through to HTML (rehype), and then compiles to JavaScript (recma).
Several nodes in this process are added (one example), without positional information, because they were not in the original document.
Actually, remark@12 has position property on this example.
Yes it did, but it also didn’t match GFM. To properly match GFM, the two algorithms need to be used: one at parse time, one at “AST time”.
Hi! This was closed. Team: If this was fixed, please add phase/solved
. Otherwise, please add one of the no/*
labels.
I fixed what seemed like the bug.
However, unfortunately I don’t see a way to “solve” this issue (text nodes w/o position) :(
Thanks
However, unfortunately I don’t see a way to “solve” this issue (text nodes w/o position) :(
Would it be possible to provide a option that disable mdast-util-gfm-autolink-literal?
https://github.com/remarkjs/remark-gfm#unifieduseremarkgfm-options
If remark-gfm has the option, we can avoid to use a workaround.
(📝 Probably, this workaround will not work with Node.js exports
filed)
I don’t see a reason for an option to turn all of mdast-util-gfm-autolink-literal
off in mdast-util-gfm
or remark-gfm
. The whole point of these is that they match GFM. And the separate packages can be used and matched however you please (see § When to use this).
I somewhat see an option to turn algorithm2 off, which is what you do in the workaround, in mdast-util-gfm-autolink-literal
, which would be a major (because the API has to change to accept options), but bubble through in mdast-util-gfm
and remark-gfm
as a minor release.
The reason is that algorithm2 does not work in comments, and thus that option would somewhat match some flavor of GFM. But not 100%, because email links do work in comments: https://gist.github.com/wooorm/076fd173c31ba6837f17591d5932476e.
so this option would not match comments.
I’m not convinced yet of such an option. @ChristianMurphy?
And if it would be added: a question would be how to name it.
(and I think your workaround will work fine with ESM!)
I don't really see the option as making much sense. 🤔
*-gfm
plugins should, to the best of their ability, match GFM, having an option for "be less like GFM" seems to go against its goal. 🤔
Taking a step back, position
on Node
is optional. This is far from the only case where it can happen. I'd be curious to hear how it's being used, and discussing what alternative(s) exist for handling these cases.