micromark / micromark

small, safe, and great commonmark (optionally gfm) compliant markdown parser

Home Page: https://unifiedjs.com

Attention misnests tokens

zamfofex opened this issue · comments

I have been wishing to write a (simple and lightweight) spec‐compliant editor for Markdown with syntax highlighting for a while now.

Now that this library has become usable (and it seems to be the first of its kind), I have finally gotten an opportunity to write a simple editor with it! (Thank you! 🎉)

Unfortunately, there appears to be a bug in the library! The issue I’m running into is that emphases marked with *** (both regular and strong) have their tokens misnested.

I have written a simple program to demonstrate what I mean:

simple reproduction example
import parser from "https://dev.jspm.io/micromark@2.0.0/lib/parse.js"
import preprocessor from "https://dev.jspm.io/micromark@2.0.0/lib/preprocess.js"
import postprocessor from "https://dev.jspm.io/micromark@2.0.0/lib/postprocess.js"

let preprocess = txt =>
{
	let write = preprocessor()
	return [...write(txt), ...write(null)]
}

let parse = text => postprocessor()(preprocess(text).flatMap(parser().document().write))

let tokens = parse("hello ***world***")
tokens.pop()

let output = ""

let i = 0
let offset
for (let [kind, {type, start, end}] of tokens)
{
	let char = "→"
	if (kind === "enter") offset = start.offset
	else offset = end.offset, i--, char = "←"
	output += `${" ".repeat(i*3) + char} ${type} at ${offset}\n`
	if (kind === "enter") i++
}

console.log(output)

(Note: I’m using dev.jspm.io for now instead of jspm.dev, because jspm.dev bundles the whole library into its index file rather than keeping it split across multiple files. See jspm.dev’s announcement post for more info.)

Currently, the output is the following:

current output
→ content at 0
   → paragraph at 0
      → data at 0
      ← data at 5
      → data at 5
      ← data at 6
      → emphasis at 8
         → emphasisSequence at 8
         ← emphasisSequence at 9
         → emphasisText at 9
            → strong at 6
               → strongSequence at 6
               ← strongSequence at 8
               → strongText at 8
                  → data at 9
                  ← data at 14
               ← strongText at 15
               → strongSequence at 15
               ← strongSequence at 17
            ← strong at 17
         ← emphasisText at 14
         → emphasisSequence at 14
         ← emphasisSequence at 15
      ← emphasis at 15
   ← paragraph at 17
← content at 17

As you can see, when moving from → emphasisText at 9 to → strong at 6 (as well as in other places), the indices go down, which is unexpected. This causes my highlighter to break! 😱
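A minimal sketch of how this kind of misordering could be detected automatically, assuming events shaped like the `[kind, token]` pairs in the reproduction above. `findBackwardSteps` is a hypothetical helper, not part of micromark, and the sample events are copied by hand from the output above rather than produced by the parser:

```javascript
// Hypothetical helper, not part of micromark.
// `events` is assumed to be an array of [kind, token] pairs, where kind is
// "enter" or "exit" and token has `type`, `start.offset` and `end.offset`.
function findBackwardSteps(events) {
  const problems = []
  let previous = 0
  events.forEach(([kind, token], index) => {
    // Same rule as the printing loop: start offset on enter, end offset on exit.
    const offset = kind === "enter" ? token.start.offset : token.end.offset
    // An offset smaller than the one before it means the events are out of order.
    if (offset < previous) {
      problems.push({index, kind, type: token.type, offset, previous})
    }
    previous = offset
  })
  return problems
}

// Hand-written excerpt of the reported output (offsets copied from the listing).
const data = {type: "data", start: {offset: 5}, end: {offset: 6}}
const emphasis = {type: "emphasis", start: {offset: 8}, end: {offset: 15}}
const sequence = {type: "emphasisSequence", start: {offset: 8}, end: {offset: 9}}
const text = {type: "emphasisText", start: {offset: 9}, end: {offset: 14}}
const strong = {type: "strong", start: {offset: 6}, end: {offset: 17}}

const sampleEvents = [
  ["exit", data],
  ["enter", emphasis],
  ["enter", sequence],
  ["exit", sequence],
  ["enter", text],
  ["enter", strong]
]

const backwardSteps = findBackwardSteps(sampleEvents)
console.log(backwardSteps)
```

On this excerpt the only step flagged is the move from `emphasisText` at 9 back to `strong` at 6. Running the same check over the full event list would be one way to confirm a fix.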

Thanks in advance for the attention!

commented

Yay, the first bug!

Thanks for trying out micromark so fast, and while there aren’t a lot of docs yet, you managed to get it working! ✨

The order in which they are placed, emphasis first, then strong, is as expected, because a***b***c yields <p>a<em><strong>b</strong></em>c</p>. But indeed, the positional information is inverted.

Currently, runs of “attention”, e.g., ***, are parsed as one token, and later split. I think that’s a bit ugly. I think instead it should be parsed as an attentionSequence with separate attentionMarkers per character, which are then changed to either emphasisMarker, strongMarker, or data, depending on what they mean.
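To illustrate the idea, here is a rough sketch of labelling each character of an opening run separately. It is not micromark's implementation: `labelOpeningRun` is a hypothetical name, the marker types are just the ones proposed above, and it assumes the simplest case where the opening and closing runs have the same length and fully match, so no marker falls back to data:

```javascript
// Hypothetical sketch, not micromark's implementation.
// Assumes a balanced, fully matched run (for example the opening *** of
// "***world***"), so the "data" fallback for unmatched markers is not covered.
function labelOpeningRun(start, length) {
  const markers = []
  let offset = start
  // An odd-length run leaves a single outermost marker, which opens emphasis.
  if (length % 2 === 1) {
    markers.push({type: "emphasisMarker", offset})
    offset++
  }
  // The remaining markers are consumed two at a time, each pair opening strong.
  while (offset < start + length) {
    markers.push({type: "strongMarker", offset})
    offset++
  }
  return markers
}

// In "hello ***world***" the opening run starts at offset 6 and is 3 long.
const openingMarkers = labelOpeningRun(6, 3)
console.log(openingMarkers)
```

With per-character markers like these, the emphasis marker sits at offset 6 and the strong markers at 7 and 8, so the outer emphasis would start before the inner strong.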

> while there aren’t a lot of docs yet, you managed to get it working!

Well, it really wasn’t that difficult given how everything is organized so well! 😊 🎉

Also, still regarding emphases, I feel like I should mention that it seems like there are also bugs when they are misclosed (*hello** and **hello*).

I also found other similar misordering bugs regarding unclosed fenced code blocks, e.g. you can write let tokens = parse("~~~\n\n") in my example above to verify it, as well as empty lines inside blockquotes, e.g. parse("> a\n>\n> b").

Do you feel like you’d prefer to use this issue to regard these general kinds of bugs, or do you think it’d make more sense to file another issue regarding them?

(Also, I have, at least temporarily, put my editor in https://micro-mde.vercel.app and called it “microMDE” — I hope you don’t mind the name. You can verify how these misordering issues affect the editor if you’d like.)

commented

Positional info is something I hadn't really looked into, because it didn't matter for the reference compiler that goes to html, but I'm now looking at it as I'm creating an mdast compiler. In your case you're going even further: every token's position needs to be correct.

I think it would be good to have these reports as separate issues, so they can be discussed and solved separately as well (although I'd say the two attention bugs are the same one).

And I'm fine w the name!

commented

This should be fixed, along with a couple of other bugs. Please try out 2.1.0 and let me know how it all goes!

Hello! Sorry for taking so long to respond, I got a bit carried away with other projects. But now I found some time to take a look at this, and using version 2.2.0 (the latest), it does seem that this was fixed! 🎉 Thank you!

On the other hand, though, the other bugs I had mentioned (with blockquotes and fenced code blocks) still seem to be present. If you feel like it makes sense, I can file other issue reports regarding those (either just one, or two, as you feel like would make more sense). Thanks in advance once again for the attention!

commented

Please do! Keeping them somewhat separate would be appreciated.