Get marker delimitation

Question

jokteur opened this issue 5 months ago

Hello,

I am writing a WYSIWYG Markdown editor focused on math and science, and I want to use Markdown as the base format. The problem I am going to describe is present in many other Markdown parsers, so I decided to write a new parser from scratch (in C++) and to make some modifications to the Markdown standard to fit my own needs (this is the result).

The prototype I wrote worked okay, but I have now decided to rewrite the whole application in Rust and to stop maintaining my own parser, which is much more prone to bugs and crashes.

The marker delimitation problem

I am rewriting what I wrote here: https://github.com/jokteur/ab-parser#the-delimitation-marker-problem.

For my WYSIWYG application, I need to know where the markers of a specific block / span are, so that I can temporarily display them to the user, as in this demo: https://github.com/wooorm/markdown-rs/assets/25845695/420c1496-7306-4c69-b7ca-74059ec95886

Let's say that we have the following Markdown example:

- >> [abc
  >> def](example.com)

This example would generate an abstract syntax tree (AST) like:

DOC
  UL
    LI
      QUOTE
        QUOTE
          P
            URL
              TEXT

How do we attribute each non-text marker (like -, >, [, ...) to the correct block / span?

My parser was created to solve this specific problem while keeping reasonable performance. To do this, each object (BLOCK or SPAN) is represented by a vector of boundaries. A boundary is defined as follows:

struct Boundary {
    line_number: usize,
    pre: usize,
    beg: usize,
    end: usize,
    post: usize,
}

This struct holds the offsets in the raw text that delimit a block / span on one line. line_number is the line in the raw text that the boundary covers. Offsets between pre and beg are the pre-delimiters, and offsets between end and post are the post-delimiters. Everything between beg and end is the content of the block / span.

Here is a simple example. Suppose we have the text _italic_, starting at line 0 and offset 0. The boundary struct would then be {0, 0, 1, 7, 8}.
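As a minimal sketch of how such a boundary could be used, the snippet below slices the _italic_ example into its pre-delimiter, content, and post-delimiter. The split method is hypothetical (it is not part of my parser or of markdown-rs), and it assumes that the offsets are byte offsets with exclusive upper bounds.

```rust
/// Mirrors the `Boundary` struct above.
/// Assumption: offsets are byte offsets and upper bounds are exclusive.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Boundary {
    line_number: usize,
    pre: usize,
    beg: usize,
    end: usize,
    post: usize,
}

impl Boundary {
    /// Hypothetical helper: splits `line` into
    /// (pre-delimiter, content, post-delimiter).
    fn split<'a>(&self, line: &'a str) -> (&'a str, &'a str, &'a str) {
        (
            &line[self.pre..self.beg],
            &line[self.beg..self.end],
            &line[self.end..self.post],
        )
    }
}

fn main() {
    let line = "_italic_";
    let b = Boundary { line_number: 0, pre: 0, beg: 1, end: 7, post: 8 };
    let (open, content, close) = b.split(line);
    println!("line {}: {:?} {:?} {:?}", b.line_number, open, content, close);
}
```

Here the opening marker is "_", the content is "italic", and the closing marker is "_".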

Going back to the first example, we now use the following notation to illustrate ownership of markers: x indicates a delimiter, _ indicates content, and . indicates a character outside the boundary. Here is the ownership for each block and span:

- >> [abc
  >> def](example.com)

UL:
_________
______________________

LI:
xx_______
xx____________________

QUOTE (1st):
..x______
..x___________________

QUOTE (2nd):
...xx____
...xx_________________

P:
.....____
....._________________

URL:
.....x___
.....___xxxxxxxxxxxxxx

TEXT:
......___
.....___..............
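The rows above can be derived mechanically from boundaries. The following sketch renders one row of the notation from a boundary; ownership_row is a hypothetical helper, and the URL boundaries in main are values I worked out by hand from the example, assuming offsets are relative to the start of each line.

```rust
/// Mirrors the `Boundary` struct above.
/// Assumption: offsets are relative to the start of the line.
struct Boundary {
    line_number: usize,
    pre: usize,
    beg: usize,
    end: usize,
    post: usize,
}

/// Hypothetical helper: renders one row of the ownership notation.
/// 'x' = delimiter, '_' = content, '.' = outside the boundary.
fn ownership_row(line_len: usize, b: &Boundary) -> String {
    (0..line_len)
        .map(|i| {
            if (b.pre..b.beg).contains(&i) || (b.end..b.post).contains(&i) {
                'x'
            } else if (b.beg..b.end).contains(&i) {
                '_'
            } else {
                '.'
            }
        })
        .collect()
}

fn main() {
    let lines = ["- >> [abc", "  >> def](example.com)"];
    // Hand-derived boundaries for the URL span, one per line.
    let url = [
        Boundary { line_number: 0, pre: 5, beg: 6, end: 9, post: 9 },
        Boundary { line_number: 1, pre: 5, beg: 5, end: 8, post: 22 },
    ];
    println!("URL:");
    for b in &url {
        println!("{}", ownership_row(lines[b.line_number].len(), b));
    }
}
```

With these boundaries the two printed rows match the URL rows shown above.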

Is there any simple way to retrieve this kind of information?

Currently, markdown-rs provides positional information like this:

Text { value: "abc\ndef", position: Some(1:7-2:10 (6-19)) }

I may have a workaround to retrieve this kind of information: after the document has been parsed, start from the leaf nodes, compare each node's text with the raw text, check which characters are part of the node and which are not, and attribute the remaining ones to the parent. This workaround may be slow, but that is okay for my usage because I only need marker delimitation information where the cursor is (not for the whole document).
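A naive sketch of the comparison step could look like this. mark_content is a hypothetical function, not a markdown-rs API; it assumes we already have the raw source slice covered by a text node and the node's parsed value, and it aligns the two greedily, byte by byte.

```rust
/// Hypothetical helper: classifies each byte of `raw` (the source slice
/// covered by a text node) as content (`true`) or marker (`false`) by
/// greedily aligning it with the node's parsed `value`.
fn mark_content(raw: &str, value: &str) -> Vec<bool> {
    let mut remaining = value.bytes().peekable();
    raw.bytes()
        .map(|byte| {
            if remaining.peek() == Some(&byte) {
                // This byte is the next expected byte of the parsed value.
                remaining.next();
                true
            } else {
                // Not in the value: a marker owned by some ancestor.
                false
            }
        })
        .collect()
}

fn main() {
    // Source slice spanning the text node of the example, and its value.
    let raw = "abc\n  >> def";
    let mask = mark_content(raw, "abc\ndef");
    let row: String = raw
        .bytes()
        .zip(&mask)
        .map(|(b, &is_content)| match (b, is_content) {
            (b'\n', _) => '\n',
            (_, true) => '_',
            (_, false) => 'x',
        })
        .collect();
    println!("{}", row);
}
```

The greedy alignment is only an approximation: a marker byte that happens to equal the next content byte (for example, content that itself starts with > or a space) would be misclassified, and escapes or character references make the value differ from the raw text. A real implementation would need to handle those cases.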

I don't really know how markdown-rs works internally. How difficult would it be to have this information built into the parser?

Christian Murphy · Answer 1 · Mon Jan 01 2024 02:29:41 GMT+0800 (China Standard Time)

Welcome @jokteur! 👋
The overview of the project is a good starting point. https://github.com/wooorm/markdown-rs#overview
The process to parse markdown looks like this:

                    markdown-rs
+-------------------------------------------------+
|            +-------+         +---------+--html- |
| -markdown->+ parse +-events->+ compile +        |
|            +-------+         +---------+-mdast- |
+-------------------------------------------------+

If you want to work with raw events/tokens, rather than the AST, use the parse file/function.

jokteur · Answer 2 · Mon Jan 01 2024 23:09:59 GMT+0800 (China Standard Time)

And would there be any way to use the parse file/function without forking the project? Currently this API is private, which doesn't allow me to implement my own compiler on top of markdown-rs.

Christian Murphy · Answer 3 · Wed Jan 03 2024 00:14:19 GMT+0800 (China Standard Time)

There is also a JavaScript version of this project. On the JS side there is a lower-level package, micromark, that exposes this. https://github.com/micromark/micromark
@wooorm may be able to comment on the intent on the Rust side.

Titus · Answer 4 · Fri Jan 05 2024 19:52:11 GMT+0800 (China Standard Time)

No, it’s not exposed yet. This project is currently at the state where it has to get some traction IMO before all the internals are exposed, to figure out how to expose things, and whether to expose things.