unifiedjs / ideas

Share ideas for new utilities and tools built with @unifiedjs

Home page: https://unifiedjs.com

Potential future syntax trees?

chrisrzhou opened this issue

Just curious: what requirements make something a candidate for a future syntax tree? We currently have:

  • hast: HTML
  • mdast: markdown
  • xast: XML
  • nlcst: natural language
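
All of these share the same basic unist node shape: a `type`, `children` or a `value`, and positional info. As a rough sketch (not tied to any particular parser), mdast represents the markdown `*hi*` as something like:

```js
// Rough shape of the mdast tree for the markdown "*hi*";
// positions omitted for brevity.
const tree = {
  type: 'root',
  children: [
    {
      type: 'paragraph',
      children: [
        {type: 'emphasis', children: [{type: 'text', value: 'hi'}]}
      ]
    }
  ]
}
```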

And ideas in this project around:

  • Elm
  • SQL
  • CSS

How does the unified community/team decide whether it's 'worth' working on a new syntax tree in the ecosystem? Is the main decision based on whether the data represents 'content', something that can be marked up?

This question is mainly motivated by looking at some popular content types: I was wondering how the following fit into the ecosystem, and whether thinking of them as syntax trees even makes sense:

  • csv: the only thing 'ast-y' about this is that you parse the content into Row and Column nodes (see the sketch after this list).

  • epub: this seems like an exciting format to unify ebook-related content and tooling.

  • docx: a popular format that, if well represented in unified, could be really exciting.

  • pdf: same as docx

  • tex: might be exciting for the scientific community if all popular flavors of tex are unified.

  • mathml

  • rtf

  • etc.
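
As a rough sketch of the csv idea (using unist-builder; the `row`/`column` node types here are hypothetical, not an existing spec):

```js
import {u} from 'unist-builder'

// Hypothetical `table`/`row`/`column` node types for csv content; not an
// existing unified spec, just what "Row, Column nodes" might look like.
// Input: "name,age\nalice,30"
const tree = u('table', [
  u('row', [u('column', 'name'), u('column', 'age')]),
  u('row', [u('column', 'alice'), u('column', '30')])
])

console.log(tree.children[1].children[0]) // => {type: 'column', value: 'alice'}
```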

commented

Just curious: what requirements make something a candidate for a future syntax tree?

There are none! You can make them if you want to. Whether it makes sense to be part of the collective, as in, whether it should be developed by us, isn’t clear: there’s no process or guidance on that.

Is the main decision based on whether the data represents 'content', something that can be marked up?

I don’t think it has ever been a very clear decision. Take xast, the most recent one: I needed it for epub (and docx) and figured it would be a useful addition.
One thing though: it’s a tremendous amount of work to maintain a parser (remark-parse/micromark), so I’d suggest that a good existing parser to depend on is a requirement.
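
As a rough sketch of what that looks like in practice: a unified plugin only has to assign `this.parser`, so wrapping an existing utility (xast-util-from-xml here, purely as an example) is a few lines, and the hard parsing work stays in that dependency:

```js
import {unified} from 'unified'
import {fromXml} from 'xast-util-from-xml'

// The plugin just wires an existing parser into the processor;
// maintaining the actual parser stays out of scope.
function xmlParse() {
  this.parser = (document) => fromXml(document)
}

const tree = unified().use(xmlParse).parse('<greeting>hi</greeting>')
console.log(tree.children[0].name) // => 'greeting'
```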


  • json / csv — more data than content, but could be interesting indeed
  • epub / docx — zip archives of XML files; a collection of files doesn’t really make sense in unist in my opinion, but I can see xast utilities for working with these specific flavors (opf, ncx, etc.; see the sketch after this list)
  • pdf — don’t know enough about this
  • tex — yeah, that would be interesting!
  • rtf — same as tex, though: with these alternatives to HTML, is it really needed to have a separate syntax tree? Or can everything be done inside hast?
  • mathml — somewhat supported in hast when in html, or xast when in xml. It isn’t widely implemented in browsers, though. May not be worthwhile.
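
To make the epub/docx point a bit more concrete, a rough sketch of the kind of pattern a flavor-specific utility could wrap, here pulling the title out of an OPF package document with existing xast/unist utilities:

```js
import {fromXml} from 'xast-util-from-xml'
import {visit} from 'unist-util-visit'

// OPF (the epub package document) is plain XML, so generic xast
// utilities already apply; an opf-flavored utility would bundle
// lookups like this one for the Dublin Core metadata.
const tree = fromXml(
  '<package xmlns="http://www.idpf.org/2007/opf">' +
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">' +
    '<dc:title>My Book</dc:title>' +
    '</metadata>' +
  '</package>'
)

visit(tree, 'element', (node) => {
  if (node.name === 'dc:title') {
    console.log(node.children[0].value) // => 'My Book'
  }
})
```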

Thanks for the explanation and context on some of these! One last related question, on implementing 'unofficial' parsers (not that they are really unofficial: for the reasons you mentioned above, anything is valid and the ecosystem is non-opinionated):

Is it 'cheating' if we represent any content type as a hast tree by converting it to html (through existing libraries) and then into hast?

The latter step is guaranteed to be unist-compatible (position and all), but the former feels like we would be at the mercy of how well the other library does it. It's not a pure parser in a sense, but I was wondering if this is a valid approach. My goal for now is to find an intermediate way to bridge document rendering and other document features through hast, so this is most likely the approach I'm taking until the ecosystem slowly expands with 'official' parsers, which I can swap in later.
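
To illustrate, a rough sketch of that bridge, where the HTML string is assumed to come from some existing converter (pandoc, mammoth, whatever fits the format) outside of unified:

```js
import {fromHtml} from 'hast-util-from-html'

// Assume an existing library already turned the source document
// (docx, epub chapter, ...) into an HTML string; that step is
// outside unified and only as good as that library.
const html = '<h1>Chapter 1</h1><p>It was a dark and stormy night.</p>'

// From here on it is regular hast/unified territory.
const tree = fromHtml(html, {fragment: true})
console.log(tree.children[0].tagName) // => 'h1'
```

(One caveat: positional info in the resulting tree points at the intermediate HTML, not at the original document.)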

commented

Is it 'cheating' if we represent any content type as a hast tree by converting it to html (through existing libraries) and then into hast?

A bit. But this is also what pandoc does: one syntax tree with many readers and writers. The downside is that not everything can be represented. The upside is that you don’t have a combinatorial problem (a converter for every pair of formats) every time a new flavor arrives 😅