unifiedjs / ideas

Share ideas for new utilities and tools built with @unifiedjs

Home page: https://unifiedjs.com

Potential future syntax trees?

chrisrzhou opened this issue

Just curious: what requirements make something a candidate for a future syntax tree? We currently have:

  • hast: HTML
  • mdast: markdown
  • xast: XML
  • nlcst: natural language
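
All of these share the same basic unist node shape: a `type`, `children` or a `value`, and positional info. As a rough sketch (not tied to any particular parser), mdast represents the markdown `*hi*` as something like:

```js
// Rough shape of the mdast tree for the markdown "*hi*";
// positions omitted for brevity.
const tree = {
  type: 'root',
  children: [
    {
      type: 'paragraph',
      children: [
        {type: 'emphasis', children: [{type: 'text', value: 'hi'}]}
      ]
    }
  ]
}
```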

And ideas in this project around:

  • Elm
  • SQL
  • CSS

How does the unified community/team decide whether it's 'worth' working on a new syntax tree in the ecosystem? Is the main decision based on whether the data represents 'content', something that can be marked up?

This question is mainly motivated by looking at some popular content types: I was wondering how the following fit into the ecosystem, and whether thinking of them as syntax trees even makes sense:

  • csv: the only thing 'ast-y' about this is that you parse the content into Row and Column nodes (see the sketch after this list).

  • epub: this seems like an exciting format to unify ebook-related content and tooling.

  • docx: a popular format that, if well represented in unified, could be really exciting.

  • pdf: same as docx

  • tex: might be exciting for the scientific community if all popular flavors of tex are unified.

  • mathml

  • rtf

  • etc.
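
As a rough sketch of the csv idea (using unist-builder; the `row`/`column` node types here are hypothetical, not an existing spec):

```js
import {u} from 'unist-builder'

// Hypothetical `table`/`row`/`column` node types for csv content; not an
// existing unified spec, just what "Row, Column nodes" might look like.
// Input: "name,age\nalice,30"
const tree = u('table', [
  u('row', [u('column', 'name'), u('column', 'age')]),
  u('row', [u('column', 'alice'), u('column', '30')])
])

console.log(tree.children[1].children[0]) // => {type: 'column', value: 'alice'}
```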

commented

Just curious: what requirements make something a candidate for a future syntax tree?

There are none! You can make them if you want to. Whether it makes sense to be part of the collective, as in, whether it should be developed by us, isn’t clear: there’s no process or guidance on that.

Is the main decision based on whether the data represents 'content', something that can be marked up?

I don’t think it has ever been a very clear decision. Take xast, the most recent one: I needed it for epub (and docx) and figured it would be a useful addition.
One thing though: it’s a tremendous amount of work to maintain a parser (remark-parse/micromark), so I’d suggest that a good existing parser to depend on is a requirement.
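
As a rough sketch of what that looks like in practice: a unified plugin only has to assign `this.parser`, so wrapping an existing utility (xast-util-from-xml here, purely as an example) is a few lines, and the hard parsing work stays in that dependency:

```js
import {unified} from 'unified'
import {fromXml} from 'xast-util-from-xml'

// The plugin just wires an existing parser into the processor;
// maintaining the actual parser stays out of scope.
function xmlParse() {
  this.parser = (document) => fromXml(document)
}

const tree = unified().use(xmlParse).parse('<greeting>hi</greeting>')
console.log(tree.children[0].name) // => 'greeting'
```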


  • json / csv — more data than content, but could be interesting indeed
  • epub / docx — zip archives of XML files; a collection of files doesn’t really make sense in unist in my opinion, but I can see xast utilities for working with these specific flavors (opf, ncx, etc.; see the sketch after this list)
  • pdf — don’t know enough about this
  • tex — yeah, that would be interesting!
  • rtf — same as tex, though: with these alternatives to HTML, is it really needed to have a separate syntax tree? Or can everything be done inside hast?
  • mathml — somewhat supported in hast when in html, or xast when in xml. It isn’t widely implemented in browsers, though. May not be worthwhile.
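
To make the epub/docx point a bit more concrete, a rough sketch of the kind of pattern a flavor-specific utility could wrap, here pulling the title out of an OPF package document with existing xast/unist utilities:

```js
import {fromXml} from 'xast-util-from-xml'
import {visit} from 'unist-util-visit'

// OPF (the epub package document) is plain XML, so generic xast
// utilities already apply; an opf-flavored utility would bundle
// lookups like this one for the Dublin Core metadata.
const tree = fromXml(
  '<package xmlns="http://www.idpf.org/2007/opf">' +
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">' +
    '<dc:title>My Book</dc:title>' +
    '</metadata>' +
  '</package>'
)

visit(tree, 'element', (node) => {
  if (node.name === 'dc:title') {
    console.log(node.children[0].value) // => 'My Book'
  }
})
```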

Thanks for the explanation and context on some of these! One last related question, on implementing 'unofficial' parsers (not that they are really unofficial: for the reasons you mentioned above, anything is valid and the ecosystem is non-opinionated):

Is it 'cheating' if we represent any content type as a hast tree by converting it to html (through existing libraries) and then into hast?

The latter step is guaranteed to be unist-compatible (position and all), but the former feels like we would be at the mercy of how well the other library does it. It's not a pure parser in a sense, but I was wondering if this is a valid approach. My goal for now is to find an intermediate way to bridge document rendering and other document features through hast, so this is most likely the approach I'm taking until the ecosystem slowly expands with 'official' parsers, which I can swap in later.
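
To illustrate, a rough sketch of that bridge, where the HTML string is assumed to come from some existing converter (pandoc, mammoth, whatever fits the format) outside of unified:

```js
import {fromHtml} from 'hast-util-from-html'

// Assume an existing library already turned the source document
// (docx, epub chapter, ...) into an HTML string; that step is
// outside unified and only as good as that library.
const html = '<h1>Chapter 1</h1><p>It was a dark and stormy night.</p>'

// From here on it is regular hast/unified territory.
const tree = fromHtml(html, {fragment: true})
console.log(tree.children[0].tagName) // => 'h1'
```

(One caveat: positional info in the resulting tree points at the intermediate HTML, not at the original document.)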

commented

Is it 'cheating' if we represent any content type as a hast tree by converting it to html (through existing libraries) and then into hast?

A bit. But this is also what pandoc does: one syntax tree with many readers and writers. The downside is that not everything can be represented. The upside is that you don’t have a combinatorial problem (a converter for every pair of formats) every time a new flavor arrives 😅