posthtml / posthtml-parser

Parse HTML/XML to PostHTMLTree

parse5

thisconnect opened this issue

Hi, have you considered parse5? It is used by jsdom and comes with an htmlparser2 tree adapter:
https://github.com/inikulin/parse5/wiki/Documentation#parse5treeadapters

I came here just to create this issue.

👍 👯

If it were to parse to the same data structure (so as to introduce as few breaking changes as possible), it would need a custom tree adapter.
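To make the idea concrete, here is a minimal sketch of the mapping such an adapter would have to perform. It is not a real parse5 tree adapter: it converts an already-built tree after parsing, which is a simpler stand-in. It assumes parse5's default node shapes (`nodeName`, `tagName`, `attrs` as a list of `{ name, value }`, `childNodes`, `value` on text nodes, `data` on comment nodes) and a PostHTML-style tree of strings and `{ tag, attrs, content }` objects. `toPostHTMLTree` is a hypothetical helper name, and the input is hand-built rather than produced by parse5.

```javascript
// Sketch: convert a parse5-style default tree into a PostHTML-style tree.
// toPostHTMLTree is a hypothetical helper, not part of either library.
function toPostHTMLTree(nodes) {
  const tree = [];
  for (const node of nodes) {
    if (node.nodeName === '#text') {
      // Text nodes become plain strings.
      tree.push(node.value);
    } else if (node.nodeName === '#comment') {
      // Comments are kept as strings, delimiters included.
      tree.push('<!--' + node.data + '-->');
    } else if (node.tagName) {
      const out = { tag: node.tagName };
      if (node.attrs && node.attrs.length > 0) {
        // Turn the list of { name, value } pairs into an object.
        out.attrs = {};
        for (const attr of node.attrs) out.attrs[attr.name] = attr.value;
      }
      if (node.childNodes && node.childNodes.length > 0) {
        out.content = toPostHTMLTree(node.childNodes);
      }
      tree.push(out);
    }
    // Other node types (e.g. doctype) are skipped in this sketch.
  }
  return tree;
}

// Hand-built input standing in for the parsed form of: <p class="a">hi<!-- x --></p>
const parse5Like = [{
  nodeName: 'p',
  tagName: 'p',
  attrs: [{ name: 'class', value: 'a' }],
  childNodes: [
    { nodeName: '#text', value: 'hi' },
    { nodeName: '#comment', data: ' x ' }
  ]
}];

const tree = toPostHTMLTree(parse5Like);
console.log(JSON.stringify(tree));
```

A real tree adapter would build this shape directly during parsing instead of converting afterwards, which avoids walking the tree twice.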

sax should probably be used to parse XML, but htmlparser2 could still be used. Either way, because parse5 does not parse XML documents (though it will parse SVG embedded within HTML), there will need to be two functions, parseHtml and parseXml, so the API will change slightly.

We should probably support both strings and streams, which should bring us to a promise-based API as well.
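One possible shape for that, as a rough sketch only: a promise-returning wrapper that accepts either a string or a stream and hands the full text to whichever synchronous parser the caller supplies. `readInput` and `parseAsync` are hypothetical names, not part of posthtml-parser. The sketch relies on Node.js Readable streams being async iterable and on the global `TextDecoder`, and it buffers the whole input rather than parsing incrementally.

```javascript
// readInput (hypothetical): accepts a string or any async iterable of
// chunks (strings or Buffers) and resolves to the full source text.
async function readInput(input) {
  if (typeof input === 'string') return input;
  // A streaming decoder avoids corrupting multi-byte characters that
  // happen to be split across two chunks.
  const decoder = new TextDecoder('utf-8');
  let text = '';
  for await (const chunk of input) {
    text += typeof chunk === 'string' ? chunk : decoder.decode(chunk, { stream: true });
  }
  return text + decoder.decode(); // flush any buffered bytes
}

// parseAsync (hypothetical): `parser` is any synchronous
// string -> tree function chosen by the caller.
async function parseAsync(input, parser) {
  return parser(await readInput(input));
}

// Usage with an async generator standing in for a stream.
async function* chunks() {
  yield '<p>';
  yield 'hi</p>';
}

parseAsync(chunks(), (html) => [html]).then((tree) => console.log(tree));
```

A streaming parser could start work before the input ends; this version trades that for simplicity.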

A regex-based IE® conditional comment parser could be written to assist parse5 in getting more data (see #9). parse5's location info can be used for reference points.
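As an illustration of what such a helper might look like, here is a minimal sketch that takes the inner text of a comment node (without the `<!--` and `-->` delimiters) and splits it into the condition and the hidden markup. `parseConditionalComment` is a hypothetical name, and the pattern only covers the common "downlevel-hidden" form; other variants would need additional handling.

```javascript
// Matches the "downlevel-hidden" form: <!--[if lt IE 9]> ... <![endif]-->
// The input is the comment's inner text, without <!-- and -->.
const CONDITIONAL_COMMENT = /^\[if\s+([^\]]+)\]>([\s\S]*?)<!\[endif\]$/;

// parseConditionalComment is a hypothetical helper name.
function parseConditionalComment(data) {
  const match = CONDITIONAL_COMMENT.exec(data.trim());
  if (match === null) return null; // an ordinary comment
  return { condition: match[1].trim(), content: match[2] };
}

const result = parseConditionalComment('[if lt IE 9]><p>old</p><![endif]');
console.log(result);
```

The `content` string could then be passed back through the HTML parser, with the comment node's location info used to place the resulting nodes.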

Now that v0.9 has introduced options.parser, is there still interest in swapping out htmlparser2 for parse5? I still think that the default should be true HTML, which means spec-compliant; htmlparser2 is only pseudo-HTML.

@stevenvachon in progress :)

Parser using parse5 as the default parser
Core #159

Awesome, thanks!

This is the proposed final parser if anyone has thoughts: https://github.com/static-dev/posthtml-parser2/pull/2

As a result of @jescalan's PR getting rejected, I'm getting the feeling that this project will lay dormant for a long while until randomly receiving a 1.0 stamp, just like Grunt did. A quality, custom spec-compliant parser is too ambitious.

If you want to use the results of my PR immediately, you can use it through reshape. It's already being used in production on a few projects at my company and is working well. It will see a public release in the next month or so. The parser can be seen here; it uses parse5's SAXParser as discussed.

I can't speak to whether this assumption about posthtml is accurate, but if I had to guess I would tend to agree with you here. There hasn't been much, if any, progress made in the last couple of months, and as you said, writing a custom parser is not really necessary and also too ambitious.

The 'problem' at the root is the limitation of the current AST: it's simply impossible to progressively enhance it. The PostHTML interface itself should be as flexible as possible, e.g. use htmlparser2, parse5, whatever, interchangeably, according to the taste of the user and their use case. The AST format must be agreed upon; the parser that generates it is not important.

Long-term, of course there should be a parser developed here and acting as the default... its AST, and that's the important thing, will be the baseline for all other possible implementations.

There is already progress and the time it takes, it simply takes :)
AST

// Proposed API sketch: the parser and renderer are swappable options
import posthtml from 'posthtml'
import parse5 from 'posthtml-parse5' // or posthtml-pug, posthtml-hbs, posthtml-parser2, posthtml-parser (PostHTML's own HTML parser)
import jsx from 'posthtml-jsx' // or posthtml-js, posthtml-render

const plugins = [] // any PostHTML plugins

posthtml(plugins)
  .process('file.html', { parser: parse5, render: jsx })
  .then((result) => {
    result.html // with/without stringifier (e.g. to js/jsx)
    result.tree // with/without generator
  })

@michael-ciniawsky that wasn't the problem though. The argument was that the AST cannot change because "we have users".

@stevenvachon yeah well... :), this discussion... I would give it a go. The gist above is from one of the core maintainers and I believe there will be some progress. If not, or if tooooo much time passes without even an indication of progress, the so-called 'market' will simply move on to other solutions if they are 'superior'.

@michael-ciniawsky and you want people to move on from your hard work?

@stevenvachon Yeah no... 😛 I was in the 'find a compromise' camp from the beginning, but I can't do anything in this regard. I have a few updates up my sleeve, which have been blocked for a while now as well. A 'unification' effort and progress would make the most sense from a single project's standpoint, but I'm neither blocking it nor able to do something about it ¯_(ツ)_/¯

What is blocking them? I don't see any work being done to unblock anything.

Hi, just want to pop in and clarify that nobody has ever reached out to me about any "unification effort", nor have I ever been opposed to it (which is why the features that turned into reshape started out as a PR to posthtml), so I hope it's not being implied that I am the cause of anything being "blocked."

I am entirely open to any type of unification effort, so long as it does not end in a sacrifice of any of reshape's features, code quality, or test coverage. I'm available and happy to talk at any time.

✨ 💖 ✨

cheerio 1.0.0-rc1 now uses parse5 by default.