parse5

Question

thisconnect opened this issue 8 years ago · comments

Hi, have you considered parse5? It is used by jsdom and comes with an htmlparser2 tree adapter:
https://github.com/inikulin/parse5/wiki/Documentation#parse5treeadapters

Steven Vachon commented 8 years ago

+1

Yamil Asusta commented 8 years ago

👍 👯

Devin Alexander Torres · Answer 1 · Thu Jan 28 2016 00:56:11 GMT+0800 (China Standard Time)

I came here just to create this issue.

Steven Vachon · Answer 2 · Tue Feb 02 2016 19:10:13 GMT+0800 (China Standard Time)

If it were to parse to the same data structure, so as to introduce as few breaking changes as possible, it would need a custom tree adapter.

sax should probably be used to parse XML, but htmlparser2 could still be used. Either way, because parse5 does not parse XML documents (though it will parse SVG embedded within HTML), there will need to be two functions, parseHtml and parseXml, so the API will change slightly.

Steven Vachon · Answer 3 · Tue Feb 02 2016 19:14:30 GMT+0800 (China Standard Time)

We should probably support both strings and streams, which should bring us to a promise-based API as well.
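A minimal sketch of what accepting both strings and streams behind one promise-based call could look like. The names `readAll` and `parseInput` are hypothetical, and `parse` stands in for whichever synchronous parser is plugged in. Buffering the whole stream before parsing is a simplification, not a claim about how parse5 or PostHTML handle streams.

```javascript
const { Readable } = require('node:stream');

// Hypothetical helper: resolve with the full text of a string or a readable stream.
async function readAll(input) {
  if (typeof input === 'string') return input;
  const chunks = [];
  for await (const chunk of input) {
    // Streams may yield strings or Buffers; normalise to Buffers.
    chunks.push(typeof chunk === 'string' ? Buffer.from(chunk) : chunk);
  }
  return Buffer.concat(chunks).toString('utf8');
}

// Hypothetical entry point: `parse` is any function from an HTML string to a tree.
async function parseInput(input, parse) {
  return parse(await readAll(input));
}

// Usage: both calls resolve to the same result.
// parseInput('<p>hi</p>', parse)
// parseInput(Readable.from(['<p>', 'hi', '</p>']), parse)
```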

Ivan Voischev · Answer 4 · Tue Feb 02 2016 21:08:14 GMT+0800 (China Standard Time)

Look at this issue: posthtml/posthtml#101 (comment)

Steven Vachon · Answer 5 · Mon Mar 07 2016 01:35:27 GMT+0800 (China Standard Time)

A regex-based IE® conditional comment parser could be written to assist parse5 in getting more data (see #9). parse5's location info can be used for reference points.
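As a rough illustration of the regex idea, the sketch below pulls the condition and the inner markup out of a comment node's data. It assumes the HTML parser has already produced the comment text (everything between `<!--` and `-->`), and it only covers the downlevel-hidden form `<!--[if ...]> ... <![endif]-->`. The function name and return shape are hypothetical.

```javascript
// Matches comment data such as: [if lt IE 9]> ...markup... <![endif]
const CONDITIONAL_COMMENT = /^\[if\s+([^\]]+)\]>([\s\S]*?)<!\[endif\]$/i;

// Returns { condition, content } for a conditional comment, or null otherwise.
function parseConditionalComment(commentData) {
  const match = CONDITIONAL_COMMENT.exec(commentData.trim());
  if (match === null) return null;
  return { condition: match[1].trim(), content: match[2] };
}
```

The extracted `content` could then be handed back to the HTML parser, with the comment's position serving as the reference point mentioned above.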

Steven Vachon · Answer 6 · Thu Aug 04 2016 06:30:08 GMT+0800 (China Standard Time)

Now that v0.9 has introduced options.parser, is there still interest in swapping out htmlparser2 for parse5? I still think that the default should be true HTML, which means spec-compliant; htmlparser2 is only pseudo-HTML.

Michael Ciniawsky · Answer 7 · Thu Aug 04 2016 07:07:00 GMT+0800 (China Standard Time)

@stevenvachon in progress :)

Parser using parse5 as the default parser
Core #159

Steven Vachon · Answer 8 · Thu Aug 04 2016 07:34:42 GMT+0800 (China Standard Time)

Awesome, thanks!

Jeff Escalante · Answer 9 · Thu Aug 04 2016 22:46:04 GMT+0800 (China Standard Time)

This is the proposed final parser if anyone has thoughts: https://github.com/static-dev/posthtml-parser2/pull/2

Steven Vachon · Answer 10 · Fri Aug 26 2016 22:35:50 GMT+0800 (China Standard Time)

As a result of @jescalan's PR getting rejected, I'm getting the feeling that this project will lie dormant for a long while until randomly receiving a 1.0 stamp, just like Grunt did. A quality, custom spec-compliant parser is too ambitious.

Jeff Escalante · Answer 11 · Sat Aug 27 2016 00:11:07 GMT+0800 (China Standard Time)

If you want to use the results of my PR immediately, you can use it through reshape. It's already being used in production in a few projects at my company and is working well. It will see a public release in the next month or so. The parser can be seen here; it uses parse5's SAXParser as discussed.

I can't speak to whether this assumption about posthtml is accurate, but if I had to guess, I would tend to agree with you here. There hasn't been much, if any, progress made in the last couple of months, and as you said, writing a custom parser is not really necessary and also too ambitious.

Michael Ciniawsky · Answer 12 · Sat Aug 27 2016 02:18:43 GMT+0800 (China Standard Time)

The 'problem' at the root is the limitation of the current AST: it's simply impossible to progressively enhance it. The PostHTML interface itself should be as flexible as possible, e.g. using htmlparser2, parse5, or whatever else interchangeably, according to the user's taste and use case. The AST format must be agreed upon, but the parser that generates it is not important.

Long-term, of course, there should be a parser developed here that acts as the default. Its AST, and that's the important thing, will be the baseline for all other possible implementations.

There is already progress, and it will take the time it takes :)
AST

import posthtml from 'posthtml'
import parse5 from 'posthtml-parse5' // or posthtml-pug, posthtml-hbs, posthtml-parser2, posthtml-parser (PostHTML's own HTML parser)
import jsx from 'posthtml-jsx' // or posthtml-js, posthtml-render

posthtml(plugins)
  .process('file.html', { parser: parse5, render: jsx }) // etc.
  .then((result) => {
    result.html // with/without stringifier (e.g. to js/jsx)
    result.tree // with/without generator
  })
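To make the "agreed tree format, interchangeable parser" idea concrete, here is a small self-contained sketch that does not use PostHTML itself. Everything in it (`createProcessor`, `paragraphParser`, `renderTree`, `addClass`) is hypothetical. The node shape `{ tag, attrs, content }` is modelled on PostHTML's tree, and the stand-in parser is deliberately trivial so that the block runs without any dependencies.

```javascript
// Hypothetical processor: plugins only see the tree, never the parser that built it.
function createProcessor(plugins) {
  return {
    process(source, { parser, render }) {
      const tree = plugins.reduce((t, plugin) => plugin(t) || t, parser(source));
      return Promise.resolve({ tree, html: render(tree) });
    },
  };
}

// Stand-in parser: treats the whole input as the text of a single <p>.
// A real one would wrap htmlparser2, parse5, etc. and emit the same node shape.
const paragraphParser = (text) => [{ tag: 'p', attrs: {}, content: [text] }];

// Minimal renderer for the node shape (no escaping, no void elements).
function renderTree(tree) {
  return tree.map((node) => {
    if (typeof node === 'string') return node;
    const attrs = Object.entries(node.attrs || {})
      .map(([name, value]) => ` ${name}="${value}"`)
      .join('');
    return `<${node.tag}${attrs}>${renderTree(node.content || [])}</${node.tag}>`;
  }).join('');
}

// Example plugin: adds a class to every element at the top level.
const addClass = (tree) => tree.map((node) =>
  typeof node === 'string' ? node : { ...node, attrs: { ...node.attrs, class: 'lead' } });
```

Swapping `paragraphParser` for any other function that returns the same node shape leaves the plugin and the renderer untouched, which is the point being argued here.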

Steven Vachon · Answer 13 · Sat Aug 27 2016 02:26:28 GMT+0800 (China Standard Time)

@michael-ciniawsky that wasn't the problem though. The argument was that the AST cannot change because "we have users".

Michael Ciniawsky · Answer 14 · Sat Aug 27 2016 03:02:31 GMT+0800 (China Standard Time)

@stevenvachon yeah well... :) As for this discussion, I would give it a go. The gist above is from one of the core maintainers, and I believe there will be some progress. If not, or if too much time passes without even an indication of progress, the so-called 'market' will simply move on to other solutions if they are 'superior'.

Steven Vachon · Answer 15 · Sun Apr 09 2017 12:47:19 GMT+0800 (China Standard Time)

@michael-ciniawsky and you want people to move on from your hard work?

Michael Ciniawsky · Answer 16 · Sun Apr 09 2017 18:38:37 GMT+0800 (China Standard Time)

@stevenvachon Yeah no... 😛 I was in the 'find a compromise' camp from the beginning, but I can't do anything in this regard. I have a few updates up my sleeve, which have been blocked for a while now as well. A 'unification' effort and progress would make the most sense from a single project's standpoint, but I'm neither blocking it nor able to do anything about it ¯\_(ツ)_/¯

Steven Vachon · Answer 17 · Mon Apr 10 2017 00:07:10 GMT+0800 (China Standard Time)

What is blocking them? I don't see any work being done to unblock anything.

Jeff Escalante · Answer 18 · Tue Apr 11 2017 06:18:26 GMT+0800 (China Standard Time)

Hi, just want to pop in and clarify that nobody has ever reached out to me about any "unification effort", nor have I ever been opposed to it (which is why the features that turned into reshape started out as a PR to posthtml), so I hope it's not being implied that I am the cause of anything being "blocked."

I am entirely open to any type of unification effort, so long as it does not end in a sacrifice of any of reshape's features, code quality, or test coverage. I'm available and happy to talk at any time.

✨ 💖 ✨

Steven Vachon · Answer 19 · Fri Jun 02 2017 22:13:56 GMT+0800 (China Standard Time)

cheerio 1.0.0-rc1 now uses parse5 by default.