inikulin / parse5

HTML parsing/serialization toolset for Node.js. WHATWG HTML Living Standard (aka HTML5)-compliant.

parse5 and streaming

jeromew opened this issue · comments

Is there a way to pipe a file into parse5? (This feature is available in htmlparser2.)

Similarly, is there a way to pause/resume the SAX parser? I believe that the getNextToken method could be used to decide when the parsing should pause/resume.

I have been using the html-tokenize streaming parser lately, along with its companion modules (html-select, trumpet), but reaching HTML5 conformance in html-tokenize is still a long way off, so I am trying to see how parse5 could fit in and be used with html-select.

The Tokenizer and Preprocessor take the HTML content as one big chunk. Have you considered the possibility of working on partial chunks that would be written one after the other?

parser.write('<div>')
parser.write('hello')
parser.write('</div>')
parser.end();

Hi,
currently parse5 doesn't support streaming. However, we've discussed the possibility of implementing it with the jsdom maintainers in the context of this thread.

I would like to know which goal you want to achieve with streaming. If its primary focus is performance, then I should warn you that I find it a bit questionable. For streaming, we need to teach the tokenizer to invalidate non-emitted tokens when the end of a chunk is encountered. This behavior requires the introduction of a tokenizer state snapshot mechanism: if we have an invalidated token, we need to roll the tokenizer back to the last valid state and rewind the preprocessor to the point where the last valid token was emitted. Since the tokenizer and preprocessor are the most performance-sensitive parts of parse5, this may cause significant performance degradation, so you could lose more than you win.
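
To illustrate the problem (reusing the hypothetical write() API from the example above): a chunk boundary can fall in the middle of a token, so the partially built token has to be invalidated and re-tokenized once more data arrives.

parser.write('<div cla');     // chunk ends mid-token: nothing can be emitted yet
parser.write('ss="a">hello'); // now <div class="a"> and "hello" can be emitted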

Long story short: I would definitely like to see a streaming API in parse5 too. But it requires quite complex research, and I'm afraid I will not be able to get my hands on this soon, since we need to land more important features (like <template> support) first and I'm already suffering from a lack of spare time. Therefore, any PR on this task will be highly appreciated.

OK, I see. Thanks for your feedback and for the link to the other thread. I agree that there is a big question mark over how all this will turn out performance-wise.

I'll start my research by understanding the caveats of the HTML5 spec and trying to see how compatible the approach taken by html-tokenize is with all this.

Streaming would be useful in parsing HTML that contains template logic such as handlebars/etc.

@stevenvachon Sounds interesting. Can you describe this scenario in detail, please? How can streaming help here?

@inikulin Having worked on my idea further, I realize that I was mistaken. Instead, #33 makes more sense.

The biggest benefit of streaming is in memory usage and garbage collection, not parse performance. Not having to hold an entire HTML file in memory, especially when processing thousands of them, has great benefits.

There's also the benefit of having a DOM to start working on before all the data has arrived.

Any progress on this at all?

In my case I'm using https://github.com/isaacs/sax-js and it works great. The features of sax-js that I needed were:

  • speed
  • parse HTML even if it's not valid (an element that is not closed, for instance)
  • retrieve the location of each element and attribute
  • works in the web browser (without using browserify)

And sax-js provides those features (as for speed, I have not benchmarked it).

I wonder if it's a good idea for parse5 to spend time implementing a new SAX parser when sax-js already works great.

But perhaps I have missed something?

@angelozerr sax-js is not an HTML parser.
@stevenvachon I will not be able to get my hands on it until the end of July/August at the earliest. Any PR is still welcome; if someone would like to start implementing it, you can count on my assistance.

@angelozerr sax-js is not an HTML parser.

You could implement sax-js callbacks to create DOM nodes and build an HTML document, no?

@angelozerr No, an HTML parser requires the HTML preprocessing/tokenization/tree construction algorithms.

Ok, good news, everyone. I finally figured out how it should be done, and parse5 will receive streaming support in the near future.

One thing that I would like to put on the discussion agenda is the API. Should ParserStream and SerializerStream extend Node's WritableStream and ReadableStream respectively, or should they just expose a stream-like API, like htmlparser2? And how will you obtain the resulting AST: should ParserStream.end() return it, or should we give the user access to the unfinished AST via a property? Unsophisticated modification of the unfinished AST can break the parser (since we don't have a DOM which guarantees that AST modifications will not lead to a malformed tree). This bothers me a little bit. Any ideas or suggestions?
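
For reference, the first option could look roughly like this (a minimal sketch, not a committed design; the document property and its semantics are assumptions):

const { Writable } = require('stream');
const fs = require('fs');

// Sketch: ParserStream extends Node's Writable, so anything readable
// can be piped into it. The unfinished AST is exposed as a property.
class ParserStream extends Writable {
  constructor() {
    super();
    this.document = null; // unfinished AST, grows as chunks arrive
  }

  _write(chunk, encoding, callback) {
    // feed chunk.toString() to the parser here, updating this.document
    callback();
  }
}

// Usage: pipe a file straight into the parser.
fs.createReadStream('page.html').pipe(new ParserStream());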

Actually, extending streams is awesome: then you can pipe into it and it Just Works. Stream-like APIs that aren't actually streams give you all sorts of fun edge cases.

The unfinished AST is definitely useful, but it could be given with a stern warning.

The unfinished AST is ideal for me, but what kind of malformed tree could we end up with? Why not delay certain elements until more information is available? An example stream could be:

<p>asdfasdfasdf

followed by

asdfasasdf<p>asdfasdfasdf</p>

The parser wouldn't yet know if the first <p> is a void element or not. Instead of emitting it, the parser could hang onto it until it has access to future closing tags (or some kind of end()).

Actually... that creates a problem, doesn't it, because <html> wouldn't be emitted until the whole thing is done.

+1 for extending streams instead of home-cooked streams

I don't know if it is possible, but having a streaming solution inspired by html-tokenize that would push data as soon as it is guaranteed to be "correctly" understood would be nice.

I already tried to do such a thing but my progress stalled. The code is just horrible, and I was blocked because it needs a rewrite of html-tokenize to be more HTML5-conformant: https://github.com/jeromew/html-nest

There definitely is a problem with the html attributes, because the spec says that additional html elements should push their attributes onto the first one... this is not stream-friendly at all! Other than that, I felt that in most scenarios we should not have to withhold tokens in memory for too long.

This is what I found during this experiment:

  • There will always be an html element in the tree. The specification states that if an html opening tag is found in the 'in body' insertion mode, its attributes should extend the attributes of the first html element (see the sketch after this list). Doing this would basically mean buffering the whole document in memory. A way to mitigate this could be to send provisional html tags.
  • The table 'foster parenting' algorithm states that if we find elements inside a table that have nothing to do inside the table, they should be reparented to just before the table. In order to do this, we have to buffer the tables.
  • Misnested tags are rectified by the 'adoption agency algorithm'. This algorithm tracks certain formatting elements (b, i, ...) and locally reorganizes the elements when misnesting is detected. Sometimes the reorganization is triggered after tokens have already been processed. In order to follow this algorithm, we have to buffer tokens during formatting sections.
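
The html attribute rule from the first bullet, sketched in code (the helper names are illustrative, not parse5's API):

// Attributes of a later <html> start tag are merged onto the already-created
// html element; attributes it already has are kept as-is.
function extendHtmlAttrs(htmlElement, startTagToken) {
  for (const attr of startTagToken.attrs) {
    if (!htmlElement.hasAttribute(attr.name)) {
      htmlElement.setAttribute(attr.name, attr.value);
    }
  }
}

// E.g. '<html lang="en"> ... <html lang="fr" class="x">' yields an html
// element with lang="en" (kept) and class="x" (added). A streaming consumer
// that has already emitted the html element can no longer apply this.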

Regarding @aredridel's mention of a "stern warning": also include the technical reasons why there is a warning. It would do two things: (1) thwart incorrect use; (2) enlighten us on which issues to avoid.

I seriously question whether it’s actually possible in practice to implement a fully spec-conformant streaming parser without it needing to also buffer the entire document. It seems that no matter what partial-buffering strategy you try, you can often end up needing to buffer such a large part of the document anyway that you don’t gain much versus just buffering the whole document to begin with.

The table 'foster parenting' algorithm states that if we find elements inside a table that have nothing to do inside the table, they should be reparented to just before the table. In order to do this, we have to buffer the tables.

Yeah, you must always buffer all tables entirely. And if you consider how many documents out there are largely made of tables (e.g., pages that use tables for layout), in a lot of cases you're going to end up buffering the majority of the document anyway.

The specification states that if an html opening tag is found in the 'in body' insertion mode, its attributes should extend the attributes of the first html element. Doing this would basically mean buffering the whole document in memory.

Yeah. I don’t understand how it’s possible to get around that without requiring all the consumers of your streaming parser to implement some kind of special handling for html elements.

A way to mitigate this could be to send provisional html tags

I don’t think I understand clearly what you mean by that.

Pinging @gsnedders (one of the html5lib devs), who might have some additional thoughts on this (though I doubt he can make time right now to respond here, and anyway my comments in #26 (comment) reflect a discussion I just had with him on #whatwg IRC).

@sideshowbarker

It's possible. Because:

In order to do this, we have to buffer the tables.

We shouldn't. We just walk up the stack of open elements and search for the table's parent:
https://github.com/inikulin/parse5/blob/master/lib/tree_construction/parser.js#L823
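
In code, the idea is roughly this (a simplified sketch of the linked routine; the template and no-parent special cases are omitted):

// Walk the stack of open elements from the top and find the topmost <table>;
// the foster-parented node is inserted into the table's parent, before it.
function findFosterParentingLocation(openElements) {
  for (let i = openElements.length - 1; i >= 0; i--) {
    if (openElements[i].tagName === 'table') {
      return { parent: openElements[i].parentNode, beforeElement: openElements[i] };
    }
  }
  // No table on the stack: insert into the html element at the bottom.
  return { parent: openElements[0], beforeElement: null };
}

So only the stack of open elements is consulted; no buffered token stream is needed.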

Doing this would basically mean buffering the whole document in memory.

Nope. Since we are 'in body', we already have the html element at the bottom of the stack of open elements.

The parser doesn't use token lookahead, so the bulk of the changes goes to the tokenizer. We will take snapshots of the tokenizer state after each token emission. Then, if we hit the end of a chunk, we invalidate the last token, roll back to the last snapshot, and suspend the parser. The next call to write() resumes the parser. That's it.
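
A toy illustration of that cycle (not parse5 code; the real tokenizer is a spec-defined state machine, not a naive scanner like this one):

// Tokens are emitted only when complete. If a chunk ends mid-token, we roll
// the position back to the last snapshot and suspend until the next write().
class ToyStreamingTokenizer {
  constructor(onToken) {
    this.onToken = onToken;
    this.html = '';
    this.pos = 0; // snapshot: position right after the last emitted token
  }

  write(chunk) {
    this.html += chunk;
    while (this.pos < this.html.length) {
      const snapshot = this.pos;
      const token = this._next();
      if (token === null) {
        this.pos = snapshot; // invalidate the partial token and suspend
        return;
      }
      this.onToken(token);
    }
  }

  // Returns the next complete token, or null if the chunk ends mid-token.
  _next() {
    if (this.html[this.pos] === '<') {
      const end = this.html.indexOf('>', this.pos);
      if (end === -1) return null; // tag split across chunks
      const tag = { type: 'tag', value: this.html.slice(this.pos, end + 1) };
      this.pos = end + 1;
      return tag;
    }
    let end = this.html.indexOf('<', this.pos);
    if (end === -1) end = this.html.length;
    const text = { type: 'text', value: this.html.slice(this.pos, end) };
    this.pos = end;
    return text;
  }
}

const t = new ToyStreamingTokenizer(token => console.log(token));
t.write('<div cla');      // suspends: nothing emitted yet
t.write('ss="a">hello');  // emits the tag token, then the text token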

@inikulin thanks! Looking back at the comments in this issue, I see now that I misunderstood what it's about…

@sideshowbarker Ah, I see where this discussion started: servo/html5ever#149

My conclusion is that you can't have fully spec-compliant parsing without buffering the already-produced DOM tree. In our case it's not an issue: parse5 has a SAX-style parser, but it behaves more like a tokenizer with simulated parser feedback (the CDATA parsing flag, switching text parsing modes).
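
For instance, the simulated feedback amounts to switching the tokenizer's text parsing mode based on the last seen start tag. Roughly (an illustrative sketch; switchTextMode is a hypothetical method, and the real logic also covers integration points):

// Without a tree builder to provide feedback, a SAX-style driver switches
// the tokenizer's text parsing mode itself when certain start tags appear.
const TEXT_PARSING_MODES = {
  title: 'RCDATA', textarea: 'RCDATA',
  style: 'RAWTEXT', xmp: 'RAWTEXT', iframe: 'RAWTEXT',
  noembed: 'RAWTEXT', noframes: 'RAWTEXT',
  script: 'SCRIPT_DATA',
  plaintext: 'PLAINTEXT'
};

function onStartTag(tokenizer, tagName) {
  const mode = TEXT_PARSING_MODES[tagName];
  if (mode) tokenizer.switchTextMode(mode);
}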

@sideshowbarker Ah, I see where this discussion started: servo/html5ever#149

Yup

My conclusion is that you can't have fully spec-compliant parsing without buffering the already-produced DOM tree.

Yeah, if you're going to conform to, e.g., the adoption-agency requirements and foster-parenting requirements and the special case of the html start tag, I guess you pretty much can't do that without building some kind of full tree. It may not actually be a full "DOM" as such, but I guess it's going to end up being effectively the same regardless of what you call it.

E.g., the code in https://github.com/validator/htmlparser/tree/master/src/nu/validator/saxtree provides an event-based SAX API by building something that the code actually calls a "SAX tree"...

But I still can't figure out the use cases for such an approach. Using a SAX parser, you most likely will not need information about the position of an element in the DOM tree (it becomes even more absurd if you don't have a DOM tree).

But I still can't figure out the use cases for such an approach. Using a SAX parser, you most likely will not need information about the position of an element in the DOM tree (it becomes even more absurd if you don't have a DOM tree).

Yeah, I've never used that API. Instead I have used the fully streaming SAX API that the code also provides. That streaming API builds no tree and produces a fatal error for any markup that would require the adoption-agency or foster-parenting algorithms. That streaming API is actually what validator.nu and the W3C Nu Html Checker use. They also report all parse errors to the end user, so the streaming-but-stop-with-an-error-for-fatal-parse-errors approach makes sense in that context.

Ok, here is the fundamental question: should we abandon the non-streaming API and release 2.0, or keep the non-streaming API as well? If we keep non-streaming, the API will be messy for my taste:

  • Parser
  • SAXParser
  • Serializer
  • ParserStream
  • SAXParserStream
  • SerializerStream

I would like to expose Tokenizer, TokenSerializer and a so-called ForgivenParser (like htmlparser2, but with proper spec-compatible tokenization, integration points, etc.). With streaming alternatives for each of them it will be a real mess.
Please let me know what you think.

I don't see why we would need non-streaming anymore. htmlparser2 is stream-only with pseudo-non-streaming via a single write() call.
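
For example, parsing a whole document with htmlparser2 is just a single write() followed by end():

const htmlparser2 = require('htmlparser2');

const parser = new htmlparser2.Parser({
  onopentag(name) { console.log('open:', name); },
  ontext(text) { console.log('text:', text); }
});

parser.write('<div>hello</div>'); // the entire document as one chunk
parser.end();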

Totally awesome on the ForgivenParser! Though, perhaps rename it to ForgivingParser?

@stevenvachon Yep, it should be ForgivingParser, just my typo.

Totally looking forward to this. Will <{{tag}}> and <tag {{attr}}> be possible?

@stevenvachon nope, we will still use the current tokenizer, and the HTML5 lexical grammar doesn't allow such constructs.

@inikulin I'll defer to @Sebmaster but I think we just want JsDomParser to stay around, i.e. for streaming to primarily be via document.write calls.

@domenic Ok, I'll keep it. However, handling both document.write and streaming will be an interesting task, so I think there will be a huge rewrite of the current JSDOMParser code anyway.

@inikulin what is 'forgiving like htmlparser2', then?

@inikulin what is 'forgiving like htmlparser2', then?

Huh, it seems like I'm starting to remember why I was initially against 'forgiving' parsing in parse5: there is no exact definition of 'forgiving parsing', so people will always be somehow unhappy with it.
Conclusion: cancelled 🎉

lol, ok 😢