inikulin / parse5

HTML parsing/serialization toolset for Node.js. WHATWG HTML Living Standard (aka HTML5)-compliant.

parse5 and streaming

jeromew opened this issue · comments

Is there a way to pipe a file into parse5? (This feature is available in htmlparser2.)

Similarly, is there a way to pause/resume the SAX parser? I believe that the getNextToken method could be used to decide when the parsing should pause/resume.

I have been using the html-tokenize streaming parser lately, along with its companion modules (html-select, trumpet), but reaching HTML5 conformance in html-tokenize is still a long way off, so I am trying to see how parse5 could fit in and be used with html-select.

The Tokenizer and Preprocessor take the HTML content as one big chunk. Have you considered the possibility of working on partial chunks that would be written one after the other?

parser.write('<div>')
parser.write('hello')
parser.write('</div>')
parser.end();

Hi,
currently parse5 doesn't support streaming. However, we've discussed the possibility of implementing it with the jsdom maintainers in the context of this thread.

I would like to know which goal you want to achieve with streaming. If its primary focus is performance, then I should warn you that I find it a bit questionable. For streaming, we need to teach the tokenizer to invalidate non-emitted tokens when the end of a chunk is encountered. This behavior requires the introduction of a tokenizer state snapshot mechanism: if we have an invalidated token, we need to roll the tokenizer back to the last valid state and rewind the preprocessor to the point where the last valid token was emitted. Since the tokenizer and preprocessor are the most performance-sensitive parts of parse5, this may cause significant performance degradation, so you could lose more than you win.
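
To illustrate the problem (reusing the hypothetical write() API from the example above): a chunk boundary can fall in the middle of a token, so the partially built token has to be invalidated and re-tokenized once more data arrives.

parser.write('<div cla');     // chunk ends mid-token: nothing can be emitted yet
parser.write('ss="a">hello'); // now <div class="a"> and "hello" can be emitted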

Long story short: I would definitely like to see a streaming API in parse5 too. But it requires quite complex research, and I'm afraid I will not be able to get my hands on this soon, since we need to land more important features (like <template> support) first and I'm already suffering from a lack of spare time. Therefore, any PR on this task will be highly appreciated.

OK, I see. Thanks for your feedback and for the link to the other thread. I agree that there is a big question mark over how all this will turn out performance-wise.

I'll start my research by understanding the caveats of the HTML5 spec and trying to see how compatible the approach taken by html-tokenize is with all this.

Streaming would be useful in parsing HTML that contains template logic such as handlebars/etc.

@stevenvachon Sounds interesting. Can you describe this scenario in detail, please? How can streaming help here?

@inikulin Having worked on my idea further, I realize that I was mistaken. Instead, #33 makes more sense.

The biggest benefit of streaming is in memory usage and garbage collection, not parse performance. Not having to hold an entire HTML file in memory, especially when processing thousands of them, has great benefits.

There's also the benefit of having a DOM to start working on before all the data has arrived.

Any progress on this at all?

In my case I'm using https://github.com/isaacs/sax-js and it works great. The features of sax-js that I needed were:

  • speed
  • parse HTML even if it's not valid (an element that is not closed, for instance)
  • retrieve the location of each element and attribute
  • works in the web browser (without using browserify)

And sax-js provides those features (as for speed, I have not benchmarked it).

I wonder if it's a good idea for parse5 to spend time implementing a new SAX parser when sax-js already works great.

But perhaps I have missed something?

@angelozerr sax-js is not an HTML parser.
@stevenvachon I will not be able to get my hands on it until the end of July/August at the earliest. Any PR is still welcome; if someone would like to start implementing it, you can count on my assistance.

@angelozerr sax-js is not an HTML parser.

You could implement sax-js callbacks to create DOM nodes and build an HTML document, no?

@angelozerr No, an HTML parser requires the HTML preprocessing/tokenization/tree construction algorithms.

Ok, good news, everyone. I finally figured out how it should be done, and parse5 will receive streaming support in the near future.

One thing that I would like to put on the discussion agenda is the API. Should ParserStream and SerializerStream extend Node's WritableStream and ReadableStream respectively, or should they just expose a stream-like API, like htmlparser2? And how will you obtain the resulting AST: should ParserStream.end() return it, or should we give the user access to the unfinished AST via a property? Unsophisticated modification of the unfinished AST can break the parser (since we don't have a DOM which guarantees that AST modifications will not lead to a malformed tree). This bothers me a little bit. Any ideas or suggestions?
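
For reference, the first option could look roughly like this (a minimal sketch, not a committed design; the document property and its semantics are assumptions):

const { Writable } = require('stream');
const fs = require('fs');

// Sketch: ParserStream extends Node's Writable, so anything readable
// can be piped into it. The unfinished AST is exposed as a property.
class ParserStream extends Writable {
  constructor() {
    super();
    this.document = null; // unfinished AST, grows as chunks arrive
  }

  _write(chunk, encoding, callback) {
    // feed chunk.toString() to the parser here, updating this.document
    callback();
  }
}

// Usage: pipe a file straight into the parser.
fs.createReadStream('page.html').pipe(new ParserStream());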

Actually, extending streams is awesome: then you can pipe into it and it Just Works. Stream-like APIs that aren't actually streams give you all sorts of fun edge cases.

The unfinished AST is definitely useful, but it could be given with a stern warning.

The unfinished AST is ideal for me, but what kind of malformed tree could we end up with? Why not delay certain elements until more information is available? An example stream could be:

<p>asdfasdfasdf

followed by

asdfasasdf<p>asdfasdfasdf</p>

The parser wouldn't yet know if the first <p> is a void element or not. Instead of emitting it, the parser could hang onto it until it has access to future closing tags (or some kind of end()).

Actually... that creates a problem, doesn't it, because <html> wouldn't be emitted until the whole thing is done.

+1 for extending streams instead of home-cooked streams

I don't know if it is possible, but having a streaming solution inspired by html-tokenize that would push data as soon as it is guaranteed to be "correctly" understood would be nice.

I already tried to do such a thing but my progress stalled. The code is just horrible, and I was blocked because it needs a rewrite of html-tokenize to be more HTML5-conformant: https://github.com/jeromew/html-nest

There definitely is a problem with the html attributes, because the spec says that additional html elements should push their attributes onto the first one... this is not stream-friendly at all! Other than that, I felt that in most scenarios we should not have to withhold tokens in memory for too long.

This is what I found during this experiment:

  • There will always be an html element in the tree. The specification states that if an html opening tag is found in the 'in body' insertion mode, its attributes should extend the attributes of the first html element (see the sketch after this list). Doing this would basically mean buffering the whole document in memory. A way to mitigate this could be to send provisional html tags.
  • The table 'foster parenting' algorithm states that if we find elements inside a table that have nothing to do inside the table, they should be reparented to just before the table. In order to do this, we have to buffer the tables.
  • Misnested tags are rectified by the 'adoption agency algorithm'. This algorithm tracks certain formatting elements (b, i, ...) and locally reorganizes the elements when misnesting is detected. Sometimes the reorganization is triggered after tokens have already been processed. In order to follow this algorithm, we have to buffer tokens during formatting sections.
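
The html attribute rule from the first bullet, sketched in code (the helper names are illustrative, not parse5's API):

// Attributes of a later <html> start tag are merged onto the already-created
// html element; attributes it already has are kept as-is.
function extendHtmlAttrs(htmlElement, startTagToken) {
  for (const attr of startTagToken.attrs) {
    if (!htmlElement.hasAttribute(attr.name)) {
      htmlElement.setAttribute(attr.name, attr.value);
    }
  }
}

// E.g. '<html lang="en"> ... <html lang="fr" class="x">' yields an html
// element with lang="en" (kept) and class="x" (added). A streaming consumer
// that has already emitted the html element can no longer apply this.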

Regarding @aredridel's mention of a "stern warning": also include the technical reasons why there is a warning. It would do two things: (1) thwart incorrect use; (2) enlighten us on which issues to avoid.

I seriously question whether it’s actually possible in practice to implement a fully spec-conformant streaming parser without it needing to also buffer the entire document. It seems that no matter what partial-buffering strategy you try, you can often end up needing to buffer such a large part of the document anyway that you don’t gain much versus just buffering the whole document to begin with.

The table 'foster parenting' algorithm states that if we find elements inside a table that have nothing to do inside the table, they should be reparented to just before the table. In order to do this, we have to buffer the tables.

Yeah, you must always buffer all tables entirely. And if you consider how many documents out there are largely made of tables (e.g., pages that use tables for layout), in a lot of cases you're going to end up buffering the majority of the document anyway.

The specification states that if an html opening tag is found in the 'in body' insertion mode, its attributes should extend the attributes of the first html element. Doing this would basically mean buffering the whole document in memory.

Yeah. I don’t understand how it’s possible to get around that without requiring all the consumers of your streaming parser to implement some kind of special handling for html elements.

A way to mitigate this could be to send provisional html tags

I don’t think I understand clearly what you mean by that.

Pinging @gsnedders (one of the html5lib devs), who might have some additional thoughts on this (though I doubt he can make time right now to respond here, and anyway my comments in #26 (comment) reflect a discussion I just had with him on #whatwg IRC).

@sideshowbarker

It's possible. Because:

In order to do this, we have to buffer the tables.

We shouldn't. We just walk up the stack of open elements and search for the table's parent:
https://github.com/inikulin/parse5/blob/master/lib/tree_construction/parser.js#L823
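
In code, the idea is roughly this (a simplified sketch of the linked routine; the template and no-parent special cases are omitted):

// Walk the stack of open elements from the top and find the topmost <table>;
// the foster-parented node is inserted into the table's parent, before it.
function findFosterParentingLocation(openElements) {
  for (let i = openElements.length - 1; i >= 0; i--) {
    if (openElements[i].tagName === 'table') {
      return { parent: openElements[i].parentNode, beforeElement: openElements[i] };
    }
  }
  // No table on the stack: insert into the html element at the bottom.
  return { parent: openElements[0], beforeElement: null };
}

So only the stack of open elements is consulted; no buffered token stream is needed.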

Doing this would basically mean buffering the whole document in memory.

Nope. Since we are 'in body', we already have the html element at the bottom of the stack of open elements.

The parser doesn't use token lookahead, so the bulk of the changes goes to the tokenizer. We will take snapshots of the tokenizer state after each token emission. Then, if we hit the end of a chunk, we invalidate the last token, roll back to the last snapshot, and suspend the parser. The next call to write() resumes the parser. That's it.
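
A toy illustration of that cycle (not parse5 code; the real tokenizer is a spec-defined state machine, not a naive scanner like this one):

// Tokens are emitted only when complete. If a chunk ends mid-token, we roll
// the position back to the last snapshot and suspend until the next write().
class ToyStreamingTokenizer {
  constructor(onToken) {
    this.onToken = onToken;
    this.html = '';
    this.pos = 0; // snapshot: position right after the last emitted token
  }

  write(chunk) {
    this.html += chunk;
    while (this.pos < this.html.length) {
      const snapshot = this.pos;
      const token = this._next();
      if (token === null) {
        this.pos = snapshot; // invalidate the partial token and suspend
        return;
      }
      this.onToken(token);
    }
  }

  // Returns the next complete token, or null if the chunk ends mid-token.
  _next() {
    if (this.html[this.pos] === '<') {
      const end = this.html.indexOf('>', this.pos);
      if (end === -1) return null; // tag split across chunks
      const tag = { type: 'tag', value: this.html.slice(this.pos, end + 1) };
      this.pos = end + 1;
      return tag;
    }
    let end = this.html.indexOf('<', this.pos);
    if (end === -1) end = this.html.length;
    const text = { type: 'text', value: this.html.slice(this.pos, end) };
    this.pos = end;
    return text;
  }
}

const t = new ToyStreamingTokenizer(token => console.log(token));
t.write('<div cla');      // suspends: nothing emitted yet
t.write('ss="a">hello');  // emits the tag token, then the text token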

@inikulin thanks! Looking back at the comments in this issue, I see now that I misunderstood what it's about…

@sideshowbarker Ah, I see where this discussion started: servo/html5ever#149

My conclusion is that you can't have fully spec-compliant parsing without buffering the already-produced DOM tree. In our case it's not an issue: parse5 has a SAX-style parser, but it behaves more like a tokenizer with simulated parser feedback (the CDATA parsing flag, switching text parsing modes).
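
For instance, the simulated feedback amounts to switching the tokenizer's text parsing mode based on the last seen start tag. Roughly (an illustrative sketch; switchTextMode is a hypothetical method, and the real logic also covers integration points):

// Without a tree builder to provide feedback, a SAX-style driver switches
// the tokenizer's text parsing mode itself when certain start tags appear.
const TEXT_PARSING_MODES = {
  title: 'RCDATA', textarea: 'RCDATA',
  style: 'RAWTEXT', xmp: 'RAWTEXT', iframe: 'RAWTEXT',
  noembed: 'RAWTEXT', noframes: 'RAWTEXT',
  script: 'SCRIPT_DATA',
  plaintext: 'PLAINTEXT'
};

function onStartTag(tokenizer, tagName) {
  const mode = TEXT_PARSING_MODES[tagName];
  if (mode) tokenizer.switchTextMode(mode);
}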

@sideshowbarker Ah, I see where this discussion started: servo/html5ever#149

Yup

My conclusion is that you can't have fully spec-compliant parsing without buffering the already-produced DOM tree.

Yeah, if you're going to conform to, e.g., the adoption-agency requirements and foster-parenting requirements and the special case of the html start tag, I guess you pretty much can't do that without building some kind of full tree. It may not actually be a full "DOM" as such, but I guess it's going to end up being effectively the same regardless of what you call it.

E.g., the code in https://github.com/validator/htmlparser/tree/master/src/nu/validator/saxtree provides an event-based SAX API by building something that the code actually calls a "SAX tree"...

But I still can't figure out the use cases for such an approach. Using a SAX parser, you most likely will not need information about the position of an element in the DOM tree (it becomes even more absurd if you don't have a DOM tree).

But I still can't figure out the use cases for such an approach. Using a SAX parser, you most likely will not need information about the position of an element in the DOM tree (it becomes even more absurd if you don't have a DOM tree).

Yeah, I've never used that API. Instead I have used the fully streaming SAX API that the code also provides. That streaming API builds no tree and produces a fatal error for any markup that would require the adoption-agency or foster-parenting algorithms. That streaming API is actually what validator.nu and the W3C Nu Html Checker use. They also report all parse errors to the end user, so the streaming-but-stop-with-an-error-for-fatal-parse-errors approach makes sense in that context.

Ok, here is the fundamental question: should we abandon the non-streaming API and release 2.0, or keep the non-streaming API as well? If we keep non-streaming, the API will be messy for my taste:

  • Parser
  • SAXParser
  • Serializer
  • ParserStream
  • SAXParserStream
  • SerializerStream

I would like to expose Tokenizer, TokenSerializer and a so-called ForgivenParser (like htmlparser2, but with proper spec-compatible tokenization, integration points, etc.). With streaming alternatives for each of them it will be a real mess.
Please let me know what you think.

I don't see why we would need non-streaming anymore. htmlparser2 is stream-only with pseudo-non-streaming via a single write() call.
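
For example, parsing a whole document with htmlparser2 is just a single write() followed by end():

const htmlparser2 = require('htmlparser2');

const parser = new htmlparser2.Parser({
  onopentag(name) { console.log('open:', name); },
  ontext(text) { console.log('text:', text); }
});

parser.write('<div>hello</div>'); // the entire document as one chunk
parser.end();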

Totally awesome on the ForgivenParser! Though, perhaps rename it to ForgivingParser?

@stevenvachon Yep, it should be ForgivingParser, just my typo.

Totally looking forward to this. Will <{{tag}}> and <tag {{attr}}> be possible?

@stevenvachon nope, we will still use the current tokenizer, and the HTML5 lexical grammar doesn't allow such constructs.

@inikulin I'll defer to @Sebmaster but I think we just want JsDomParser to stay around, i.e. for streaming to primarily be via document.write calls.

@domenic Ok, I'll keep it. However, handling both document.write and streaming will be an interesting task, so I think there will be a huge rewrite of the current JSDOMParser code anyway.

@inikulin what is 'forgiving like htmlparser2', then?

@inikulin what is 'forgiving like htmlparser2', then?

Huh, it seems like I'm starting to remember why I was initially against 'forgiving' parsing in parse5: there is no exact definition of 'forgiving parsing', so people will always be somehow unhappy with it.
Conclusion: cancelled 🎉

lol, ok 😢