Try fetching/parsing HTML client-side

Question

rviscomi opened this issue a year ago

In HTTPArchive/custom-metrics#12 (comment) @jroakes shared a screencast of some code for a document to parse its own body. Taking inspiration from that, I've modified the code a bit to produce a snippet that yields the entire contents of the static (server-rendered) <head>:

// Fetch the raw, server-rendered HTML of the current page.
async function parse_own_body() {
    const url = document.location.href;
    const response = await fetch(url);
    return response.text();
}

let html = await parse_own_body();
// Rename <head> so the parser treats it as an unknown element.
// The lookahead avoids also renaming <header>.
html = html.replace(/(<\/?)head(?=[\s>\/])/ig, '$1static-head');
// Parse into a detached document so nothing in the live page is touched.
const staticDoc = document.implementation.createHTMLDocument("New Document");
staticDoc.documentElement.innerHTML = html;
const staticHead = staticDoc.querySelector('static-head');
staticHead;

The insight I had was that by renaming the head element to anything else, we can circumvent the HTML parser's behavior of ending the head at the first element that isn't allowed there and moving everything after it into the body. Combined with fetching the raw HTML from the server, this should give us a pristine copy of the original head to use for capo analysis.
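To make the renaming step easier to reason about in isolation, here is a minimal sketch of it as a pure string function. The helper name `renameHeadTags` and the sample markup are hypothetical, not part of capo.js. The regex uses a lookahead so that `<header>` tags are left alone.

```javascript
// Hypothetical helper: rename <head> and </head> so the HTML parser
// treats the head as an unknown element and keeps all of its children.
function renameHeadTags(html, newName = 'static-head') {
  // The lookahead requires whitespace, ">" or "/" after "head",
  // so <header> and names like <head-foo> are not renamed.
  return html.replace(/(<\/?)head(?=[\s>\/])/gi, `$1${newName}`);
}

// Made-up sample: a <div> inside the head would normally end the head early.
const sample =
  '<html><head><title>t</title><div>oops</div><meta name="x"></head>' +
  '<body><header>h</header></body></html>';

const renamed = renameHeadTags(sample);
console.log(renamed);
```

The renamed string can then be assigned to `innerHTML` on a detached document, as in the snippet above, and queried with `querySelector('static-head')`.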

This screenshot shows it working on the NYT site.

It should be possible to drop this approach into capo.js and fall back to the dynamic head as needed; I'd imagine some CSP rules would block this use of fetch.
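The fallback could look roughly like the sketch below. This is not how capo.js is actually structured: `getHeadForAnalysis`, `loadStaticHead`, and `getDynamicHead` are hypothetical names, and the two callbacks are passed in so that the logic does not depend on browser globals.

```javascript
// Hypothetical wrapper: prefer the static head, fall back to the live one.
// loadStaticHead: async function that resolves to the parsed static head
//                 (or null), and may reject, e.g. if fetch is blocked.
// getDynamicHead: function returning the live head, e.g. () => document.head.
async function getHeadForAnalysis(loadStaticHead, getDynamicHead) {
  try {
    const staticHead = await loadStaticHead();
    if (staticHead) {
      return { head: staticHead, isStatic: true };
    }
  } catch (err) {
    // The fetch or parse failed (CSP, network error, ...); fall through.
  }
  return { head: getDynamicHead(), isStatic: false };
}
```

In a page it might be called as `getHeadForAnalysis(loadStaticHead, () => document.head)`, with `loadStaticHead` wrapping the fetch-and-parse snippet above. The `isStatic` flag is there so the caller can tell which head it is looking at.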

There might be sandbox limitations on doing something like this in a Chrome extension. I'll investigate.

Rick Viscomi · Answer 1 · Wed Jun 21 2023 09:37:37 GMT+0800 (China Standard Time)

Amusing side note: while testing this I discovered that the raw NYT HTML actually includes two <head> sections 😄

janwillemwilmsen · Answer 2 · Mon Jun 26 2023 20:01:08 GMT+0800 (China Standard Time)

Maybe the NY Times changed something, but now there are 3 <head> tags in the HTML.

Rick Viscomi · Answer 3 · Tue Jun 27 2023 21:29:18 GMT+0800 (China Standard Time)

Now I count 4 😱