rviscomi / capo.js

Get your <head> in order

Home Page: https://rviscomi.github.io/capo.js/

Try fetching/parsing HTML client-side

rviscomi opened this issue · comments

In HTTPArchive/custom-metrics#12 (comment) @jroakes shared a screencast of some code for a document to parse its own body. Taking inspiration from that, I've modified the code a bit to produce a snippet that yields the entire contents of the static (server-rendered) <head>:

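// Run in the DevTools console, which supports top-level await.
// Despite its name, this helper only fetches the raw, server-rendered HTML.
async function parse_own_body() {
    const url = document.location.href;
    const response = await fetch(url);
    return response.text();
}

let html = await parse_own_body();
// Rename <head> and </head> to a custom element name. The lookahead
// keeps <header> and </header> from being renamed as well.
html = html.replace(/(<\/?)head(?=[\s\/>])/ig, '$1static-head');
// Parse the modified markup in a detached document so the live page is untouched.
const staticDoc = document.implementation.createHTMLDocument("New Document");
staticDoc.documentElement.innerHTML = html;
const staticHead = staticDoc.querySelector('static-head');
staticHead;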

The insight I had was that renaming the head element to anything else circumvents the HTML parser's behavior of ending the <head> early when it encounters an element that isn't valid there, which pushes everything after that point into the <body>. Combined with fetching the raw HTML from the server, this should give us a pristine copy of the original head to use for capo analysis.
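
To make the renaming step concrete, here is a minimal sketch of it as a standalone string function. The name `renameHeadTags` and its `newName` parameter are hypothetical and not part of capo.js. The pattern uses a lookahead so that only `head` tags are renamed and `<header>` is left alone.

```javascript
// Hypothetical helper (not part of capo.js): rename <head> and </head> tags only.
// The lookahead requires whitespace, "/" or ">" right after "head",
// so <header> and </header> are not touched.
function renameHeadTags(html, newName = 'static-head') {
  return html.replace(/(<\/?)head(?=[\s\/>])/gi, `$1${newName}`);
}

const sampleHtml =
  '<html><head><title>Hi</title></head><body><header>Top</header></body></html>';

// Logs:
// <html><static-head><title>Hi</title></static-head><body><header>Top</header></body></html>
console.log(renameHeadTags(sampleHtml));
```

Because this is a plain text substitution, it would also rename a literal `<head>` that appears inside a script or a comment, so treat it as a sketch rather than a robust rewriter.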

Screen Shot 2023-06-20 at 9 28 29 PM

This screenshot shows it working on the NYT site.

It should be possible to drop this approach into capo.js and fall back to the dynamic head as needed. I'd imagine some CSP rules would block this use of fetch.
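
One way the fallback could look, as a rough sketch only: the function name `getHeadWithFallback`, its parameters, and the returned `{ head, isStatic }` shape are all assumptions for illustration, not capo.js APIs. The static loader is passed in as a function so the fallback logic stays independent of how the static head is obtained.

```javascript
// Hypothetical helper (not part of capo.js).
// `loadStaticHead` is any async function that resolves to the static head
// element, or null if none was found. `dynamicHead` is the fallback,
// typically document.head.
async function getHeadWithFallback(loadStaticHead, dynamicHead) {
  try {
    const staticHead = await loadStaticHead();
    // querySelector returns null when the fetched HTML had no matching element.
    if (staticHead) {
      return { head: staticHead, isStatic: true };
    }
  } catch (err) {
    // fetch can reject, for example on a network error or a blocked request.
    console.warn('Static head unavailable, using the dynamic head:', err);
  }
  return { head: dynamicHead, isStatic: false };
}
```

In the page, this might be called as `getHeadWithFallback(loadStaticHead, document.head)`, where `loadStaticHead` wraps the fetch-and-parse snippet from the top of this issue. Returning `isStatic` lets the caller report which head was actually analyzed.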

There might be sandbox limitations of doing something like this in a Chrome extension. I'll investigate.

Amusing side note: while testing this I discovered that the raw NYT HTML actually includes two <head> sections 😄

Maybe the NY Times changed something, but now there are 3 <head> tags in the HTML.

Now I count 4 😱

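
For anyone who wants to repeat the count, a small sketch along these lines could be run against the raw HTML. The helper name `countHeadOpenTags` is hypothetical. It is a plain text match, so it also counts a literal `<head>` that appears inside a comment, a script, or an inline string.

```javascript
// Hypothetical helper: count opening <head> tags in a raw HTML string.
// Closing tags (</head>) and <header> tags are not counted.
function countHeadOpenTags(html) {
  const matches = html.match(/<head(?=[\s\/>])/gi);
  return matches ? matches.length : 0;
}

const rawHtml =
  '<html><head></head><head lang="en"></head><header></header></html>';
console.log(countHeadOpenTags(rawHtml)); // 2
```

In the DevTools console, `countHeadOpenTags(await parse_own_body())` would apply it to the current page's server-rendered HTML, using the fetch helper from the first comment.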