Try fetching/parsing HTML client-side
rviscomi opened this issue
In HTTPArchive/custom-metrics#12 (comment), @jroakes shared a screencast of some code that lets a document parse its own body. Taking inspiration from that, I've modified the code a bit to produce a snippet that yields the entire contents of the static (server-rendered) `<head>`:
// Fetch the raw, server-rendered HTML of the current page.
async function parse_own_body() {
  const url = document.location.href;
  const response = await fetch(url);
  return response.text();
}

let html = await parse_own_body();
// Rename <head> and </head> so the parser treats it as an ordinary element.
// The lookahead keeps other tags that start with "head" (e.g. <header>) from matching.
html = html.replace(/(<\/?)head(?=[\s\/>])/ig, '$1static-head');
const staticDoc = document.implementation.createHTMLDocument("New Document");
staticDoc.documentElement.innerHTML = html;
const staticHead = staticDoc.querySelector('static-head');
staticHead;
The insight I had was that by renaming the `head` element to anything else, we can circumvent the HTML parser's behavior of truncating the head at the first element that isn't valid inside it. Combined with fetching the raw HTML from the server, this should give us a pristine copy of the original `head` to use for capo analysis.
![Screen Shot 2023-06-20 at 9 28 29 PM](https://private-user-images.githubusercontent.com/1120896/247371177-94d32a08-ee26-4600-bf5b-836e46fa7f73.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjA1NzEyODcsIm5iZiI6MTcyMDU3MDk4NywicGF0aCI6Ii8xMTIwODk2LzI0NzM3MTE3Ny05NGQzMmEwOC1lZTI2LTQ2MDAtYmY1Yi04MzZlNDZmYTdmNzMucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDcxMCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA3MTBUMDAyMzA3WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NDEzZjdlODRiMjVjNTY4ZjNjMDZiNmUzNDhhYWYzM2EwMDEzZmQ0MjVjODVkOWJkN2MzYTg4ZjRkZDdjODkyMSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.N4yo6L4qbn3F8QX59Gh-I6rOPx1W2junRC3DdQrU4ag)
This screenshot shows it working on the NYT site.
It should be possible to drop this approach into capo.js and fall back to the dynamic `head` as needed. I'd imagine some CSP rules would block this use of `fetch`.
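A rough sketch of what that fallback could look like, assuming the fetch simply rejects (or returns a non-OK status) when it is blocked. The function name `getHeadForAnalysis` and the `isStatic` flag are made up for illustration and are not part of capo.js; the `doc` and `fetchFn` parameters exist only so the logic can be exercised outside a browser.

```javascript
// Hypothetical helper (not a capo.js API): try the static, server-rendered
// head first and fall back to the live document.head if that fails.
async function getHeadForAnalysis(doc = document, fetchFn = (url) => fetch(url)) {
  try {
    const response = await fetchFn(doc.location.href);
    if (!response.ok) {
      throw new Error(`Unexpected status ${response.status}`);
    }
    let html = await response.text();
    // Same renaming trick as the snippet above.
    html = html.replace(/(<\/?)head(?=[\s\/>])/ig, '$1static-head');
    const staticDoc = doc.implementation.createHTMLDocument('New Document');
    staticDoc.documentElement.innerHTML = html;
    const staticHead = staticDoc.querySelector('static-head');
    if (staticHead) {
      return { head: staticHead, isStatic: true };
    }
  } catch (err) {
    // Assumption: a blocked or failed request ends up here.
    // Either way, fall through to the dynamic head.
  }
  return { head: doc.head, isStatic: false };
}
```

In a page context this would be called as `const { head, isStatic } = await getHeadForAnalysis();`, with `isStatic` telling the caller which copy of the head it is looking at.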
There might be sandbox limitations of doing something like this in a Chrome extension. I'll investigate.
Maybe the NY Times changed something, but now there are 3 `<head>` tags in the HTML.