Missing section titles from substack.com
francisfeng opened this issue
Steps to reproduce:
- Open https://ryancraventech.substack.com/p/the-case-against-100-code-coverage in Firefox.
- Enable reader view for the link.
- Section titles like “Here are some key reasons why 100% coverage should not be the ultimate goal:” are missing in the reader view.
Looks like there is a div in the h2, which should not be allowed as it is not phrasing content.
We can look into updating the parsing to handle that somewhat common use case.
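To check whether a given page is affected by this, a minimal sketch along these lines might help. The helper name findHeadingsWithBlockChildren and the BLOCK_SELECTOR list are assumptions made for illustration; neither is part of Readability, and the selector covers only a handful of common block-level elements.

```javascript
// Hypothetical helper, not part of Readability.
// Assumed (incomplete) list of block-level elements that are not phrasing content.
const BLOCK_SELECTOR = "div, p, section, ul, ol, table, blockquote";

// Returns the headings in `document` that contain at least one block-level descendant.
function findHeadingsWithBlockChildren(document) {
  const headings = document.querySelectorAll("h1, h2, h3, h4, h5, h6");
  return Array.from(headings).filter(
    (heading) => heading.querySelector(BLOCK_SELECTOR) !== null
  );
}
```

Running it against the parsed document before handing it to Readability gives a quick list of the headings that are likely to be dropped.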
Hello @fchasen, thank you for considering this issue!
I would be happy to send a PR, but I don't know how to approach this. I expect it might be related to the _cleanConditionally or _cleanHeaders functions, but I'm not sure where to start to be honest.
Hoping someone here can point me in the right direction; I would appreciate any guidance. 🙏
One approach I used was to clean the DOM manually first, before passing it to Readability. In this case, say you have:
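import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";

// `url` is assumed to hold the address of the article.
const response = await fetch(url);
const text = await response.text();
const dom = new JSDOM(text, { url });
// Clean up the headings before Readability sees the document.
preprocessDom(dom);
const reader = new Readability(dom.window.document);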
In preprocessDom you could do:
function preprocessDom(dom) {
  const document = dom.window.document;
  const headers = document.querySelectorAll("h1, h2, h3, h4, h5, h6");
  headers.forEach((header) => {
    // Remove classes and IDs, to overcome the 'negative regex'
    header.removeAttribute("class");
    header.removeAttribute("id");
    // Remove all child nodes, the anchor divs in substack's case.
    const textContent = header.textContent;
    while (header.firstChild) {
      header.removeChild(header.firstChild);
    }
    // Add only text content
    header.textContent = textContent;
  });
}
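One thing the snippet above does not address is whitespace: textContent concatenates the text of every descendant, so a heading that wrapped its text in nested elements may end up with stray newlines or runs of spaces. A small sketch of a cleanup step follows; normalizeHeadingText is a hypothetical helper name, not something Readability or jsdom provides.

```javascript
// Hypothetical helper: collapses any run of whitespace into a single space
// and trims the ends, so flattened heading text reads as one clean line.
function normalizeHeadingText(text) {
  return text.replace(/\s+/g, " ").trim();
}
```

In preprocessDom it could be applied when writing the text back, for example header.textContent = normalizeHeadingText(textContent).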
Mentioning another issue here as well, in case someone there needs a quick fix: "H1 Headers ignored/skipped on https://www.astralcodexten.com/p/practically-a-book-review-rootclaim"