mozilla / readability

A standalone version of the readability lib

Missing section titles from substack.com

francisfeng opened this issue

Steps to reproduce:

  1. Open https://ryancraventech.substack.com/p/the-case-against-100-code-coverage in Firefox.
  2. Enable reader view for the link.
  3. Section titles like “Here are some key reasons why 100% coverage should not be the ultimate goal:” are missing in the reader view.

Screenshots: original webpage, reader view

Looks like there is a div inside the h2, which the HTML spec does not allow: heading elements may only contain phrasing content, and a div is not phrasing content.
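To check whether a page has this problem, something like the following could work. It is a rough sketch, not part of Readability: `PHRASING_TAGS`, `isPhrasingTag`, and `findHeadingsWithFlowContent` are hypothetical names, the tag list is an abbreviated subset of the spec's phrasing content category, and `document` is assumed to be any standard DOM document (for example `dom.window.document` from jsdom).

```javascript
// Abbreviated subset of the HTML spec's "phrasing content" elements.
// Assumption: this list is illustrative, not exhaustive.
const PHRASING_TAGS = new Set([
  "a", "abbr", "b", "bdi", "bdo", "br", "cite", "code", "data", "dfn",
  "em", "i", "img", "kbd", "mark", "q", "s", "samp", "small", "span",
  "strong", "sub", "sup", "time", "u", "var", "wbr",
]);

// tagName is upper case in HTML documents, so normalise before the lookup.
function isPhrasingTag(tagName) {
  return PHRASING_TAGS.has(tagName.toLowerCase());
}

// Returns the headings that contain at least one descendant element
// outside the phrasing subset above (e.g. a div inside an h2).
function findHeadingsWithFlowContent(document) {
  const headings = document.querySelectorAll("h1, h2, h3, h4, h5, h6");
  return Array.from(headings).filter((heading) =>
    Array.from(heading.querySelectorAll("*")).some(
      (el) => !isPhrasingTag(el.tagName)
    )
  );
}
```

Logging `findHeadingsWithFlowContent(document).length` before calling Readability gives a quick hint whether a page is likely to be affected.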

We can look into updating the parsing to handle that somewhat common use case.

Hello @fchasen, thank you for considering this issue!

I would be happy to send a PR, but I don't know how to approach this. I expect it might be related to the _cleanConditionally or _cleanHeaders functions, but I'm not sure where to start to be honest.

Hoping someone here can point me in the right direction; I would appreciate any guidance. 🙏

One approach I used was to clean the DOM manually first, before passing it to Readability. In this case, say you have:

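    import { JSDOM } from "jsdom";
    import { Readability } from "@mozilla/readability";

    const response = await fetch(url);
    const text = await response.text();

    // Passing the URL lets jsdom resolve relative links.
    const dom = new JSDOM(text, { url });
    preprocessDom(dom);

    const reader = new Readability(dom.window.document);
    const article = reader.parse();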

In preprocessDom you could do:

    function preprocessDom(dom) {
      const document = dom.window.document;
      const headers = document.querySelectorAll("h1, h2, h3, h4, h5, h6");

      headers.forEach((header) => {
        // Remove classes and IDs so they cannot match Readability's 'negative' regex.
        header.removeAttribute("class");
        header.removeAttribute("id");

        // Replace all child nodes (the anchor divs in Substack's case) with plain text.
        // Assigning textContent removes every existing child, so no manual loop is needed.
        const text = header.textContent;
        header.textContent = text;
      });
    }
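If flattening every header to plain text is too aggressive (it also drops links and emphasis inside the header), a gentler variant might unwrap only the div elements and leave inline markup in place. This is a sketch under assumptions: `unwrapDivsInHeadings` is a hypothetical helper, it takes a standard DOM `document`, and I have not verified that unwrapping alone is enough for Readability to keep the header on every site.

```javascript
// Replaces each div inside a heading with the div's own child nodes,
// so text and inline elements survive but the block-level wrapper is gone.
// Returns the number of divs that were unwrapped.
function unwrapDivsInHeadings(document) {
  const headings = document.querySelectorAll("h1, h2, h3, h4, h5, h6");
  let unwrapped = 0;

  headings.forEach((heading) => {
    // Copy the matches into an array before mutating the tree.
    Array.from(heading.querySelectorAll("div")).forEach((div) => {
      // replaceWith() with no arguments simply removes an empty div.
      div.replaceWith(...div.childNodes);
      unwrapped += 1;
    });
  });

  return unwrapped;
}
```

It would be called in the same place as preprocessDom, e.g. `unwrapDivsInHeadings(dom.window.document)` before constructing the Readability instance.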

Mentioning a related issue here as well, in case someone there needs a quick fix: "H1 Headers ignored/skipped on https://www.astralcodexten.com/p/practically-a-book-review-rootclaim"