Missing section titles from substack.com

francisfeng opened this issue 10 months ago · comments

Francis Feng commented 10 months ago

Steps to reproduce:

1. Open https://ryancraventech.substack.com/p/the-case-against-100-code-coverage in Firefox.
2. Enable reader view for the page.
3. Observe that section titles such as “Here are some key reasons why 100% coverage should not be the ultimate goal:” are missing from the reader view.

Fred Chasen · Answer 1 · Sat Jan 13 2024 00:44:08 GMT+0800 (China Standard Time)

Looks like there is a div inside the h2, which is invalid HTML: a heading may only contain phrasing content, and a div is not phrasing content.

We can look into updating the parsing to handle that somewhat common use case.
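To make the diagnosis above concrete, here is a minimal sketch of how one might find headings that contain block-level children. The helper name findHeadingsWithBlockChildren and the BLOCK_TAGS list are illustrative assumptions, not part of Readability, and the list is only a small subset of non-phrasing elements. The function only needs an object that exposes querySelectorAll, such as a browser or JSDOM document.

```javascript
// Hypothetical helper for illustration; this is NOT Readability's API.
// BLOCK_TAGS is an assumed, incomplete list of non-phrasing elements.
const BLOCK_TAGS = new Set(["DIV", "P", "SECTION", "ARTICLE", "UL", "OL", "TABLE"]);

// Returns the headings that have at least one direct child whose tag
// is in BLOCK_TAGS. Only direct children are inspected, not deeper descendants.
function findHeadingsWithBlockChildren(document) {
  const headings = document.querySelectorAll("h1, h2, h3, h4, h5, h6");
  return Array.from(headings).filter((heading) =>
    Array.from(heading.children).some((child) =>
      BLOCK_TAGS.has(child.tagName.toUpperCase())
    )
  );
}
```

Running this against the Substack page from the report should flag the h2 elements that wrap an anchor div, which can help confirm whether a given page is affected by the same markup pattern.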

Laurent Cazanove · Answer 2 · Fri Jan 19 2024 05:23:09 GMT+0800 (China Standard Time)

Hello @fchasen, thank you for considering this issue!

I would be happy to send a PR, but I don't know how to approach this. I suspect it is related to the _cleanConditionally or _cleanHeaders functions, but to be honest I'm not sure where to start.

I hope someone here can point me in the right direction; I would appreciate any guidance. 🙏

Hasir Mushtaq · Answer 3 · Sun Jun 02 2024 01:21:30 GMT+0800 (China Standard Time)

One approach I used was to clean the DOM manually before passing it to Readability. For example, say you have:

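    import { JSDOM } from "jsdom";
    import { Readability } from "@mozilla/readability";

    // `url` is assumed to be defined elsewhere (the page to parse).
    const response = await fetch(url);
    const text = await response.text();

    // Build the DOM, then clean up the headings before Readability sees them.
    const dom = new JSDOM(text, { url });
    preprocessDom(dom);

    const reader = new Readability(dom.window.document);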

In preprocessDom, you could do:

    function preprocessDom(dom) {
      const document = dom.window.document;
      const headers = document.querySelectorAll("h1, h2, h3, h4, h5, h6");

      headers.forEach((header) => {
        // Remove classes and IDs, to overcome the 'negative regex'
        header.removeAttribute("class");
        header.removeAttribute("id");

        // Replace all child nodes (the anchor divs in Substack's case) with
        // the heading's plain text. Assigning textContent removes every
        // existing child, so no explicit removal loop is needed.
        const textContent = header.textContent;
        header.textContent = textContent;
      });
    }

Mentioning another issue here as well, in case someone there needs a quick fix: "H1 Headers ignored/skipped" on https://www.astralcodexten.com/p/practically-a-book-review-rootclaim