Don’t remove <html> element from DOM tree

Question

Don’t remove <html> element from DOM tree

danburzo opened this issue 5 months ago · comments

On this page, the <html> element has the class sidebar-visible, which flags it for removal from DOM. Due to the way the code happens to work, this seems inconsequential to fetching the readable contents of the page, but when using Readability programmatically, in causes documentElement on the passed Document become null.

I believe the <html> element should probably be spared in most (all?) cases.

Gijs · Answer 1 · Sat Feb 10 2024 00:37:58 GMT+0800 (China Standard Time)

Unfortunately, what readability does/doesn't do to the original document is not really a documented/supported part of the library. Arbitrary web pages can also do this (ie remove documentElement from the DOM and keep running script in the background, and put it back at some later point). It's a little strange, but browsers and libraries have to cope with this (and occasionally it trips us up because it's such a weird situation).

I'm a little hesitant to add logic to try to avoid this situation. It's not a given that the html node is the root element anyway (for the same reason), and it will be a relatively involved change to update all the code in readability that strips content out to doublecheck that it's not removing the root node. (and then, why stop there? Should we also keep body? Or other parts?)

The fact that you are apparently relying on this under-documented bit of how readability behaves (ie what it does/doesn't strip out) also means that there may be others relying on this behaviour, and so touching it feels risky.

I say "under" rather than "un" because the readme does actually state:

This removes some elements in the web page, which may be undesirable. You can avoid this by passing the clone of the document object to the Readability constructor

and this would probably fix your problem.

All in all, I think I would prefer not to touch this, so I'm going to close this out. I could be convinced otherwise, but for that it'd be helpful to understand more about why you think this is important and worth doing.