Don’t remove <html> element from DOM tree
danburzo opened this issue · comments
On this page, the <html>
element has the class sidebar-visible
, which flags it for removal from DOM. Due to the way the code happens to work, this seems inconsequential to fetching the readable contents of the page, but when using Readability programmatically, in causes documentElement
on the passed Document become null
.
I believe the <html>
element should probably be spared in most (all?) cases.
Unfortunately, what readability does/doesn't do to the original document is not really a documented/supported part of the library. Arbitrary web pages can also do this (ie remove documentElement
from the DOM and keep running script in the background, and put it back at some later point). It's a little strange, but browsers and libraries have to cope with this (and occasionally it trips us up because it's such a weird situation).
I'm a little hesitant to add logic to try to avoid this situation. It's not a given that the html
node is the root element anyway (for the same reason), and it will be a relatively involved change to update all the code in readability that strips content out to doublecheck that it's not removing the root node. (and then, why stop there? Should we also keep body
? Or other parts?)
The fact that you are apparently relying on this under-documented bit of how readability behaves (ie what it does/doesn't strip out) also means that there may be others relying on this behaviour, and so touching it feels risky.
I say "under" rather than "un" because the readme does actually state:
This removes some elements in the web page, which may be undesirable. You can avoid this by passing the clone of the
document
object to theReadability
constructor
and this would probably fix your problem.
All in all, I think I would prefer not to touch this, so I'm going to close this out. I could be convinced otherwise, but for that it'd be helpful to understand more about why you think this is important and worth doing.