mozilla / readability

A standalone version of the readability lib

Readability Aggressively Strips Important Content If an HTML Element's ID and/or Class Name Uses Certain Words

AppTyrant opened this issue

Readability will strip out the content of almost any HTML element whose id or class matches the unlikelyCandidates regular expression, without taking into account what kind of element the candidate is or where in the DOM tree it is located. So if an element's id or class name contains any of the following words, it will most likely be stripped from Firefox's Reader View:

/-ad-|ai2html|banner|breadcrumbs|combx|comment|community|cover-wrap|disqus|extra|footer|gdpr|header|legends|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|yom-remote/i
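
To see how broadly this pattern matches, here is a minimal sketch that tests a few class and id strings against it. The pattern is the one quoted above, written with JavaScript's lowercase `i` (case-insensitive) flag. The helper `looksUnlikely` is hypothetical and deliberately simplified: it assumes the check runs against the class name and id joined into one string, and it leaves out any other conditions Readability applies before removing a node.

```javascript
// The pattern quoted in this issue, with the case-insensitive flag.
const unlikelyCandidates = /-ad-|ai2html|banner|breadcrumbs|combx|comment|community|cover-wrap|disqus|extra|footer|gdpr|header|legends|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|yom-remote/i;

// Hypothetical helper (not Readability's API): true when the combined
// class/id string contains any of the flagged words as a substring.
function looksUnlikely(className, id) {
  const matchString = className + " " + id;
  return unlikelyCandidates.test(matchString);
}

console.log(looksUnlikely("gdpr-Paragraph", ""));    // true
console.log(looksUnlikely("", "gdpr-ArticleTitle")); // true
console.log(looksUnlikely("", "page-header-title")); // true
console.log(looksUnlikely("article-body", ""));      // false
```

Because the match is a plain substring test, a word such as `gdpr` or `header` anywhere in the name is enough to flag the element.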

This causes Reader View to have missing content on many pages. Consider the following contrived example:

<!DOCTYPE html><html><head><meta charset="UTF-8"><title>GDPR Article</title></head>
<body>
<main>
<h2 id="gdpr-ArticleTitle">How Does GDPR Improve Privacy?</h2>
<p id="gdpr-IntroParagraph"><strong>This is the introductory paragraph.</strong></p>
<p class="gdpr-Paragraph">1) Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<p class="gdpr-Paragraph">2) Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<p class="gdpr-Paragraph">3) Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<p>4) Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<p>5) Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<p id="gdpr-LastParagraph"><strong>Last</strong></p>
</main>
</body>
</html>

Now if you open that HTML page in Firefox and activate Reader View, many of the HTML elements within the main element get stripped. Perhaps Readability could mitigate this problem by doing one or more of the following:

  1. Take into account the tagName of element before deciding that it's an unlikely candidate that should be removed. Elements that primarily contain text, like h1, span, em, p, etc., could perhaps be excluded from this check (elements of these types that really should be removed will likely be wrapped in a container like a div that will fail this check).

  2. Take into account the element's location in the HTML document. If the element is a descendant of the main or article element (or a div with role="main") maybe it shouldn't be removed based on id or class name.

Or maybe, when possible:

  3. Elements that fail the unlikely candidates test could perhaps be held in a collection but shouldn't be removed from the document until after the article is "grabbed" and it can be determined that these unlikely candidates do not descend from the "grabbed article."
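
The first two suggestions can be sketched as optional exemptions layered on top of the name check. This is only an illustration under stated assumptions: the node shape (`tagName`, `className`, `id`, `role`, `parent`), the `TEXT_TAGS` set, and the functions `hasMainAncestor` and `shouldStrip` are all hypothetical, use plain objects rather than real DOM nodes, and are not part of Readability.

```javascript
// The pattern quoted in this issue, with the case-insensitive flag.
const UNLIKELY = /-ad-|ai2html|banner|breadcrumbs|combx|comment|community|cover-wrap|disqus|extra|footer|gdpr|header|legends|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|yom-remote/i;

// Assumed list of "mostly text" tags for suggestion 1 (illustrative only).
const TEXT_TAGS = new Set(["H1", "H2", "H3", "H4", "H5", "H6", "P", "SPAN", "EM", "STRONG"]);

// Suggestion 2: is the node inside <main>, <article>, or role="main"?
// For brevity this accepts role="main" on any element, not only a div.
function hasMainAncestor(node) {
  for (let n = node.parent; n; n = n.parent) {
    if (n.tagName === "MAIN" || n.tagName === "ARTICLE" || n.role === "main") {
      return true;
    }
  }
  return false;
}

// Decide whether a node would be stripped, with each exemption opt-in.
function shouldStrip(node, { exemptTextTags = false, exemptInsideMain = false } = {}) {
  const matchString = (node.className || "") + " " + (node.id || "");
  if (!UNLIKELY.test(matchString)) return false;
  if (exemptTextTags && TEXT_TAGS.has(node.tagName)) return false;
  if (exemptInsideMain && hasMainAncestor(node)) return false;
  return true;
}

const main = { tagName: "MAIN", parent: null };
const para = { tagName: "P", className: "gdpr-Paragraph", parent: main };
const box = { tagName: "DIV", className: "sponsor-box", parent: main };

console.log(shouldStrip(para));                             // true
console.log(shouldStrip(para, { exemptTextTags: true }));   // false
console.log(shouldStrip(box, { exemptTextTags: true }));    // true
console.log(shouldStrip(box, { exemptInsideMain: true }));  // false
```

Note that the location-based exemption spares everything under main, including the hypothetical sponsor box, so the two options behave quite differently on the same page.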

It's difficult to see how to address this issue, given it's filed with only the contrived example. If Readability stopped stripping all this content, we would regress many real-world sites.

Take into account the tagName of element

This assumes that websites normally use semantic elements. Happily, some do, but many do not: some blog platforms still use <div> instead of <p>, and #776 and #784 are on high-profile sites that make similar mistakes. In your list, span is also regularly used for non-text. The containing div will not fail any checks if it does not itself have a similar class/id.

Take into account the element's location in the HTML document.

This has a similar problem to the above. Additionally, some of these classes/IDs are meant to be stripped, even inside the main body of the article. Ads, share buttons, related articles, or other junk in the main content shouldn't be preserved just because they happen to be underneath an <article> tag.

Elements that fail the unlikely candidates test [...] shouldn't be removed from the document until after the article is "grabbed"

This is an interesting idea, but tricky to implement. It will also have the side effect of affecting scores based on content that is currently removed, and have similar problems to the second suggestion, in that we do intend to strip some of the content.
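
To make the third suggestion concrete, here is a minimal sketch of just the bookkeeping it implies: flagged nodes are collected, and only after an article node has been chosen are they split into those inside it and those outside it. Everything here is hypothetical (the plain-object node shape and the functions `isDescendantOf` and `partitionDeferred`), and it does not model scoring at all, which is where the difficulty described above lies.

```javascript
// Walk up the (hypothetical) parent chain looking for `ancestor`.
function isDescendantOf(node, ancestor) {
  for (let n = node.parent; n; n = n.parent) {
    if (n === ancestor) return true;
  }
  return false;
}

// Split previously flagged nodes once the article has been grabbed:
// nodes inside the article are kept, all others are removed.
function partitionDeferred(flagged, grabbedArticle) {
  const keep = [];
  const remove = [];
  for (const node of flagged) {
    (isDescendantOf(node, grabbedArticle) ? keep : remove).push(node);
  }
  return { keep, remove };
}

const body = { tagName: "BODY", className: "", parent: null };
const article = { tagName: "ARTICLE", className: "", parent: body };
const para = { tagName: "P", className: "gdpr-Paragraph", parent: article };
const share = { tagName: "DIV", className: "social-share", parent: article };
const footer = { tagName: "DIV", className: "footer-links", parent: body };

const result = partitionDeferred([para, share, footer], article);
console.log(result.keep.map((n) => n.className));   // [ 'gdpr-Paragraph', 'social-share' ]
console.log(result.remove.map((n) => n.className)); // [ 'footer-links' ]
```

In this toy input the share bar inside the article is kept alongside the paragraph, which is the kind of content we do intend to strip.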

I appreciate the write-up and the suggestions, and your noting this as a potential general issue. But my experience is that this is much harder to fix in a general fashion than by making small adjustments when we run into concrete examples of the problem. For instance, there are several targeted fixes that could address #799 (like pre-processing header elements to simplify them to only text, or excluding elements in headers from these checks, which is kind of like 1, but specific to headers). So I'm inclined to close this in favour of addressing specifics, where it's also easier to add realistic tests and prioritize based on the importance/severity of the issue.