mozilla / readability

A standalone version of the readability lib

Parsing Issue with DOMPurify and Readability.js

sgtcoder opened this issue

I am having an issue using DOMPurify together with Readability.js.

The README says we should use DOMPurify to sanitize the content, but I noticed that it strips out the metadata. If I pass the content to Readability.js directly, it parses the data without any issues.

I tried many different configurations through trial and error and found one that comes really close, but the datetime is still missing. I confirmed that the datetime is present in the page's meta tags.

Question: how important is it to pass the content through DOMPurify? It seems like you have to whitelist everything to get it to work, and scripts get stripped automatically by Readability.js anyway.

Here is what I have:

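// Attributes to allow in addition to DOMPurify's defaults.
const WHITELISTED_ATTR = [
  "content",
  "datetime",
  "itemprop",
  "property",
  "type",
  "time",
];

// Tags to allow in addition to DOMPurify's defaults.
const WHITELISTED_TAGS = ["iframe", "video", "meta"];

const domPurifyOptions = {
  ADD_ATTR: WHITELISTED_ATTR,
  ADD_TAGS: WHITELISTED_TAGS,

  // Return the entire document (including <html> and <head>), not just the body content.
  WHOLE_DOCUMENT: true,
};

// response.data holds the fetched page HTML as a string.
const sanitized = DOMPurify.sanitize(response.data, domPurifyOptions);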

Example link: https://ifeeltech.com/it-server-room-setup-guide

commented

Fundamentally, readability can't "find back" any content that DOMPurify strips out, if you run it on the input.
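To make the ordering point concrete, here is a toy sketch in plain JavaScript. Both helpers are hypothetical stand-ins, not the real libraries: `toySanitize` mimics a sanitizer configured to drop `<meta>` tags, and `toyExtractPublishedTime` mimics the kind of metadata lookup an extractor might do. The sample page and its date are made up for illustration.

```javascript
// Toy stand-ins only: neither function is DOMPurify or Readability.

// Pretend sanitizer: removes every <meta ...> tag from an HTML string.
function toySanitize(html) {
  return html.replace(/<meta\b[^>]*>/gi, "");
}

// Pretend extractor: reads the article:published_time meta tag, if present.
function toyExtractPublishedTime(html) {
  const match = html.match(
    /<meta\s+property="article:published_time"\s+content="([^"]*)"/i
  );
  return match ? match[1] : null;
}

// Made-up sample page with a publication date in <head>.
const page =
  '<html><head><meta property="article:published_time" content="2024-01-15T09:00:00Z"></head>' +
  "<body><article><p>Hello</p></article></body></html>";

// Extracting from the raw page finds the date.
const fromRaw = toyExtractPublishedTime(page);

// Extracting after the sanitizer has run finds nothing: the tag is gone.
const fromSanitized = toyExtractPublishedTime(toySanitize(page));

console.log(fromRaw);       // 2024-01-15T09:00:00Z
console.log(fromSanitized); // null
```

Whatever the first step removes is simply not in the input of the second step, so no extractor setting can recover it.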

Whether you need DOMPurify (and with what configuration / arguments) or other sanitization mechanisms, CSP, or other defence in depth, is not really something I can comment on with the information you've shared here.

The security section in the readme is there to ensure that we're clear that readability does not aim to provide any security guarantees itself. As you noted, it removes script tags (because they clutter up the resulting returned content), but it won't necessarily remove other attributes that can lead to script being executed, be they onclick type event handler attributes, <a href="javascript:whatever"> inline JS URLs, or other even more niche techniques.
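A small sketch of why removing `<script>` tags is not the same as sanitizing. The helper `stripScriptTags` is a hypothetical regex-based stand-in for "remove script elements", not Readability's actual code, and regexes are not a sound way to sanitize HTML in general. The checks afterwards look for the two vectors mentioned above.

```javascript
// Hypothetical stand-in for "remove <script> elements"; not Readability's code.
function stripScriptTags(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, "");
}

// Made-up input containing three different ways to run script.
const input =
  '<p onclick="alert(1)">Click me</p>' +
  '<a href="javascript:alert(2)">link</a>' +
  "<script>alert(3)</script>";

const output = stripScriptTags(input);

// The <script> element is gone...
const hasScriptTag = /<script\b/i.test(output);
// ...but inline event handlers and javascript: URLs are untouched.
const hasEventHandler = /\son\w+\s*=/i.test(output);
const hasJsUrl = /href\s*=\s*["']?\s*javascript:/i.test(output);

console.log({ hasScriptTag, hasEventHandler, hasJsUrl });
// { hasScriptTag: false, hasEventHandler: true, hasJsUrl: true }
```

If the output is later inserted into a live page, the remaining handler and URL can still execute, which is why a dedicated sanitizer or other defence is needed somewhere in the pipeline.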

If you are creating the DOM in e.g. node and you're using JSDOM and turning off the script running inside JSDOM (in other words, there is no chance of script running before readability getting its hands on the DOM), you could run DOMPurify on the output of readability? Nothing (including the readme) says you have to run DOMPurify first. The point is just to be explicit that readability itself is not (and is not intended to be) a security mechanism.
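The "extract first, sanitize the output" order could look roughly like the sketch below. It is one possible wiring under stated assumptions, not a recommended or official setup: the names `extractThenSanitize` and `realDeps` are made up for this example, and `realDeps` assumes the npm packages `jsdom`, `@mozilla/readability` and `dompurify` are installed. The two steps are passed in as functions so the ordering can be exercised without the real libraries.

```javascript
// Sketch: run the extractor on the untouched HTML, then sanitize only the
// extracted article HTML. `parse` and `sanitize` are injected dependencies.
function extractThenSanitize(html, url, { parse, sanitize }) {
  const article = parse(html, url); // may be null if nothing was extracted
  if (!article) return null;
  // Only `content` is HTML; it is sanitized here and the other fields are
  // passed through unchanged.
  return { ...article, content: sanitize(article.content) };
}

// One possible wiring with the real libraries (assumed to be installed).
// The require calls are inside the function so this file loads without them.
function realDeps() {
  const { JSDOM } = require("jsdom");
  const { Readability } = require("@mozilla/readability");
  const createDOMPurify = require("dompurify");

  // DOMPurify needs a window object when used outside a browser.
  const purify = createDOMPurify(new JSDOM("").window);

  return {
    parse(html, url) {
      // The runScripts option is not set, so page scripts are never executed.
      const dom = new JSDOM(html, { url });
      return new Readability(dom.window.document).parse();
    },
    sanitize(dirtyHtml) {
      return purify.sanitize(dirtyHtml);
    },
  };
}
```

With the packages installed, a call might look like `extractThenSanitize(response.data, pageUrl, realDeps())`, where `pageUrl` is a placeholder for the address the HTML was fetched from. Because Readability sees the original document, metadata such as the title or byline is still available to it; those fields come back as plain strings and still need escaping if they are rendered as HTML.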

Does that help?

That makes sense: run DOMPurify after running Readability. That seems to work.

I am using Postlight/parser currently since I am getting better parsing results.

Does Pocket use this? I ask because Pocket is currently one of the best article parsers, next to FocusReader (Feedbin API) on Android.

Postlight/parser and Readability still don't match the results of the parsers above, which is strange.