microcosm-cc / bluemonday

bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS

Home Page: https://github.com/microcosm-cc/bluemonday


Sanitization removes spacing

atombender opened this issue

Consider the following string:

One study<sup>1</sup> demonstrated associations between green space exposure and improvement in behaviors and symptoms

When sanitized:

One study1 demonstrated associations between green space exposure and improvement in behaviors and symptoms

This is detrimental to tokenization, because what were three tokens in the original text, [one, study, 1], now become two: [one, study1].

I propose adding some mechanism that lets the caller control what kinds of breaks are inserted. For this, we'd probably want a zero-width space, so that we can indicate that there is no actual space but that the two sides are different tokens.

commented

This is not a bug. HTML is not whitespace sensitive outside of pre elements and similar.

With its tags stripped, study<sup>1</sup> is equivalent to study1. If you want the <sup> to remain, add it to your list of allowed elements.

Or, if you want whitespace at that position, pre-process the input to introduce it before sanitizing. Doing so is harmless where whitespace already existed, because HTML is not whitespace sensitive and multiple or mixed forms of whitespace are collapsed.

It occurs to me that you might be doing text processing after running bluemonday, but the use case for bluemonday is quite narrow: safely sanitize untrusted HTML for display in a web client (without introducing XSS or other security risks). We explicitly aren't trying to clean up or otherwise correct the output, as attempting to fulfil two purposes risks compromising the security purpose.