PermitScrubber treats ProcessingInstructions as Elements

Question

PermitScrubber treats ProcessingInstructions as Elements

dmpotter44 opened this issue 3 years ago · comments

The PermitScrubber class treats ProcessingInstructions as if they were Elements. (Actually, strictly speaking, it doesn't check the node type at all, beyond special handling for CDATA nodes, it just checks the name field.) This leads to some odd behavior:

require 'rails-html-sanitizer'
sanitizer = Rails::Html::SafeListSanitizer.new
# Sanitizer is using PermitScrubber with default properties
# The actual issue is within PermitScrubber, but for simplicity in showing the
# issue, these examples use a SafeListSanitizer
sanitizer.sanitize('<a href=example.html>Expected link</a><?a href=pi.html>not expected PI')
# => "<a href=\"example.html\">Expected link</a><?a href=pi.html>not expected PI"

However, this is because the PI has a white-listed name. So, for example:

sanitizer.sanitize('<b>Bold</b><?made up>')
# => "<b>Bold</b>"

I haven't figured out any way to do anything evil with this in modern browsers because modern browsers seem to correctly parse and ignore processing instructions. However, the contents of the PI are passed through to the final result unmodified, so:

sanitizer.sanitize('<?a <script>alert("Hello")//<?a </script>')
# => "<?a <script>alert(\"Hello\")//<?a </script>"

There was apparently an issue in older versions of Internet Explorer where PIs were not parsed correctly, potentially allowing the script to be run. It does not appear to run in modern browsers, though. (Chrome just changes the PIs into comments.)

It's worth noting that, yes, HTML has processing instructions, because SGML has processing instructions. <?a something> is a valid HTML processing instruction. It doesn't do anything, but it's valid SGML and may be parsed as such. (Unlike XML PIs, SGML PIs end with just > and not ?>.)

The solution is likely to update the scrub method in PermitScrubber to remove PIs. Alternatively, the allowed_node? method could be updated to determine if the node is really an element before checking it against the whitelist.

Mike Dalessio · Answer 1 · Wed May 26 2021 12:33:21 GMT+0800 (China Standard Time)

@dmpotter44 Thanks for opening this issue! I've reproduced what you're seeing.

I'll take a deeper look when I have a bit more time, but I think I agree with your assessment that these nodes should be stripped when sanitizing.

Mike Dalessio · Answer 2 · Mon Jul 19 2021 11:17:31 GMT+0800 (China Standard Time)

PR submitted at #116 to address this.