russross / blackfriday

Blackfriday: a markdown processor for Go



Protect against script injection

collosi opened this issue

In the interest of "safety against malicious user input", shouldn't there be an option to prevent the passthrough of script tags?

The safety the README file refers to is against crashing the server. You have a good point, though. You can disable block HTML with the HTML_SKIP_HTML option, but inline tags are permitted, including script tags and attributes like onload. It seems like it would take some careful planning to really eliminate any possibility of javascript injection.

I'm curious how other libraries have handled the issue. Blackfriday was based on upskirt and implements the same feature set. Sundown (a fork of upskirt) seems to have taken over as the engine of choice in C. I wonder if they have addressed this? I'll look into it when I get a chance. If you have any suggestions or insights, please let me know!

You know, thinking about it a bit more, I'm going to guess that an approach like the one described in the answer to this question (http://stackoverflow.com/questions/6659351/removing-all-script-tags-from-html-with-js-regular-expression) is going to be better in the long run anyway. That is, use the browser to parse the resulting HTML and remove scripts at that point.

Yes, you should definitely update the README to clarify that it is safe in the sense that it won't crash, but it is NOT safe in the sense that you could take untrusted user input and display the output to the world.

I'd also like to point out that tag blacklisting (i.e. "we will strip out all script tags") is absolutely the wrong way to do this input sanitising. It does not and cannot work. Browsers are nowhere near well-defined enough. Check out this page: https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet

The only secure way to do it is to strip all HTML (i.e. replace all <'s and >'s with &lt; and &gt;) and then add the features you want as non-HTML extensions (like how the tables are done). This is how major markdown-based sites like Reddit and GitHub do it. This should really be the default for this library. Anything else is asking for trouble.
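A minimal sketch of that preprocessing idea (the helper name here is mine, not part of blackfriday): escape the raw markdown before it ever reaches the processor. Whether the entities survive the markdown pass untouched depends on the renderer, so treat this as illustrative rather than a drop-in fix.

```go
package main

import (
	"bytes"
	"fmt"
)

// escapeAngleBrackets neutralizes raw HTML in untrusted markdown input by
// turning angle brackets into entities before the markdown processor ever
// sees them. Everything that looked like a tag now renders as literal text.
func escapeAngleBrackets(input []byte) []byte {
	input = bytes.Replace(input, []byte("<"), []byte("&lt;"), -1)
	return bytes.Replace(input, []byte(">"), []byte("&gt;"), -1)
}

func main() {
	src := []byte("*hello* <script>alert(1)</script>")
	fmt.Printf("%s\n", escapeAngleBrackets(src))
	// Output: *hello* &lt;script&gt;alert(1)&lt;/script&gt;
}
```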

Great library by the way. I've converted it to javascript using gopherjs and it seems to work fine! (Although I obviously can't use it in anger until it is secure!)

As an example (i.e. not something that you should just fix and think the problem is solved), this input:

<script><script src="http://example.com/exploit.js"></SCRIPT></script>

gives this output:

<p><script src="http://example.com/exploit.js"></SCRIPT></p>

That is with the "common" options, which include HTML_SKIP_SCRIPT. I strongly recommend that you remove HTML_SKIP_SCRIPT for now, as it is extremely misleading: people may think that they are protected from script injection when they aren't.

Once again, this is just one example of why this approach doesn't work and can never work. It is not that there are a few bugs in the code that can be fixed and then it will be secure; rather, the entire approach is flawed.

Sorry if I'm being a bit forceful here. It's an important issue that many people get wrong and this is otherwise a nice library which I would like to be successful. Keep up the good work!

@Timmmm, you seem to be knowledgeable on these matters, can you please take a look at #50?

I took a look, and unfortunately it kind of does exactly what I said not to do! That is, it tries to fix the HTML parsing to catch as many script injection attacks as possible, rather than just making them all impossible by disallowing all HTML.

In fact, in the test code you can see some attacks which still work (the tests are commented out). Those are only the known attacks that still work; you can be sure there are others that haven't been thought of.

That said, I would still apply this patch because it is definitely an improvement, and it is very frustrating to see one's work go to waste (and I kind of hate naysayers).

_However_, this doesn't actually change the security of blackfriday: it should still be considered vulnerable by anyone sensible, and you should probably still update the README to reflect this.

Sorry I don't really have the motivation to fix this properly myself at the moment. I am a total hypocrite!

Yeah, I know it's not what you recommended, because it sanitizes output, not input. I'd love to sanitize input instead, but couldn't come up with a way to do that from your half-sentence hint. Can you elaborate or point to some code that does what you had in mind?

Well, basically replace every < and > in the input with &lt; and &gt;. That way you can be 99.99% sure that there will be no user-supplied HTML in the output. Then you have to add back the HTML features that you've lost as non-HTML extensions.

Hmm, I don't understand how this can allow having inline HTML at all. Replacing angle brackets with lt/gt entities will render inline HTML as readable HTML in the output.

Yeah, exactly. You lose convenience but it is necessary if you want actual security.

Although, having said that, with some care you could whitelist a few tags. Probably not much point though, as the most useful HTML is the most risky.

s/convenience/functionality ;-)

OK, now I see what you mean and it turns out we're talking about slightly different things. I'm trying to sanitize the output (or input, if possible) while keeping inline HTML functionality, and you propose dropping it. Which makes perfect sense, but I will leave the decision about the defaults to @russross. I'll also update the README as you suggested; it should certainly not mislead.

I think Timmmm might be right that the only way to actually make this secure would be to prevent passthrough of HTML altogether. It seems like there are two modes of use here, involving trusted vs. untrusted content. When the content is trusted (e.g. the markdown is generated by the developers/designers/writers of the site itself), protecting against crashes and other undefined behavior is the primary goal. When the content is untrusted (e.g. submitted by users of the site), paranoia is in order, and it is reasonable to take an approach that restricts all HTML content.

You can do that today with HTML_SKIP_HTML, it's just not a part of the default set of flags for MarkdownBasic() or MarkdownCommon().
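For reference, a sketch of wiring that flag in, using the API names as they stood around the time of this thread (they may have changed since, so double-check against the current package). This only shows how to turn the flag on; as noted below, it is not a security guarantee.

```go
package main

import (
	"fmt"

	"github.com/russross/blackfriday"
)

func main() {
	input := []byte("hello <script>alert(1)</script> *world*\n")

	// Ask the HTML renderer to skip raw HTML blocks (as discussed above,
	// this was not airtight at the time).
	htmlFlags := blackfriday.HTML_USE_XHTML | blackfriday.HTML_SKIP_HTML
	renderer := blackfriday.HtmlRenderer(htmlFlags, "", "")

	// A couple of common extensions; MarkdownCommon enables a larger set.
	extensions := blackfriday.EXTENSION_TABLES | blackfriday.EXTENSION_FENCED_CODE

	fmt.Printf("%s", blackfriday.Markdown(input, renderer, extensions))
}
```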

But HTML_SKIP_HTML is not currently secure.

A particular test case reported in #65, copying it here to have all related stuff in one place:

[FUCKLINK][1]

[1]: javascript:alert(window.document.cookie);

I think the only possible way to reliably sanitize the result is actually parsing/tokenizing the HTML after it's been created, then whitelisting tags/attributes based on that, and then generating the output from the parsed HTML.

I've poked around a bit using the go.net/html HTML5 parsing library, and that seems to be working. It might be slightly slower than regexp-based shenanigans (I haven't tested that yet), but it will be safe. I wrote it such that unrecognized HTML gets escaped (<script src=evil> ...), so users can see what's going on.
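For the sake of discussion, here's a rough sketch of that general shape using the tokenizer from that package (nowadays golang.org/x/net/html). The whitelist is invented for illustration and attribute filtering is deliberately left out; it is not the actual patch.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

// allowed is a made-up whitelist for illustration; a real one needs review.
var allowed = map[string]bool{
	"p": true, "em": true, "strong": true, "code": true, "pre": true,
	"ul": true, "ol": true, "li": true, "blockquote": true, "a": true,
}

// sanitize re-tokenizes rendered HTML and escapes any tag that is not on the
// whitelist, so stripped markup stays visible to the reader as plain text.
func sanitize(rendered []byte) []byte {
	var out bytes.Buffer
	z := html.NewTokenizer(bytes.NewReader(rendered))
	for {
		tt := z.Next()
		if tt == html.ErrorToken {
			// io.EOF or a parse error; stop either way.
			return out.Bytes()
		}
		tok := z.Token()
		switch tt {
		case html.StartTagToken, html.EndTagToken, html.SelfClosingTagToken:
			if allowed[strings.ToLower(tok.Data)] {
				// NOTE: attributes (href="javascript:...", onload=...) still
				// need their own whitelist; omitted here for brevity.
				out.WriteString(tok.String())
			} else {
				out.WriteString(html.EscapeString(tok.String()))
			}
		case html.TextToken:
			out.WriteString(tok.String()) // Token.String escapes text tokens
		default:
			// Drop comments, doctypes, etc.
		}
	}
}

func main() {
	out := sanitize([]byte(`<p>ok</p><script src="http://example.com/exploit.js"></script>`))
	fmt.Printf("%s\n", out)
	// The <script> tag survives only as escaped, visible text.
}
```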

Would you be interested in a pull request?

By the way, a possibly more elegant and efficient way would be parsing the entire HTML into a tree or token stream, turning Renderer into a DOM/token-stream-level interface (i.e. send every single opening tag, attribute, etc. to it), and then doing the sanitization in a special sanitizing renderer.
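Purely to illustrate, such a token-level interface might look something like this (all names invented here; this is not blackfriday's actual Renderer):

```go
package markdown

// Attr is a hypothetical parsed attribute, e.g. href="...".
type Attr struct {
	Key, Val string
}

// TokenRenderer is a hypothetical interface that receives the document as a
// stream of HTML-level tokens instead of pre-rendered strings. A sanitizing
// implementation could wrap another TokenRenderer and drop or escape any
// tag or attribute that is not on a whitelist before forwarding the call.
type TokenRenderer interface {
	StartTag(name string, attrs []Attr)
	EndTag(name string)
	Text(data string)
}
```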

But that seems like a pretty invasive change in the code base; I think the HTML5 parser approach is safer and easier.

It will still not be completely safe due to different tolerance for deviations from the standard in different browsers. I.e. it will be possible for an attacker to construct peculiar inline HTML in a way that will not be recognized as HTML at all by go.net/html, but will be grokked by a real browser and be malicious.

Having said that, I would certainly be interested in looking at the code, enough talking :-)

Yes, you could have a massively misbehaving user agent, but then you're screwed anyway. All modern browsers follow the HTML5 parsing algorithm; I think in practice the approach is safe.


As discussed in #90, blackfriday itself is not going to provide HTML sanitization; we're leaving that to dedicated libraries. We might add a convenience function for that if needed, but it doesn't seem to be necessary ATM.

Seems fair, given that you have added a warning to the readme (thanks)!