mozilla / bleach

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes

Home Page:https://bleach.readthedocs.io/en/latest/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

How to be compatible with html5lib's sanitization?

sferencik opened this issue · comments

I've been using html5lib's native sanitization, but I'm migrating to Bleach, as recommended in https://github.com/html5lib/html5lib-python/blob/master/CHANGES.rst#11. I'm replacing my previous code:

html5lib.serialize(html5parser.HTMLParser().parseFragment("<u>hi</u>"), sanitize=True)
# -> <u>hi</u>

with this:

bleach.clean("<u>hi</u>")
# -> &lt;u&gt;hi&lt;/u&gt;

Notice how Bleach escapes my <u>hi</u> because the u-tag isn't on its very conservative tokenizer allow-list. I am aware that I could specify my own list of allowed tags, like so:

bleach.clean("<u>hi</u>", tags=MY_LIST_THAT_INCLUDES_U)
# -> <u>hi</u>

but how would I construct MY_LIST_THAT_INCLUDES_U such that it's backward-compatible with what html5lib's sanitizer allows (allowed me until now)? I think I'd need to get at the larger tag set in the vendored html5lib sanitizer. Is that what I'm supposed to do? Is there a neater way to achieve that? Has this come up as a use case previously?

If you're migrating from one system to another and want to maintain the current behavior of the system you're migrating from, you could copy the list of HTML tags that you believe are safe from html5lib into your code.

You could use the vendored code, but that's not part of Bleach's exposed API and we're not tracking changes for that.

I'm not aware of how many other people are currently migrating from html5lib sanitizer to Bleach. Maybe it's worth putting together a migration guide. Would you want to tackle something like that?

Yes, I could help with the migration guide. Would you want to host it in Bleach, or would you rather see it in html5lib? Would you be open to exposing a compatibility mode in Bleach?

Also, is there a way to not turn special characters into HTML entities? Like when "' becomes &#34;&#39;. That's another incompatibility; I haven't quite got to the bottom of it but I don't suppose there's a way to keep these as "'?

I don't work on html5lib anymore, so I think this would be a migration guide that lived in the Bleach docs.

I'm not interested in implemented or maintaining a compatibility mode.

I don't know what you mean by your last question. Can you provide some example code?

By my last question, I meant this:

bleach.clean("\"'")
# -> &#34;&#39; instead of "'

Can the caller opt out and keep the quotes?

What's the use case for not switching to character entities?

Admittedly, not a very strong one. Some regression tests have triggered, but I've already proposed to the test owners to rebase these. I agree this shouldn't be an issue in most rendering contexts - but I don't own these tests and I just wanted to check if perhaps there was a way to not escape the quotes. (This is not a feature request! Just wanted to know in case there was an easy way out.)

... and based on your response, I'm assuming there isn't, correct?

Bleach sanitizes text for use in an HTML context. Because of that, I think it really should be encoding characters like <, >, ', ", etc. I think it should be encoding those characters. I forget what the sanitizer did. If it sanitized entire documents, that'd be different because Bleach would know what was before and after and could be less strict.

Thanks, makes sense. (In your last sentence, did you mean to say "the sanitizer would know what was before and after"?)

The one thing left then is the migration guide we discussed above. I've created #625.

The sanitizer is the primary component of Bleach, so I use them interchangeably. Sorry about that.