How to be compatible with html5lib's sanitization?

Question

sferencik opened this issue 3 years ago

I've been using html5lib's native sanitization, but I'm migrating to Bleach, as recommended in https://github.com/html5lib/html5lib-python/blob/master/CHANGES.rst#11. I'm replacing my previous code:

html5lib.serialize(html5lib.HTMLParser().parseFragment("<u>hi</u>"), sanitize=True)
# -> <u>hi</u>

with this:

bleach.clean("<u>hi</u>")
# -> &lt;u&gt;hi&lt;/u&gt;

Notice how Bleach escapes my <u>hi</u> because the u tag isn't on its conservative default tag allow-list (bleach.sanitizer.ALLOWED_TAGS). I am aware that I could specify my own list of allowed tags, like so:

bleach.clean("<u>hi</u>", tags=MY_LIST_THAT_INCLUDES_U)
# -> <u>hi</u>

but how would I construct MY_LIST_THAT_INCLUDES_U so that it's backward-compatible with what html5lib's sanitizer has allowed until now? I think I'd need to get at the larger tag set in the vendored html5lib sanitizer. Is that what I'm supposed to do? Is there a neater way to achieve that? Has this use case come up before?

Will Kahn-Greene · Answer 1 · Mon Dec 20 2021 20:38:05 GMT+0800 (China Standard Time)

If you're migrating from one system to another and want to maintain the current behavior of the system you're migrating from, you could copy the list of HTML tags that you believe are safe from html5lib into your code.
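
A minimal sketch of that approach, assuming a hand-copied allow-list. `LEGACY_ALLOWED_TAGS` and `clean_legacy` are hypothetical names, and the tag set below is an abbreviated placeholder, not html5lib's actual list. `bleach.clean` and its `tags` parameter are the same API used in the question.

```python
# Hypothetical, abbreviated allow-list. Replace it with the tags you have
# reviewed and copied from html5lib's sanitizer; this is NOT the real list.
LEGACY_ALLOWED_TAGS = frozenset({
    "a", "b", "blockquote", "code", "em", "i", "li", "ol", "strong", "u", "ul",
})


def clean_legacy(text):
    """Sanitize text with Bleach using the project-owned allow-list."""
    # Third-party import, done lazily so the allow-list can be reused without it.
    import bleach

    return bleach.clean(text, tags=LEGACY_ALLOWED_TAGS)
```

Calling `clean_legacy("<u>hi</u>")` should then leave the u tag intact, as in the `tags=MY_LIST_THAT_INCLUDES_U` example in the question. Keeping the set in your own code means an upstream change cannot silently alter what you allow.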

You could use the vendored code, but that's not part of Bleach's exposed API and we're not tracking changes for that.
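
If html5lib is still installed during the migration, one possible way to produce that list once is to read it from html5lib's own (deprecated) sanitizer module rather than from Bleach's vendored copy. In this sketch `html_tag_names` and `dump_html5lib_tags` are hypothetical helpers, and it assumes `html5lib.filters.sanitizer.allowed_elements` is a collection of `(namespace, name)` pairs, as in html5lib 1.1.

```python
HTML_NS = "http://www.w3.org/1999/xhtml"  # namespace html5lib uses for HTML elements


def html_tag_names(allowed_elements):
    """Return the sorted HTML tag names from (namespace, name) pairs."""
    return sorted({name for ns, name in allowed_elements if ns == HTML_NS})


def dump_html5lib_tags():
    """Print html5lib's HTML allow-list so it can be reviewed and pasted into your code."""
    # Imported here because the module is deprecated and may warn on import.
    from html5lib.filters.sanitizer import allowed_elements

    for name in html_tag_names(allowed_elements):
        print(name)
```

Entries in other namespaces (for example SVG or MathML) are dropped by this filter; add them deliberately if your content needs them.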

I'm not aware of how many other people are currently migrating from the html5lib sanitizer to Bleach. Maybe it's worth putting together a migration guide. Would you want to tackle something like that?

Samuel Ferencik · Answer 2 · Mon Dec 20 2021 21:22:40 GMT+0800 (China Standard Time)

Yes, I could help with the migration guide. Would you want to host it in Bleach, or would you rather see it in html5lib? Would you be open to exposing a compatibility mode in Bleach?

Also, is there a way to not turn special characters into HTML entities? Like when "' becomes &#34;&#39;. That's another incompatibility; I haven't quite got to the bottom of it, but I don't suppose there's a way to keep these as "'?

Will Kahn-Greene · Answer 3 · Mon Dec 20 2021 21:35:26 GMT+0800 (China Standard Time)

I don't work on html5lib anymore, so I think this would be a migration guide that lived in the Bleach docs.

I'm not interested in implementing or maintaining a compatibility mode.

I don't know what you mean by your last question. Can you provide some example code?

Samuel Ferencik · Answer 4 · Mon Dec 20 2021 21:46:55 GMT+0800 (China Standard Time)

By my last question, I meant this:

bleach.clean("\"'")
# -> &#34;&#39; instead of "'

Can the caller opt out and keep the quotes?

Will Kahn-Greene · Answer 5 · Mon Dec 20 2021 23:13:11 GMT+0800 (China Standard Time)

What's the use case for not switching to character entities?

Samuel Ferencik · Answer 6 · Tue Dec 21 2021 00:01:46 GMT+0800 (China Standard Time)

Admittedly, not a very strong one. Some regression tests have triggered, but I've already proposed to the test owners to rebase these. I agree this shouldn't be an issue in most rendering contexts - but I don't own these tests and I just wanted to check if perhaps there was a way to not escape the quotes. (This is not a feature request! Just wanted to know in case there was an easy way out.)

... and based on your response, I'm assuming there isn't, correct?

Will Kahn-Greene · Answer 7 · Tue Dec 21 2021 03:02:43 GMT+0800 (China Standard Time)

Bleach sanitizes text for use in an HTML context. Because of that, I think it really should be encoding characters like <, >, ', ", etc. I forget what the sanitizer did. If it sanitized entire documents, that'd be different because Bleach would know what was before and after and could be less strict.
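
To illustrate why quotes matter when the surrounding context is unknown, here is a small sketch that uses the standard library's `html.escape` as a stand-in for Bleach (its entity spelling may differ from Bleach's). The `untrusted` value and the `<input>` template are made up for the example.

```python
import html

# Made-up attacker-controlled text that tries to break out of an attribute.
untrusted = '" onmouseover="alert(1)'

# Without escaping, the quote closes the attribute and injects a new one.
unsafe = '<input value="{}">'.format(untrusted)

# With quotes encoded, the whole string stays inside the value attribute.
safe = '<input value="{}">'.format(html.escape(untrusted, quote=True))

print(unsafe)
print(safe)
```

A fragment sanitizer cannot tell whether its output will land in element content or inside an attribute like this one, so encoding quotes is the conservative choice.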

Samuel Ferencik · Answer 8 · Thu Dec 30 2021 22:38:02 GMT+0800 (China Standard Time)

Thanks, makes sense. (In your last sentence, did you mean to say "the sanitizer would know what was before and after"?)

The one thing left then is the migration guide we discussed above. I've created #625.

Will Kahn-Greene · Answer 9 · Mon Jan 03 2022 21:49:44 GMT+0800 (China Standard Time)

The sanitizer is the primary component of Bleach, so I use them interchangeably. Sorry about that.