DMOJ / online-judge

A modern open-source online judge and contest platform system.

Home Page: https://dmoj.ca

Consider switching from lxml's clean_html for enhanced security (and possibly performance)

frenzymadness opened this issue · comments

I'd like to bring to your attention that we are discussing the possibility of removing the clean_html functionality from the lxml library. Over the past years, several concerning security vulnerabilities have been discovered in it: CVE-2021-43818, CVE-2021-28957, CVE-2020-27783, CVE-2018-19787 and CVE-2014-3146.

The main problem is in the design. Because lxml's clean_html functionality is based on a blocklist, it's hard to keep it up to date with all the new possibilities in HTML and JS.
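
To illustrate the design difference, here is a minimal sketch using only the standard library. It is not lxml's, bleach's or nh3's implementation, and it is not a production sanitizer. The names `BLOCKED_TAGS`, `ALLOWED_TAGS`, `TagFilter` and the `<newtag>` element are hypothetical, chosen only for this example.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical lists, for illustration only.
BLOCKED_TAGS = {"script", "iframe"}             # blocklist: must name every dangerous tag
ALLOWED_TAGS = {"b", "i", "em", "strong", "p"}  # allowlist: names only what is known to be safe


class TagFilter(HTMLParser):
    """Re-emit tags accepted by keep_tag, always dropping attributes."""

    def __init__(self, keep_tag):
        super().__init__(convert_charrefs=True)
        self.keep_tag = keep_tag
        self.out = []

    def handle_starttag(self, tag, attrs):
        if self.keep_tag(tag):
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if self.keep_tag(tag):
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always kept, re-escaped so it cannot turn into markup.
        self.out.append(escape(data))


def filter_html(html, keep_tag):
    parser = TagFilter(keep_tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def blocklist_filter(html):
    return filter_html(html, lambda tag: tag not in BLOCKED_TAGS)


def allowlist_filter(html):
    return filter_html(html, lambda tag: tag in ALLOWED_TAGS)


# <newtag> stands in for an element added to HTML after the lists were written.
sample = "<p>hi</p><newtag>x</newtag>"
print(blocklist_filter(sample))  # the unknown tag passes through
print(allowlist_filter(sample))  # the unknown tag is dropped, its text is kept
```

An element the blocklist has never heard of passes through unchanged, while the allowlist rejects it by default. That is why a blocklist needs an update for every new HTML feature and an allowlist does not.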

Two viable alternatives worth considering are bleach and nh3. Here's why:

bleach:

  • Bleach is a widely adopted Python library specifically designed for sanitizing and cleaning HTML input.
  • It has a strong track record in terms of security because it is allowlist-based.
  • It was deprecated in January, but it will still receive security updates, support for new Python versions and bugfixes; see the upstream issue.

nh3:

  • nh3 is a Python binding for the ammonia library. Ammonia is written in Rust and is also allowlist-based.
  • Thanks to the Rust backend, nh3 is also significantly faster than bleach.
  • The Rust backend is nothing to be afraid of. nh3 uses the latest PyO3, which is compatible with Python 3.12, and provides wheels built on top of a compatible ABI for different architectures and platforms.

We'll probably move the cleaning part of lxml to a separate project first, so it will still be possible to use it, but it is better to find a suitable alternative sooner rather than later.

Let me know if we can help you with this transition in any way, and have a nice day.

clean_html is only used in this migration from 2019:

https://github.com/DMOJ/online-judge/blame/79be4af4b542aadc4fe82d23e0af209d6fe8a23f/judge/migrations/0091_compiler_message_ansi2html.py#L4

Because migrations are run once, security is not a concern. However, the removal of clean_html could cause problems. @Ninjaclasher do you know any alternatives for this use of clean_html?

I don't think we need to worry about this too much. It's probably about time we squashed all our migrations again, which means we won't need this transition code.

If we do want to fix it, all we're trying to do is to get the text given some HTML. I think something like html2text looks promising.
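
For reference, plain text extraction can also be sketched with only the standard library. This is neither the migration's actual code nor html2text's API; `TextExtractor` and `html_to_text` are hypothetical names, and the sample markup is made up.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment and ignore all tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &#39; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Made-up input resembling colourised compiler output.
print(html_to_text('<span class="ansi31">error:</span> expected &#39;;&#39;'))
```

Unlike html2text, this sketch does no Markdown-style formatting; it only concatenates text nodes.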

Yes, this is not worth fixing. We will just squash migrations.