ArchiveBox / ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Home Page: https://archivebox.io


Architecture: Archived JS executes in a context shared with all other archived content (and the admin UI!)

s7x opened this issue · comments

commented

Describe the bug

Hi there!
There's an XSS vulnerability when you open your index.html if you saved a page with a title containing an XSS vector.

Steps to reproduce

  1. Save this page, for example: [Twitter post by @garethheyes](https://twitter.com/garethheyes/status/1126526480614416395)
  2. Open your index.html
  3. Get XSS'd by sir @garethheyes

Source code:

<a href="archive/1557816881/twitter.com/garethheyes/status/1126526480614416395.html" title="\u2028\u2029 op Twitter: "Another way to use throw without a semi-colon:
<script>{onerror=alert}throw 1</script>"">

Software versions

  • OS: ArchLinux
  • ArchiveBox version: 903.59da482-1
  • Python version: python3.7
  • Chrome version: Chromium 74.0.3729.131 Arch Linux

I'm aware of this already. The reason I haven't immediately locked it down is that archived pages can already run arbitrary JavaScript in the context of your archive domain, so there's not much we can do to protect against that attack vector without breaking interactivity across all archived pages. If I add bleach-style XSS stripping to titles it will make the index page less likely to break from a UX perspective, but it doesn't make anything more secure, because archived pages can just request the index page directly using JavaScript at any time.
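For concreteness, the kind of title escaping discussed above could look something like this sketch (a hypothetical helper using Python's stdlib `html.escape`, not ArchiveBox's actual template code). As noted, this is UX hardening, not a security boundary:

```python
import html

def render_index_row(snapshot_url, title):
    """Hypothetical index-row renderer, for illustration only.

    html.escape with quote=True escapes <, >, &, ", and ', so a
    malicious title can no longer break out of the title="..."
    attribute the way the payload above does.
    """
    safe_url = html.escape(snapshot_url, quote=True)
    safe_title = html.escape(title, quote=True)
    return f'<a href="{safe_url}" title="{safe_title}">{safe_title}</a>'
```
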

v0.4 is going to add some security headers that will make it more difficult for pages to use JS to access other archived pages, but it's never going to be perfect unless we have each archived page stored on its own domain.

I'm having long conversations with several people this week about the security model of ArchiveBox. It's a difficult problem, but I think we'll have to end up disabling all JavaScript in the static HTML archives and only allowing proxy replay of WARCs for people who want interactivity preserved. I'm also going to move all the filesystem stuff into hash-bucketed folders to discourage people from opening the saved HTML files directly instead of accessing them via nginx or the Django webserver, as allowing archived JS to have filesystem access is disastrously bad security.
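The thread doesn't specify the actual bucketing scheme, but a hash-bucketed layout along those lines might look like this sketch (names and depth are assumptions):

```python
import hashlib
from pathlib import Path

def bucketed_path(archive_root, timestamp):
    """Illustrative two-level hash-bucket layout for snapshot dirs,
    e.g. archive/<h[:2]>/<h[2:4]>/<timestamp>/. Burying snapshots a
    few levels deep discourages opening the saved HTML straight from
    the filesystem (file://) instead of through the web server.
    """
    digest = hashlib.sha256(timestamp.encode()).hexdigest()
    return Path(archive_root) / digest[:2] / digest[2:4] / timestamp
```
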

Because this is fairly serious, I've temporarily struck out the instructions for running ArchiveBox with private data: https://github.com/pirate/ArchiveBox/wiki/Security-Overview#important-dont-use-archivebox-for-private-archived-content-right-now-as-were-in-the-middle-of-resolving-some-security-issues-with-how-js-is-executed-in-archived-content

Unfortunately my day job is getting super busy right now, so I don't know how soon I can change the design (fixing this is a big architectural change), but I think I might add a notice to the README as well to warn people that, in the current state, running archived pages can leak the index and content. The primary use case is archiving public pages and feeds, so it's not as bad as if private session archiving were the default, but I don't want to give users a false sense of security, so we should definitely be transparent about the risks.

commented

Hi @pirate! I know the issue is not that critical when you're using ArchiveBox only locally (like I do), because you're aware (supposedly, at least) of what you're doing when you save pages and such, but I still think some people would be happy to know there's no random JS popping up in their hoarding box :)

Thanks for your time & consideration. And for sure, thanks for this awesome tool.

Cheers!

Why does this only affect title? Is it possible that this XSS opportunity exists elsewhere?

@andrewzigerelli see my comment above. A primary goal of ArchiveBox is to preserve JS and interactivity in archived pages, but that means pages necessarily have to be able to execute their own arbitrary JS.

XSS-stripping titles or any of the other little metadata fields is like putting up a small picket fence to try and stop a tsunami. Why would an attacker bother stuffing an XSS payload into page titles when they can just put JS on the page directly, knowing it will be executed by ArchiveBox users on a domain shared with the index and all the other pages? (The whole traditional browser security model breaks down here: the invisible wall that stops xxxhacker.com from accessing your data on facebook.com is the same-origin policy, i.e. the fact that they live on different domains, but all archived pages are served from the same domain.)

archived pages can already run arbitrary Javascript in the context of your archive domain, so there's not much we can do to protect against that attack vector without breaking interactivity across all archived pages

archived pages can just request the index page at any time directly using JavaScript

Idea h/t to @FiloSottile for the encouragement; this is similar to how Wikimedia and many other services do it:

  • serve all "dirty" archived content from one port, e.g. 9595, including the static archive/<timestamp>/index.html indexes, archived content with live JS, and anything else that could be dangerous
  • serve the Django admin interface from 9594; the login screen, the ability to add new snapshots, remove URLs, etc. should not be on the same origin as the risky archived content

These can be mapped to separate domains/ports by the user (subdomains are maybe still dangerous; full separate domains are likely required), but this will require adding some new config options to tune which port/domain the admin and the dirty content listen on, e.g.:
HTTP_DIRTY_LISTEN=https://demousercontent.archivebox.io
HTTP_ADMIN_LISTEN=https://demo.archivebox.io
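A minimal sketch of how the two-origin split could be enforced on the admin side (the option names are the hypothetical ones proposed above; the shipped config keys may differ):

```python
import os

# Hypothetical option names from the proposal above.
DIRTY_ORIGIN = os.environ.get("HTTP_DIRTY_LISTEN", "http://127.0.0.1:9595")
ADMIN_ORIGIN = os.environ.get("HTTP_ADMIN_LISTEN", "http://127.0.0.1:9594")

def admin_request_allowed(origin_header):
    """The admin app should reject state-changing requests whose Origin
    header isn't the admin origin itself -- in particular, requests
    originating from the dirty-content origin, where archived JS runs.
    """
    return origin_header == ADMIN_ORIGIN
```
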

This would close a pretty crucial security hole where archived content can mess with the execution of extractors (and potentially run arbitrary shell scripts by chaining together a series of injection attacks).

Semi-related, using sandboxed iframes for replay: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Sec-Fetch-Mode

Extractor methods that replay JS:

  • wget
  • dom

Proposed behavior:

  • if dirty content is loaded from within an iframe (with sandbox protections): allow JS, because the iframe sandbox protects us (verify this first)
  • if dirty content is loaded outside an iframe (e.g. if someone visits the URL directly): serve strict CSP/CORS headers to prevent JS execution entirely
  • prevent right-clicking the iframe to grab the unsafe URL and open it in a new tab directly? Or detect server-side when a dirty URL is visited outside an iframe and block it?
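The server-side detection mentioned above could lean on the browser-sent Sec-Fetch-Dest request header, which is "iframe" for iframe embeds and "document" for direct navigation. A sketch, with the exact CSP values being assumptions:

```python
from typing import Optional

def replay_csp(sec_fetch_dest: Optional[str]) -> str:
    """Pick a Content-Security-Policy for replayed dirty content based
    on the Sec-Fetch-Dest request header. Illustrative only.
    """
    if sec_fetch_dest == "iframe":
        # Embedded replay: the wrapper page's <iframe sandbox> contains
        # the archived JS, so script execution can be allowed.
        return "sandbox allow-scripts"
    # Direct visit (or a browser that doesn't send Sec-Fetch-Dest at
    # all): serve the page with scripting disabled entirely.
    return "sandbox; script-src 'none'"
```
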

Config option to enable bypassing the sandboxing:

  • DANGER_ALLOW_BYPASSING_SANDBOX=True/False
  • once ^ is enabled, a checkbox appears on a per-snapshot basis that allows disabling the iframe/CSP sandbox protections when replaying that snapshot

I talked about the ArchiveBox scenario with a couple of experts, and we came up with a better option than <iframe sandbox>: Content-Security-Policy: sandbox, which instructs the browser to treat the load as its own unique origin.

This is much more robust and convenient than detecting iframe loads.

We also went through the list of security headers to pick the ones that would protect ArchiveBox pages from Spectre, too. They should involve no maintenance.

Content-Security-Policy: sandbox
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
Cross-Origin-Resource-Policy: same-origin (for HTML) / cross-origin (for everything else)
Vary: Sec-Fetch-Site
X-Content-Type-Options: nosniff
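A Django-style middleware sketch applying that header set to every response from the archive origin (the wiring is assumed; this is not ArchiveBox's actual server code):

```python
def security_headers_middleware(get_response):
    """Apply the header set above to all responses. Follows Django's
    documented function-based middleware shape; illustration only.
    """
    def middleware(request):
        response = get_response(request)
        is_html = response.get("Content-Type", "").startswith("text/html")
        response["Content-Security-Policy"] = "sandbox"
        response["Cross-Origin-Opener-Policy"] = "same-origin"
        response["Cross-Origin-Embedder-Policy"] = "require-corp"
        # HTML documents get same-origin; subresources (images, media)
        # stay loadable cross-origin under COEP: require-corp.
        response["Cross-Origin-Resource-Policy"] = (
            "same-origin" if is_html else "cross-origin"
        )
        response["Vary"] = "Sec-Fetch-Site"
        response["X-Content-Type-Options"] = "nosniff"
        return response
    return middleware
```
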

On top of that, it would still be a good idea to have the admin API on a different origin (a different subdomain is enough), and make its cookie SameSite=Strict.

This should stop any cross-contamination between archived pages, but it won't stop them from detecting other archived pages. That might be possible, but it will require more complex server logic.

Hi! Sorry to post on such an old issue, just wondering if this is going to be implemented? Would love to be able to use WARC instead of SingleFile