apostrophecms / sanitize-html

Clean up user-submitted HTML, preserving whitelisted elements and whitelisted attributes on a per-element basis. Built on htmlparser2 for speed and tolerance

Cannot use protocol-relative URL in script src attribute

ronosm opened this issue · comments

It's not possible to use a protocol-relative URL in the src attribute of a script tag.

The naughtyHref function handles this case, but afterwards this code is executed:

const parsed = new URL(value);

This code throws an exception when a protocol-relative URL is used, and the catch block then sets allowed to false:

} catch (e) {
allowed = false;
}

As a result, the src attribute is removed from the script tag.

This happens between lines 325 and 355.
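
For illustration, the failure can be reproduced outside the library with a minimal sketch. The helper name canParseWithoutBase is hypothetical and only mirrors the try/catch pattern quoted above; it is not the library's actual code.

```javascript
// Hypothetical helper: mirrors the quoted try/catch, not sanitize-html itself.
function canParseWithoutBase(value) {
  try {
    // The WHATWG URL constructor needs an absolute URL when no base is given.
    new URL(value);
    return true;
  } catch (e) {
    // A protocol-relative value such as '//example.com/a.js' lands here.
    return false;
  }
}

console.log(canParseWithoutBase('https://example.com/a.js'));
console.log(canParseWithoutBase('//example.com/a.js'));
```

The second call takes the catch path, which is the path that ends up marking the attribute as not allowed.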

+1

here:

https://github.com/apostrophecms/sanitize-html/blob/3cdc262/index.js#L333-L338

  • there's some discussion over "protocol-relative" URLs and apparently they're considered an anti-pattern.
  • but surely referencing an https:// resource from http:// origin creates CORS issues.
  • these URLs are supported in the browser and legacy HTML content is full of them

while new URL('//my.url') breaks without protocol, i don't think it should be up to sanitize-html to enforce the scheme of URLs when no allowedScriptHostnames is defined?

i.e. const parsed = new URL(value); should move inside the if block?

https://github.com/apostrophecms/sanitize-html/blob/3cdc262/index.js#L340-L348

edit: that probably applies to iframe src too
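
A rough sketch of what moving the parse inside the if block could look like. The function name isScriptSrcAllowed and its signature are assumptions made for illustration; only allowedScriptHostnames comes from the discussion above, and the real code in index.js is structured differently.

```javascript
// Hypothetical stand-alone version of the check, not the library's code.
function isScriptSrcAllowed(value, allowedScriptHostnames) {
  // No allow-list configured: skip URL parsing entirely, so
  // protocol-relative URLs are left alone.
  if (!allowedScriptHostnames) {
    return true;
  }
  try {
    // Only parse when there is a hostname list to compare against.
    const parsed = new URL(value);
    return allowedScriptHostnames.includes(parsed.hostname);
  } catch (e) {
    return false;
  }
}

console.log(isScriptSrcAllowed('//cdn.example.com/a.js', undefined));
console.log(isScriptSrcAllowed('https://cdn.example.com/a.js', ['cdn.example.com']));
```

Note that with an allow-list configured, a protocol-relative value would still be rejected in this sketch, because new URL still throws on it.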

tag @yorickgirard

Is this happening in the browser only? I believe the WHATWG URL parser in nodejs supports it.

A small subclass of the URL class to work around this issue probably wouldn't be difficult to contribute as a PR. The idea being to stub in the https: protocol but then stub it out again in toString if it was stubbed in.

(As a PR on this module that is. Modifying URL upstream is unrealistic of course.)
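
A minimal sketch of that subclass idea, assuming the class name TolerantURL and the protocolRelative flag (both hypothetical, not part of sanitize-html or Node):

```javascript
// Hypothetical wrapper: stubs in https: for protocol-relative input
// and strips it again in toString().
class TolerantURL extends URL {
  constructor(value) {
    const input = String(value);
    const protocolRelative = input.startsWith('//');
    // Stub in a scheme so the WHATWG parser accepts the value.
    super(protocolRelative ? 'https:' + input : input);
    // Remember whether the scheme was stubbed in.
    this.protocolRelative = protocolRelative;
  }

  toString() {
    const href = super.toString();
    // Stub the scheme out again if it was not in the original input.
    return this.protocolRelative ? href.slice('https:'.length) : href;
  }
}

const parsed = new TolerantURL('//cdn.example.com/lib.js');
console.log(parsed.hostname);
console.log(parsed.toString());
```

The hostname is available for allow-list checks, while toString() returns the value without the stubbed scheme. Other accessors such as href and protocol would still report https: in this sketch.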

@boutell , I guess what I'm saying is that in the absence of explicit allow-list of domains, it shouldn't even try to police URLs.

$ node
Welcome to Node.js v16.14.0.
Type ".help" for more information.

> new URL('//google.com')
Uncaught TypeError [ERR_INVALID_URL]: Invalid URL
    at __node_internal_captureLargerStackTrace (node:internal/errors:464:5)
    at new NodeError (node:internal/errors:371:5)
    at onParseError (node:internal/url:552:9)
    at new URL (node:internal/url:628:5) {
  input: '//google.com',
  code: 'ERR_INVALID_URL'
}

> new URL('ftp://google.com');
URL {
  href: 'ftp://google.com/',
  origin: 'ftp://google.com',
  protocol: 'ftp:',
  username: '',
  password: '',
  host: 'google.com',
  hostname: 'google.com',
  port: '',
  pathname: '/',
  search: '',
  searchParams: URLSearchParams {},
  hash: ''
}
> 

I don't think it would be a good idea to accept invalid URLs, but I agree that protocol relative URLs should not be considered invalid, at least by default, at least not yet. This is why I'm suggesting using a subclass wrapper for URL that accepts that particular case without reimplementing URL.

it's just that till a few updates ago, sanitize-html didn't have this issue, right?

and in the browser, where the sanitized html ends up, protocol-relative URLs are not invalid.
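
To make the difference concrete: a browser resolves a protocol-relative URL against the page's own URL, and the same thing can be reproduced with the optional base argument of the URL constructor. The page URLs below are made up for illustration.

```javascript
// Assumed page URLs, for illustration only.
const fromHttpPage = new URL('//cdn.example.com/lib.js', 'http://example.org/page');
const fromHttpsPage = new URL('//cdn.example.com/lib.js', 'https://example.org/page');

// The scheme is inherited from the base URL.
console.log(fromHttpPage.href);
console.log(fromHttpsPage.href);
```

So the value only fails to parse when there is no base to inherit the scheme from, which is the situation inside the sanitizer.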

src has always been passed through naughtyHref no matter what the tag is. I think what we're seeing is that Node 16 now has the same strict policy on protocol-relative URLs that is enforced by Safari. A more tolerant subclass of URL would resolve it.