aio-libs / yarl

Yet another URL library

Home Page:https://yarl.aio-libs.org

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Non-ASCII hostnames are not validated and so allow "@" and other reserved gen-delims characters

mjpieters opened this issue · comments

Describe the bug

For #880 we fixed handling of ASCII hostnames, rejecting hostnames that contain characters or sequences that are explicitly excluded (see RFC3986, section 3.2.2, the reg-name grammar rule.

When implementing the PR for this I had tested idna.encode(host, uts46=True) and verified that it correctly rejects hostnames that use ASCII characters outside of the reg-name rule, not realising that _idna_encode() catches the exception raised for this and then falls back to host.encode('idna'), which doesn't reject such hostnames.

The exception handling is there to allow for IDNA 2003 / 2008 compatibility (see #152). I suspect that we need to use idna.encode(host) instead of host.encode('idna') here to ensure that invalid characters are still rejected.

Because there is no validation it is trivial to create invalid URLs, e.g. by passing in a non-ascii authority value to host plus a user or password value.

To Reproduce

>>> import yarl
>>> yarl.URL.build(scheme="http", host="user:pass@историк.рф", user="therealusername")
URL('http://therealusername@xn--user:pass@-rtia0a8c4aor.xn--p1ai')

Expected behavior

The user:pass@историк.рф host value contains invalid characters, : and @ and so a ValueError should have been raised.

I suspect that we need to use idna.encode(host) instead of host.encode('idna') here to ensure that invalid characters are still rejected.

Nope, that's not correct. Instead, we should perhaps just reject IDNA 2003 support? I'm not versed in IDNA RFCs so it'll take me some time to figure this out.

@asvetlov: can you weigh in on this? Why do we need to be able to encode host strings using IDNA 2003 if IDNA 2008 encoding fails?

I have now found and read the Further Notes section of the Python idna project. I think that the IDNA encoder is implementing the recommendation there:

For now, applications that need to support these non-compliant labels may wish to consider trying the encode/decode operation in this library first, and then falling back to using encodings.idna

If this is the reason, then we'll need to explicitly handle non-compiant characters in the host separately.

Alternatively, we could decide to reject non-IDNA 2008 hostnames.

The user:pass@историк.рф host

Oh boi. We should replace that with something adequate. It's a resource praising the r*ssian colonialism and related imperialist propaganda 🤮. I opened it just to make sure and immediately got a panic attack that will take days to recover from :(

These problems are probably why requests appear to have just rejected it:
psf/requests#5845

Although, as that issue states, several TLDs choose not to enforce IDNA 2008 for their domain registries (and anybody could used them for subdomains), so it's not exactly forbidden globally either.