aio-libs / yarl

Yet another URL library

Home Page: https://yarl.aio-libs.org

IP address test regressions (Python 3.11.4, 3.12.0b1)

mgorny opened this issue

Describe the bug

When running the test suite under Python 3.12, a few tests fail:

FAILED tests/test_url.py::test_ipv6_zone - ValueError: 'fe80::822a:a8ff:fe49:470c%тест%42' does not appear to be an IPv4 or IPv6 address
FAILED tests/test_url.py::test_human_repr_delimiters - ValueError: '\\' does not appear to be an IPv4 or IPv6 address
FAILED tests/test_url_parsing.py::TestHost::test_masked_ipv4 - ValueError: An IPv4 address cannot be in brackets
FAILED tests/test_url_parsing.py::TestHost::test_strange_ip - ValueError: '-1' does not appear to be an IPv4 or IPv6 address
FAILED tests/test_url_parsing.py::TestUserInfo::test_weird_user3 - ValueError: 'some' does not appear to be an IPv4 or IPv6 address

(full traceback below)

To Reproduce

Run python3.12 -m pytest ;-).

Expected behavior

Passing tests ;-).

Logs/tracebacks

============================================================== FAILURES ===============================================================
___________________________________________________________ test_ipv6_zone ____________________________________________________________

    def test_ipv6_zone():
>       url = URL("http://[fe80::822a:a8ff:fe49:470c%тест%42]:123")

tests/test_url.py:239: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.venv/lib/python3.12/site-packages/yarl/_url.py:172: in __new__
    val = urlsplit(val)
/usr/lib/python3.12/urllib/parse.py:500: in urlsplit
    _check_bracketed_host(bracketed_host)
/usr/lib/python3.12/urllib/parse.py:446: in _check_bracketed_host
    ip = ipaddress.ip_address(hostname) # Throws Value Error if not IPv6 or IPv4
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

address = 'fe80::822a:a8ff:fe49:470c%тест%42'

    def ip_address(address):
        """Take an IP string/int and return an object of the correct type.
    
        Args:
            address: A string or integer, the IP address.  Either IPv4 or
              IPv6 addresses may be supplied; integers less than 2**32 will
              be considered to be IPv4 by default.
    
        Returns:
            An IPv4Address or IPv6Address object.
    
        Raises:
            ValueError: if the *address* passed isn't either a v4 or a v6
              address
    
        """
        try:
            return IPv4Address(address)
        except (AddressValueError, NetmaskValueError):
            pass
    
        try:
            return IPv6Address(address)
        except (AddressValueError, NetmaskValueError):
            pass
    
>       raise ValueError(f'{address!r} does not appear to be an IPv4 or IPv6 address')
E       ValueError: 'fe80::822a:a8ff:fe49:470c%тест%42' does not appear to be an IPv4 or IPv6 address

/usr/lib/python3.12/ipaddress.py:54: ValueError
_____________________________________________________ test_human_repr_delimiters ______________________________________________________

    def test_human_repr_delimiters():
        url = URL.build(
            scheme="http",
            user=" !\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~",
            password=" !\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~",
            host="хост.домен",
            port=8080,
            path="/ !\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~",
            query={
                " !\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~": " !\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
            },
            fragment=" !\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~",
        )
        s = url.human_repr()
>       assert URL(s) == url

tests/test_url.py:1630: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.venv/lib/python3.12/site-packages/yarl/_url.py:172: in __new__
    val = urlsplit(val)
/usr/lib/python3.12/urllib/parse.py:500: in urlsplit
    _check_bracketed_host(bracketed_host)
/usr/lib/python3.12/urllib/parse.py:446: in _check_bracketed_host
    ip = ipaddress.ip_address(hostname) # Throws Value Error if not IPv6 or IPv4
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

address = '\\'

    def ip_address(address):
        """Take an IP string/int and return an object of the correct type.
    
        Args:
            address: A string or integer, the IP address.  Either IPv4 or
              IPv6 addresses may be supplied; integers less than 2**32 will
              be considered to be IPv4 by default.
    
        Returns:
            An IPv4Address or IPv6Address object.
    
        Raises:
            ValueError: if the *address* passed isn't either a v4 or a v6
              address
    
        """
        try:
            return IPv4Address(address)
        except (AddressValueError, NetmaskValueError):
            pass
    
        try:
            return IPv6Address(address)
        except (AddressValueError, NetmaskValueError):
            pass
    
>       raise ValueError(f'{address!r} does not appear to be an IPv4 or IPv6 address')
E       ValueError: '\\' does not appear to be an IPv4 or IPv6 address

/usr/lib/python3.12/ipaddress.py:54: ValueError
______________________________________________________ TestHost.test_masked_ipv4 ______________________________________________________

self = <test_url_parsing.TestHost object at 0x7f36ce6d2510>

    def test_masked_ipv4(self):
>       u = URL("//[127.0.0.1]/")

tests/test_url_parsing.py:182: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.venv/lib/python3.12/site-packages/yarl/_url.py:172: in __new__
    val = urlsplit(val)
/usr/lib/python3.12/urllib/parse.py:500: in urlsplit
    _check_bracketed_host(bracketed_host)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

hostname = '127.0.0.1'

    def _check_bracketed_host(hostname):
        if hostname.startswith('v'):
            if not re.match(r"\Av[a-fA-F0-9]+\..+\Z", hostname):
                raise ValueError(f"IPvFuture address is invalid")
        else:
            ip = ipaddress.ip_address(hostname) # Throws Value Error if not IPv6 or IPv4
            if isinstance(ip, ipaddress.IPv4Address):
>               raise ValueError(f"An IPv4 address cannot be in brackets")
E               ValueError: An IPv4 address cannot be in brackets

/usr/lib/python3.12/urllib/parse.py:448: ValueError
______________________________________________________ TestHost.test_strange_ip _______________________________________________________

self = <test_url_parsing.TestHost object at 0x7f36ce6d2750>

    def test_strange_ip(self):
>       u = URL("//[-1]/")

tests/test_url_parsing.py:198: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.venv/lib/python3.12/site-packages/yarl/_url.py:172: in __new__
    val = urlsplit(val)
/usr/lib/python3.12/urllib/parse.py:500: in urlsplit
    _check_bracketed_host(bracketed_host)
/usr/lib/python3.12/urllib/parse.py:446: in _check_bracketed_host
    ip = ipaddress.ip_address(hostname) # Throws Value Error if not IPv6 or IPv4
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

address = '-1'

    def ip_address(address):
        """Take an IP string/int and return an object of the correct type.
    
        Args:
            address: A string or integer, the IP address.  Either IPv4 or
              IPv6 addresses may be supplied; integers less than 2**32 will
              be considered to be IPv4 by default.
    
        Returns:
            An IPv4Address or IPv6Address object.
    
        Raises:
            ValueError: if the *address* passed isn't either a v4 or a v6
              address
    
        """
        try:
            return IPv4Address(address)
        except (AddressValueError, NetmaskValueError):
            pass
    
        try:
            return IPv6Address(address)
        except (AddressValueError, NetmaskValueError):
            pass
    
>       raise ValueError(f'{address!r} does not appear to be an IPv4 or IPv6 address')
E       ValueError: '-1' does not appear to be an IPv4 or IPv6 address

/usr/lib/python3.12/ipaddress.py:54: ValueError
____________________________________________________ TestUserInfo.test_weird_user3 ____________________________________________________

self = <test_url_parsing.TestUserInfo object at 0x7f36ce6d31d0>

    def test_weird_user3(self):
>       u = URL("//[some]@host")

tests/test_url_parsing.py:323: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.venv/lib/python3.12/site-packages/yarl/_url.py:172: in __new__
    val = urlsplit(val)
/usr/lib/python3.12/urllib/parse.py:500: in urlsplit
    _check_bracketed_host(bracketed_host)
/usr/lib/python3.12/urllib/parse.py:446: in _check_bracketed_host
    ip = ipaddress.ip_address(hostname) # Throws Value Error if not IPv6 or IPv4
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

address = 'some'

    def ip_address(address):
        """Take an IP string/int and return an object of the correct type.
    
        Args:
            address: A string or integer, the IP address.  Either IPv4 or
              IPv6 addresses may be supplied; integers less than 2**32 will
              be considered to be IPv4 by default.
    
        Returns:
            An IPv4Address or IPv6Address object.
    
        Raises:
            ValueError: if the *address* passed isn't either a v4 or a v6
              address
    
        """
        try:
            return IPv4Address(address)
        except (AddressValueError, NetmaskValueError):
            pass
    
        try:
            return IPv6Address(address)
        except (AddressValueError, NetmaskValueError):
            pass
    
>       raise ValueError(f'{address!r} does not appear to be an IPv4 or IPv6 address')
E       ValueError: 'some' does not appear to be an IPv4 or IPv6 address

/usr/lib/python3.12/ipaddress.py:54: ValueError

----------- coverage: platform linux, python 3.12.0-beta-1 -----------
Name                                                     Stmts   Miss Branch BrPart  Cover
------------------------------------------------------------------------------------------
.venv/lib/python3.12/site-packages/yarl/__init__.py          3      0      0      0   100%
.venv/lib/python3.12/site-packages/yarl/_quoting.py         10      2      4      1    79%
.venv/lib/python3.12/site-packages/yarl/_quoting_py.py     155      0     68      0   100%
.venv/lib/python3.12/site-packages/yarl/_url.py            604      2    364      0    99%
------------------------------------------------------------------------------------------
TOTAL                                                      772      4    436      1    99%

======================================================= short test summary info =======================================================
FAILED tests/test_url.py::test_ipv6_zone - ValueError: 'fe80::822a:a8ff:fe49:470c%тест%42' does not appear to be an IPv4 or IPv6 address
FAILED tests/test_url.py::test_human_repr_delimiters - ValueError: '\\' does not appear to be an IPv4 or IPv6 address
FAILED tests/test_url_parsing.py::TestHost::test_masked_ipv4 - ValueError: An IPv4 address cannot be in brackets
FAILED tests/test_url_parsing.py::TestHost::test_strange_ip - ValueError: '-1' does not appear to be an IPv4 or IPv6 address
FAILED tests/test_url_parsing.py::TestUserInfo::test_weird_user3 - ValueError: 'some' does not appear to be an IPv4 or IPv6 address
============================================== 5 failed, 1095 passed, 2 xfailed in 5.28s ==============================================

Python Version

$ python --version
Python 3.12.0b1

multidict Version

$ python -m pip show multidict
Name: multidict
Version: 6.0.4
Summary: multidict implementation
Home-page: https://github.com/aio-libs/multidict
Author: Andrew Svetlov
Author-email: andrew.svetlov@gmail.com
License: Apache 2
Location: /tmp/yarl/.venv/lib/python3.12/site-packages
Requires: 
Required-by: yarl

yarl Version

$ python -m pip show yarl
Name: yarl
Version: 1.9.2
Summary: Yet another URL library
Home-page: https://github.com/aio-libs/yarl/
Author: Andrew Svetlov
Author-email: andrew.svetlov@gmail.com
License: Apache-2.0
Location: /tmp/yarl/.venv/lib/python3.12/site-packages
Requires: idna, multidict
Required-by:

OS

Gentoo Linux amd64

Additional context

Confirmed on 723a5ba.

I'll be tackling these as part of #881. Note that these regressions will also affect the next 3.11 release as these changes are part of a bug fix that also lands in the 3.11 branch.

Actually, I'll not tackle this as part of #881, as this is not just a 3.12 regression. 3.11.4 was slated to be released yesterday; I'm assuming the release is imminent, after which the same errors will start cropping up in the 3.11 tests.

We'll need to validate each of the IP addresses that are now causing issues. For example, the test_ipv6_zone test is indeed using an invalid IPv6 address: the zone scope identifier can't have a % character in the name (so fe80::822a:a8ff:fe49:470c%тест is fine, fe80::822a:a8ff:fe49:470c%тест%42 is not).
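The difference between the two zone strings can be checked directly against the standard library. This is a minimal sketch, assuming Python 3.9 or later (where ipaddress exposes scope_id); try_parse is a hypothetical helper name, not part of yarl or the stdlib.

```python
import ipaddress


def try_parse(address):
    """Return the parsed address object, or None if ipaddress rejects it."""
    try:
        return ipaddress.ip_address(address)
    except ValueError:
        return None


# A single '%' separates the address from the zone id: accepted.
single = try_parse("fe80::822a:a8ff:fe49:470c%тест")

# A second '%' inside the zone id conflicts with the delimiter: rejected.
double = try_parse("fe80::822a:a8ff:fe49:470c%тест%42")

print(single.scope_id if single else None)
print(double)
```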

FWIW, the WHATWG URL specification intentionally omits IPv6 zone support, as zones are host-specific. I'm not saying we should outright reject IPv6 addresses with a zone scope identifier, but I don't think we need to go out of our way to make all possible scope labels work, either.

Well, in case you are looking for my input, I'm afraid I can't help much. I'm merely in the role of a packager (for Gentoo) here, and I don't know anything about yarl, besides the fact that some other packages need it.

Invalid tests

Of the failing tests, the following look to be invalid to me:

  • test_url_parsing.py::TestUserInfo::test_weird_user3 uses [...] square brackets in the username portion of the URL, but the [ character must be percent encoded: URL('//%5Bsome%5D@host').
  • test_url_parsing.py::TestHost::test_masked_ipv4: brackets around 127.0.0.1 are invalid; [ and ] can only ever be used in hosts that are valid IPv6 (or IPvFuture) addresses. At best the brackets would need to be URL encoded. I think the test should just be dropped: there is no such thing as a 'masked IPv4 address', and //[127.0.0.1]/ is not equivalent to //127.0.0.1/.
  • test_url_parsing.py::TestHost::test_strange_ip: same issue; [-1] is not a valid hostname because it uses [ and ] without an IPv6 address in between. This test too should just be dropped. The expected outcome is a host value of -1, so //-1/ should have been used.
  • test_url.py::test_ipv6_zone: the validity of the IPv6 address is ambiguous; see below for what I found when I researched this. The bottom line is that the use of %42 in the zone scope in fe80::822a:a8ff:fe49:470c%тест%42 makes this address unparseable, and I don't think we even need to support IPv6 zone ids in URLs in the first place.
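To make the list above concrete, here is a rough sketch that re-implements the bracketed-host check quoted in the traceback as a local function, so it behaves the same on any Python version. check_bracketed_host and is_valid_bracketed are local, illustrative names; the real check lives in urllib.parse and may differ between releases. The last line shows the percent-encoded form of the username from test_weird_user3.

```python
import ipaddress
import re
from urllib.parse import quote


def check_bracketed_host(hostname):
    """Local copy of the check shown in the traceback; raises ValueError."""
    if hostname.startswith("v"):
        # IPvFuture: 'v' + hex version + '.' + at least one more character.
        if not re.match(r"\Av[a-fA-F0-9]+\..+\Z", hostname):
            raise ValueError("IPvFuture address is invalid")
    else:
        ip = ipaddress.ip_address(hostname)  # ValueError if not an IP address
        if isinstance(ip, ipaddress.IPv4Address):
            raise ValueError("An IPv4 address cannot be in brackets")


def is_valid_bracketed(hostname):
    """True when the text between '[' and ']' passes the check above."""
    try:
        check_bracketed_host(hostname)
    except ValueError:
        return False
    return True


# Hosts taken from the failing tests, plus two that are expected to pass.
results = {
    host: is_valid_bracketed(host)
    for host in ("some", "127.0.0.1", "-1", "::1", "v1.fe80")
}

# Square brackets in the userinfo part have to be percent-encoded instead.
encoded_user = quote("[some]", safe="")
```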

IPv6 Zone ID strings in IPv6 addresses inside URLs

test_url.py::test_ipv6_zone uses an IPv6 address with an explicit zone identifier, as per RFC 4007: [fe80::822a:a8ff:fe49:470c%тест%42].

However, the zone id portion of this string is, at best, ambiguous, and at worst, just invalid. The zone id is either тест%42 (containing a % character) or was expected to be percent decoded to тестB, but in the latter case I'd have expected the first % character to be encoded to %25 rather than be included literally. Either way, the zone_id label portion conflicts with existing standards:

RFC 4007: "An implementation MAY support other kinds of non-null strings as <zone_id>. However, the strings must not conflict with the delimiter character."

(bold emphasis mine). This makes the тест%42 value invalid, and this is why the ipaddress.ip_address('fe80::822a:a8ff:fe49:470c%тест%42') call raises an exception.

The WHATWG URL specification, in turn, notes: "Support for <zone_id> is intentionally omitted."

It is intentionally omitted because zone ids are local to the current machine, and because of the implementation-dependent nature of the id string.

As a result, the WHATWG URL host parser description does not apply percent decoding to the (expected) IPv6 string found between square brackets.

  • A draft RFC for representing IPv6 addresses in URIs states that the % character delimiting the IPv6 address and the zone ID should be encoded to %25, which the test value doesn't do. If the intention was to have %42 be percent-decoded, then the correct host string would be [fe80::822a:a8ff:fe49:470c%25тест%42]. Since urllib.parse.urlsplit() won't percent decode this, we'd have to special-case IPv6 addresses before calling urlsplit().
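As an illustration of what such special-casing would involve, the sketch below percent-decodes the %25-delimited host string by hand before passing it to ipaddress. It assumes Python 3.9 or later for scope_id and is not something yarl currently does.

```python
import ipaddress
from urllib.parse import unquote

# Host string with the zone delimiter encoded as %25, as the draft RFC asks.
raw_host = "fe80::822a:a8ff:fe49:470c%25тест%42"

# Without decoding, the zone id would be '25тест%42', which contains a
# second '%' and is therefore rejected.
try:
    ipaddress.ip_address(raw_host)
    raw_accepted = True
except ValueError:
    raw_accepted = False

# After decoding, %25 becomes the '%' delimiter and %42 becomes 'B'.
decoded = unquote(raw_host)
ip = ipaddress.ip_address(decoded)

print(decoded)
print(ip.scope_id)
```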

Because the Python implementation follows WHATWG and doesn't apply percent decoding, I am strongly inclined to just remove the test altogether. We don't have to explicitly raise an error if you have a zone ID, but I'd rather not have to special-case these either. At best we should remove the %42 portion from the test string and add a comment explaining that zone id support is minimal, best effort only.

Note on URL() accepting 'unencoded' values

URL(string) is supposed to accept "unencoded" values. However, at this stage the difference between URL(string) and URL(string, encoded=True) is, by my reading, only supposed to matter when using non-ASCII URLs that are otherwise valid; encoded=False will not magically make ambiguous strings with square brackets in the netloc portion of the URL parseable. That's what the URL.build() class method is for.

Actual bug: Human-readable URLs

The final failing test is test_url.py::test_human_repr_delimiters, which attempts to create a human-readable version of a URL with both the username and the password containing square brackets.

The fix is to add [ and ] to the set of characters that _human_quote() should percent encode. We should probably reference the WHATWG URL spec section on percent encoding bytes. That section explicitly lists what codepoints in what components should be percent encoded. I think that would work better, provided we still omit codepoints over U+007E (~).
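A minimal sketch of the idea, assuming a helper that takes the string and a set of unsafe characters: human_quote_sketch is a hypothetical stand-in, not yarl's actual _human_quote(), and the unsafe set used here is only an example. Non-ASCII text is left readable, while the listed delimiters (now including [ and ]) are percent-encoded.

```python
from urllib.parse import unquote


def human_quote_sketch(value, unsafe):
    """Percent-encode '%' and every character in `unsafe`; keep the rest.

    `unsafe` must not itself contain '%', which is always handled first so
    that already-produced escapes are not encoded a second time.
    """
    for ch in "%" + unsafe:
        value = value.replace(ch, f"%{ord(ch):02X}")
    return value


# Example unsafe set for the userinfo part, with the brackets added.
USERINFO_UNSAFE = "/?#:@[]"

user = human_quote_sketch("na[me]", USERINFO_UNSAFE)
host_like = human_quote_sketch("хост[1]", "[]")

# Decoding the result gives back the original text.
round_trip = unquote(user)
```

With the brackets encoded, the human-readable string no longer contains a stray [ or ] in the netloc, so re-parsing it does not trigger the bracketed-host check.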

Renaming some of the tests, dropping some others.

The test suite includes a few more somewhat odd tests that use square brackets, test_url_parsing.py::TestUserInfo::test_strange_ip_2 and test_url_parsing.py::TestUserInfo::test_strange_ip_3, both of which pass.

The first one parses a URL with an IPvFuture address as specified in RFC 3986 section 3.2.2, so it is not 'strange'. I'll rename the test.

The second test uses the invalid hostname v1.[::1], which urlsplit() really should just refuse to parse but doesn't. Instead, the parser simply ignores the part before the opening [, so hostname is ::1 but netloc is v1.[::1]. I consider this a bug in urllib.parse. I'm dropping the test, as it is not sensible behaviour.
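The behaviour described above can be modelled with a simplified sketch. Both functions are hypothetical illustrations that operate on a netloc with userinfo already removed; neither is the actual urllib.parse code. The first takes whatever sits between the brackets, and the second shows what a stricter parser could do.

```python
def naive_bracketed_host(netloc):
    """Return the text between the first '[' and the next ']'.

    Anything before the '[' is silently ignored.
    """
    return netloc.partition("[")[2].partition("]")[0]


def strict_bracketed_host(netloc):
    """Like the naive version, but reject misplaced brackets.

    Nothing may precede '[', and only a ':port' suffix may follow ']'.
    Assumes the netloc contains a bracketed host at all.
    """
    before, _, rest = netloc.partition("[")
    host, closing, after = rest.partition("]")
    if before or not closing or (after and not after.startswith(":")):
        raise ValueError(f"misplaced brackets in {netloc!r}")
    return host


lenient = naive_bracketed_host("v1.[::1]")
well_formed = strict_bracketed_host("[::1]:8080")

try:
    strict_bracketed_host("v1.[::1]")
    rejected = False
except ValueError:
    rejected = True
```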

Well, in case you are looking for my input, I'm afraid I can't help much. I'm merely in the role of a packager (for Gentoo) here, and I don't know anything about yarl, besides the fact that some other packages need it.

Not a problem! I'm just documenting my thoughts on this, to facilitate discussion and future reference when people come asking why their specific URL no longer is being accepted or why behaviour changed.