uriparser / uriparser

:hocho: Strictly RFC 3986 compliant URI parsing and handling library written in C89; moved from SourceForge to GitHub

Home Page: https://uriparser.github.io/

Failed to parse non-ASCII URLs

artemyarulin opened this issue

Hi, according to the docs UTF should be supported, but the following code fails with both std::string and std::wstring.

#include <string>
#include <uriparser/Uri.h>

// Narrow-string variant: parsing fails on the verbatim "ä"
std::string exampleS = "http://ä.com";
UriUriA uriS;
const char *errorPosS;
if (uriParseSingleUriA(&uriS, exampleS.c_str(), &errorPosS) == URI_SUCCESS) {
  return "OK";
}

// Wide-string variant: same result
std::wstring exampleW = L"http://ä.com";
UriUriW uriW;
const wchar_t *errorPosW;
if (uriParseSingleUriW(&uriW, exampleW.c_str(), &errorPosW) == URI_SUCCESS) {
  return "OK";
}

I wonder if I'm doing something wrong here? If the URL is encoded as http://%C3%A4.com it parses just fine, but I would guess decoded URLs should be supported as well?
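
For reference, a minimal sketch of that percent-encoded case, with the cleanup call uriFreeUriMembersA added (the snippet above omits it):

#include <uriparser/Uri.h>

int main(void) {
  UriUriA uri;
  const char *errorPos;
  // The same host, pre-encoded as UTF-8 percent escapes, parses without error.
  int rc = uriParseSingleUriA(&uri, "http://%C3%A4.com", &errorPos);
  if (rc == URI_SUCCESS) {
    uriFreeUriMembersA(&uri);  // release memory the parser allocated
  }
  return (rc == URI_SUCCESS) ? 0 : 1;
}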

Thanks

Hi!

I should fix the use of the word "Unicode" in the docs. You cannot have a verbatim ä in a URI and conform to RFC 3986 at the same time. Percent-encoding may be an option inside URIs, but be aware that the software putting the encoded ä in and the software decoding it back to ä need to agree on the specific character encoding, because who says that 0xC3 followed by 0xA4 is not U+00C3 followed by U+00A4?
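
To make that byte-level ambiguity concrete, here is a minimal sketch using uriEscapeA and uriUnescapeInPlaceA; both work on raw bytes and carry no character-encoding information:

#include <stdio.h>
#include <string.h>
#include <uriparser/Uri.h>

int main(void) {
  // The UTF-8 bytes of "ä" are 0xC3 0xA4; read as Latin-1, the very same
  // two bytes would be the two characters U+00C3 U+00A4 ("Ã¤").
  const char utf8[] = "\xC3\xA4";

  // uriEscapeA percent-encodes raw bytes and does not record which
  // character encoding produced them. The output buffer must hold up to
  // three characters per input byte plus the terminator.
  char escaped[3 * sizeof(utf8) + 1];
  uriEscapeA(utf8, escaped, URI_FALSE, URI_FALSE);
  printf("escaped: %s\n", escaped);  // the two bytes become two %XX escapes

  // Unescaping restores the raw bytes; deciding whether they mean "ä"
  // (UTF-8) or "Ã¤" (Latin-1) is entirely up to the consumer.
  char roundtrip[sizeof(escaped)];
  strcpy(roundtrip, escaped);
  uriUnescapeInPlaceA(roundtrip);
  printf("same bytes again: %d\n", memcmp(roundtrip, utf8, 2) == 0);
  return 0;
}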

By any chance, are you looking for IRIs (RFC 3987) or Punycode?

Best, Sebastian

Thanks for your reply and links - I'll have a look.

Basically I just have a stream of URLs coming from a crawler, and I normalise them (uriNormalizeSyntaxW, uriUnescapeInPlaceW, etc.) to convert all URLs to one common format so I can search for duplicates and process them further. It kind of works, if we ignore the UTF issue, but the problem is that after such normalisation I cannot parse the URLs again using uriparser.

I guess I should use a different common format.
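
One possible direction, sketched under the assumption that keeping URLs percent-encoded is acceptable as the common format: run uriNormalizeSyntaxA but skip uriUnescapeInPlaceA, so the canonical string can always be fed back into the parser. The helper name normalizeKeepingEscapes is made up for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <uriparser/Uri.h>

// Hypothetical helper: normalise a URI but keep percent escapes intact,
// so the result remains valid RFC 3986 input for uriParseSingleUriA.
static char *normalizeKeepingEscapes(const char *text) {
  UriUriA uri;
  const char *errorPos;
  int charsRequired;
  char *out = NULL;

  if (uriParseSingleUriA(&uri, text, &errorPos) != URI_SUCCESS) {
    return NULL;
  }
  // Case folding, dot-segment removal, and so on, but no uriUnescapeInPlaceA,
  // so non-ASCII characters stay as %XX escapes.
  if (uriNormalizeSyntaxA(&uri) == URI_SUCCESS
      && uriToStringCharsRequiredA(&uri, &charsRequired) == URI_SUCCESS) {
    out = malloc(charsRequired + 1);
    if (out != NULL
        && uriToStringA(out, &uri, charsRequired + 1, NULL) != URI_SUCCESS) {
      free(out);
      out = NULL;
    }
  }
  uriFreeUriMembersA(&uri);
  return out;  // caller frees
}

int main(void) {
  char *canonical = normalizeKeepingEscapes("HTTP://%c3%a4.COM/a/../b");
  if (canonical != NULL) {
    printf("%s\n", canonical);  // e.g. "http://%C3%A4.com/b"
    free(canonical);
  }
  return 0;
}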

Thanks again for your support