Failed to parse non-ASCII URLs
artemyarulin opened this issue
Hi, according to the docs Unicode should be supported, but the following code fails with both std::string and std::wstring:
#include <uriparser/Uri.h>
#include <string>

// Narrow-char variant -- fails at the "ä":
std::string exampleS = "http://ä.com";
UriUriA uriS;
const char *errorPosS;
if (uriParseSingleUriA(&uriS, exampleS.c_str(), &errorPosS) == URI_SUCCESS) {
    return "OK";  // never reached; errorPosS points at the non-ASCII byte
}

// Wide-char variant -- fails the same way:
std::wstring exampleW = L"http://ä.com";
UriUriW uriW;
const wchar_t *errorPosW;
if (uriParseSingleUriW(&uriW, exampleW.c_str(), &errorPosW) == URI_SUCCESS) {
    return "OK";  // never reached; errorPosW points at the non-ASCII character
}
Am I doing something wrong here? If the URL is encoded as http://%C3%A4.com it parses just fine, but I would expect decoded URLs to be supported as well?
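For comparison, here is a minimal standalone sketch of the check I did with the encoded form, which succeeds:

#include <uriparser/Uri.h>
#include <cstdio>

int main() {
    UriUriA uri;
    const char *errorPos;
    // The percent-encoded form is plain ASCII, so it parses fine
    if (uriParseSingleUriA(&uri, "http://%C3%A4.com", &errorPos) == URI_SUCCESS) {
        std::puts("OK");            // this branch is taken
        uriFreeUriMembersA(&uri);   // release what the parser allocated
    }
    return 0;
}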
Thanks
Hi!
I should fix the use of the word "Unicode" in the docs. You cannot have a verbatim ä in a URI and conform to RFC 3986 at the same time. Percent-encoding may be an option inside URIs, but be aware that the software putting the encoded ä in and the software decoding it back to ä need to agree on the specific encoding, because who says that 0xC3 followed by 0xA4 is not U+00C3 followed by U+00A4?
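To make that concrete, here is a tiny sketch of my own using uriEscapeA (the output buffer needs up to 3x the input size):

#include <uriparser/Uri.h>
#include <cstdio>

int main() {
    // The same two bytes are UTF-8 for U+00E4 ("ä") *and*
    // Latin-1 for U+00C3 U+00A4 ("Ã¤") -- the bytes alone cannot tell you which.
    const char bytes[] = "\xC3\xA4";
    char escaped[3 * sizeof(bytes) + 1];
    uriEscapeA(bytes, escaped, URI_FALSE, URI_FALSE);
    std::printf("%s\n", escaped);  // prints "%C3%A4" -- no charset information survives
    return 0;
}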
By any chance, are you looking for IRIs/RFC-3987 or Punycode?
Best, Sebastian
Thanks for your reply and links - I'll have a look.
Basically I have a stream of URLs coming from a crawler, and I normalise them (uriNormalizeSyntaxW, uriUnescapeInPlaceW, etc.) to convert all URLs to one common format so I can search for duplicates and process them further. It mostly works, if we ignore the UTF issue, but the problem is that after such normalisation I cannot parse a URL again using uriparser.
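For reference, a stripped-down sketch of what I do now (names and error handling are mine); if I drop the uriUnescapeInPlaceW step, the output stays pure RFC 3986 and re-parses fine:

#include <uriparser/Uri.h>
#include <string>
#include <vector>

// Parse -> normalize -> recompose, *without* unescaping, so the result
// stays valid per RFC 3986 and can be fed back into uriParseSingleUriW.
bool normalizeUrl(const std::wstring &in, std::wstring &out) {
    UriUriW uri;
    const wchar_t *errorPos;
    if (uriParseSingleUriW(&uri, in.c_str(), &errorPos) != URI_SUCCESS)
        return false;
    bool ok = uriNormalizeSyntaxW(&uri) == URI_SUCCESS;
    int charsRequired = 0;
    ok = ok && uriToStringCharsRequiredW(&uri, &charsRequired) == URI_SUCCESS;
    if (ok) {
        // charsRequired excludes the terminator, so reserve one extra wchar_t
        std::vector<wchar_t> buffer(charsRequired + 1);
        ok = uriToStringW(buffer.data(), &uri, charsRequired + 1, NULL) == URI_SUCCESS;
        if (ok)
            out.assign(buffer.data());
    }
    uriFreeUriMembersW(&uri);
    return ok;
}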
I guess I should use a different common format.
Thanks again for your support