uriparser / uriparser

:hocho: Strictly RFC 3986 compliant URI parsing and handling library written in C89; moved from SourceForge to GitHub

Home Page: https://uriparser.github.io/

uriEscapeW broken on Windows?

ahajishafieha opened this issue · comments

Hello.
It appears that the current implementation of the uriEscapeW function does not fully work with wchar_t strings on Windows.
This is probably caused by the 16-bit wchar_t type on Windows, where some characters have to be split into two half-characters.

Test case provided and tested with MinGW and MSVC on Windows 10:

#include <stdio.h>
#include <wchar.h>
#include <uriparser/Uri.h>
#include <fcntl.h>
#include <io.h>

int main()
{
	/* Put stdout into Unicode (UTF-8) mode so the wide strings display correctly */
	_setmode(_fileno(stdout), _O_U8TEXT);
	wchar_t original[] = L"こんにちは";
	wchar_t encoded[64] = { 0 };
	wchar_t decoded[64] = { 0 };
	/* Escape the wide string, then round-trip the result through unescaping */
	uriEscapeW(original, encoded, URI_FALSE, URI_FALSE);
	wcscpy(decoded, encoded);
	uriUnescapeInPlaceW(decoded);
	wprintf(L"original: %ls\nencoded: %ls\ndecoded: %ls\n", original, encoded, decoded);
	return 0;
}

Resulting output:

original: こんにちは
encoded: %53%93%6B%61%6F
decoded: Skao

Expected output:

original: こんにちは
encoded: %E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF
decoded: こんにちは

Hi @ahajishafieha,

the API docs of functions uriEscapeA and uriEscapeW state that the function…

…[p]ercent-encodes all unreserved characters from the input string…

…whereas "unreserved characters" is a term and character set defined by RFC 3986 section 2.3. That set does not include the characters you mention, you probably think IRI rather than URI. While it's probably not the hardest to implement, uriparser is about URIs (RFC 3986) not IRIs (RFC 3987) and hence implementing it here would be out of place and would be a backwards-incompatible change also, both in terms of producing different input than in the past but also in terms of space requirements for the output buffer.

If this is the only place where you need IRI-ish behavior with URI handling, you may be fine with using a hand-made substitute for uriEscapeW in your code base. If, however, you need full IRI support anyway, you would need a whole different library.

What do you think?

Thank you for your response.
I will admit that I have not fully read the RFCs for either URI or IRI. However, and correct me if I'm mistaken, my understanding is that any non-ASCII character must be percent-encoded to produce a valid URI.
As per RFC 3986 Section 2.1:

A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component.

My understanding is that all non-ASCII characters (which are categorized as neither reserved nor unreserved per sections 2.2 & 2.3) need to be percent-encoded.

So going by an example pulled from Wikipedia:

Example: The IRI https://en.wiktionary.org/wiki/Ῥόδος becomes the URI https://en.wiktionary.org/wiki/%E1%BF%AC%CF%8C%CE%B4%CE%BF%CF%82
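(Here Ῥ is U+1FEC; its UTF-8 encoding is the three octets 0xE1 0xBF 0xAC, which is exactly where the leading %E1%BF%AC of the URI comes from, with the remaining octets handled the same way.)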

Which is what I've grown accustomed to when dealing with URIs.

My current dirty workaround is converting the wchar_t string to a char string, using the uriEscapeA function, and then converting the result back into a wchar_t string. The end result is percent-encoded the way I was expecting.
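A minimal sketch of that kind of workaround, assuming the Win32 pair WideCharToMultiByte / MultiByteToWideChar with CP_UTF8 handles the conversions (the conversion routine actually used above is not shown):

#include <stdio.h>
#include <wchar.h>
#include <windows.h>
#include <uriparser/Uri.h>

int main(void)
{
	const wchar_t original[] = L"こんにちは";
	char utf8[64] = { 0 };
	char escaped[256] = { 0 };
	wchar_t wide[256] = { 0 };

	/* UTF-16 -> UTF-8, so that uriEscapeA sees one octet per array element */
	WideCharToMultiByte(CP_UTF8, 0, original, -1, utf8, sizeof(utf8), NULL, NULL);

	/* Every octet outside the unreserved set becomes %XX */
	uriEscapeA(utf8, escaped, URI_FALSE, URI_FALSE);

	/* The escaped string is pure ASCII, so converting back to wchar_t is lossless */
	MultiByteToWideChar(CP_UTF8, 0, escaped, -1, wide, sizeof(wide) / sizeof(wide[0]));

	wprintf(L"%ls\n", wide);  /* %E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF */
	return 0;
}

(Error checking is omitted for brevity.)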

To reiterate, I'm expecting the string "こんにちは" to be encoded as "%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF", which does not happen with uriEscapeW, but does happen with uriEscapeA.

In either case, even if it breaks backwards compatibility, the fact that uriEscapeW and uriEscapeA behave differently given the same input should be worth consideration.
Furthermore, in the current implementation, if the input passed to uriEscapeW contains any such characters, the end result can't be considered anything other than corrupted, as it doesn't represent the original input in any shape. Is that the intended behaviour?

@ahajishafieha it seems important to note that, because RFC 3986 uses a subset of ASCII, uriparser uses char and wchar_t in a way where every unit must be one code point from the ASCII range: more a list of raw code points as integers than an encoding. (This is also why the test case above prints %53%93%6B%61%6F: each 16-bit unit gets escaped as if it were a single octet, so only the low byte of each character survives.) If uriparser wanted to transform arbitrary wchar_t beyond ASCII into percent-encoding, it would:

  • need to take endianness into account,
  • need to take encoding into account,
  • need to take the variance in size of wchar_t across platforms into account.

That list hopefully shows what kind of box we'd be opening here.

I think there are two potential ways forward:

  • a) add a new function uriEscapeExExW that handles cases like yours and many others
  • b) extend the documentation to be more clear that this scenario is not supported.

(a) does not have a good cost-to-benefit ratio; (b) can be done but probably doesn't help you much. Are you aware of any other options?

Yeah, I did some research, and I think converting input from UTF-16 wchar_t to UTF-8 before percent-encoding it would be too much work to implement if we don't want to stray from ANSI C.

I think the best solution is to warn users (especially on the Windows platform) that uriparser only accepts UTF-8 input in its wide-char functions, and maybe add a suggestion to use the "WideCharToMultiByte" Win32 API function to convert strings from UTF-16 to UTF-8 prior to calling uriparser functions on that platform.

Thank you

@ahajishafieha I have created pull request #175 to improve the documentation. I am not at home on Windows, and I expect that developers on Windows will find the functions needed for text-encoding conversion by themselves, given how important they are; so I did not mention WideCharToMultiByte on purpose, both to avoid being wrong and to avoid making the docs too much about Windows, since any platform using wchar_t would be affected here, including Linux. With that out of the way, is #175 good enough in your opinion?