uriparser / uriparser

Strictly RFC 3986 compliant URI parsing and handling library written in C89; moved from SourceForge to GitHub

Home Page: https://uriparser.github.io/

Trailing slash before '?' results in library not detecting end of path?

scurtis142 opened this issue

Hi there,
When you add a trailing slash at the end of the URI path component, the parse function parses it successfully. However, the pointer uri.pathTail->text.afterLast does not point to the next character of the input; instead it points to the allocated string "X".

I can't find anything in RFC 3986 that says a trailing slash at the end of the URI path component is invalid.

This bug was found using version 0.9.5.

The below example explains more clearly. strA and strB differ only in the extra '/' before the '?'.

The first asserteq_str passes but the second one fails.

   test ("pathTail") {
      uri_t uriA, uriB;
      const char *const strA = "https://example.com/path?key=value";
      const char *const strB = "https://example.com/path/?key=value";

      asserteq (uriParseSingleUriA (&uriA, strA, NULL), URI_SUCCESS);
      asserteq (uriParseSingleUriA (&uriB, strB, NULL), URI_SUCCESS);

      asserteq_str (uriA.pathTail->text.afterLast, "?key=value");
      asserteq_str (uriB.pathTail->text.afterLast, "?key=value");

      uriFreeUriMembersA (&uriA);
      uriFreeUriMembersA (&uriB);
   }
✕ Failed:  pathTail:
    (str) Expected uriB.pathTail->text.afterLast to equal "?key=value", but got "X".
    in parse.c:210(uri_parsing)

Also note that these tests get run through valgrind, which is picking up leaked memory, even after the uriFreeUriMembersA function is called.

==37230== 32 bytes in 1 blocks are indirectly lost in loss record 3 of 8
==37230==    at 0x4858321: calloc (in /usr/local/libexec/valgrind/vgpreload_memcheck-amd64-freebsd.so)
==37230==    by 0x311F2A: uriParsePartHelperTwoA (in /tmp/snc/akips/lib/test-runner)
==37230==    by 0x30F89F: uriParseUriExMmA (in /tmp/snc/akips/lib/test-runner)
==37230==    by 0x30F94A: uriParseSingleUriA (in /tmp/snc/akips/lib/test-runner)
==37230==    by 0x2E8993: snow_test_uri_parsing (in /tmp/snc/akips/lib/test-runner)
==37230==    by 0x246FA0: snow_main_function (in /tmp/snc/akips/lib/test-runner)
==37230==    by 0x246B2F: main (in /tmp/snc/akips/lib/test-runner)
==37230== 
==37230== 32 bytes in 1 blocks are definitely lost in loss record 4 of 8
==37230==    at 0x4858321: calloc (in /usr/local/libexec/valgrind/vgpreload_memcheck-amd64-freebsd.so)
==37230==    by 0x311F2A: uriParsePartHelperTwoA (in /tmp/snc/akips/lib/test-runner)
==37230==    by 0x30F89F: uriParseUriExMmA (in /tmp/snc/akips/lib/test-runner)
==37230==    by 0x30F94A: uriParseSingleUriA (in /tmp/snc/akips/lib/test-runner)
==37230==    by 0x2E888A: snow_test_uri_parsing (in /tmp/snc/akips/lib/test-runner)
==37230==    by 0x246FA0: snow_main_function (in /tmp/snc/akips/lib/test-runner)
==37230==    by 0x246B2F: main (in /tmp/snc/akips/lib/test-runner)
==37230== 
==37230== 64 (32 direct, 32 indirect) bytes in 1 blocks are definitely lost in loss record 6 of 8
==37230==    at 0x4858321: calloc (in /usr/local/libexec/valgrind/vgpreload_memcheck-amd64-freebsd.so)
==37230==    by 0x311F2A: uriParsePartHelperTwoA (in /tmp/snc/akips/lib/test-runner)
==37230==    by 0x30F89F: uriParseUriExMmA (in /tmp/snc/akips/lib/test-runner)
==37230==    by 0x30F94A: uriParseSingleUriA (in /tmp/snc/akips/lib/test-runner)
==37230==    by 0x2E8993: snow_test_uri_parsing (in /tmp/snc/akips/lib/test-runner)
==37230==    by 0x246FA0: snow_main_function (in /tmp/snc/akips/lib/test-runner)
==37230==    by 0x246B2F: main (in /tmp/snc/akips/lib/test-runner)
==37230== 

@scurtis142 I will have a closer look, thanks for the report. With regard to .afterLast there may be a misunderstanding, but let's see, I'll get back to you.

For the first half — the .afterLast part:

.afterLast and .first together define a text range: a pointer to the first character and a pointer right after the last character. That way you can compute length := afterLast - first. A text range does not say anything about text outside or following that range. If length is 0, then the range is empty.

With that in mind, let's look at a slightly modified version of your code:

#include <assert.h>
#include <stdio.h>
#include <uriparser/Uri.h>

int main(void) {
  UriUriA uriA, uriB;
  const char *const strA = "https://example.com/path?key=value";
  const char *const strB = "https://example.com/path/?key=value";

  assert(uriParseSingleUriA(&uriA, strA, NULL) == URI_SUCCESS);
  assert(uriParseSingleUriA(&uriB, strB, NULL) == URI_SUCCESS);

  /* Length of the last path segment; %td is the C99 format for ptrdiff_t. */
  printf("uriA.pathTail->text.afterLast - uriA.pathTail->text.first = %td\n",
         uriA.pathTail->text.afterLast - uriA.pathTail->text.first);

  printf("uriB.pathTail->text.afterLast - uriB.pathTail->text.first = %td\n",
         uriB.pathTail->text.afterLast - uriB.pathTail->text.first);

  uriFreeUriMembersA(&uriA);
  uriFreeUriMembersA(&uriB);

  return 0;
}

Now that will get us:

# gcc -std=c99 -Wall -Wextra -pedantic -Iinclude -L. -luriparser issue116.c && LD_LIBRARY_PATH=. ./a.out
uriA.pathTail->text.afterLast - uriA.pathTail->text.first = 4
uriB.pathTail->text.afterLast - uriB.pathTail->text.first = 0

The 4 is the length of the string "path", while the 0 is the length of the empty string that encodes the trailing slash.
We can argue about whether that's good design or what one would expect, but at least that part is not a "bug" with regard to the current design.
Does that make sense?
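
To illustrate why a text range should not be read as a C string, here is a minimal sketch that copies a range into its own NUL-terminated buffer. TextRange and range_dup are hypothetical names, not part of uriparser; TextRange only mirrors the first/afterLast pair discussed above so that the sketch compiles without the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a text range: first character and one past the last. */
typedef struct {
    const char *first;
    const char *afterLast;
} TextRange;

/* Copy exactly the characters inside the range and add the terminator ourselves.
 * Returns a malloc'd string the caller must free, or NULL if allocation fails. */
static char *range_dup(TextRange range) {
    size_t len;
    char *copy;

    assert(range.first != NULL && range.afterLast != NULL);
    len = (size_t)(range.afterLast - range.first);

    copy = malloc(len + 1);
    if (copy == NULL) {
        return NULL;
    }
    memcpy(copy, range.first, len);
    copy[len] = '\0';
    return copy;
}
```

With uriparser, the same idea would presumably apply to uri.pathTail->text: only the characters between first and afterLast belong to the segment, so nothing should be assumed about what afterLast itself points at.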

I'll look into the memleak part of your report next, stay tuned 😃

With regard to valgrind, I don't get any memleaks with Valgrind 3.17. My guess is that the failing assert in your case is preventing proper cleanup, and that this is what causes the leak. If that's true, there is no leak in uriparser. To reproduce:

# gcc -std=c99 -Wall -Wextra -pedantic -Iinclude -L. -luriparser issue116.c && LD_LIBRARY_PATH=. valgrind ./a.out >/dev/null 
==30732== Memcheck, a memory error detector
==30732== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==30732== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==30732== Command: ./a.out
==30732== 
==30732== 
==30732== HEAP SUMMARY:
==30732==     in use at exit: 0 bytes in 0 blocks
==30732==   total heap usage: 6 allocs, 6 frees, 4,200 bytes allocated
==30732== 
==30732== All heap blocks were freed -- no leaks are possible
==30732== 
==30732== For lists of detected and suppressed errors, rerun with: -s
==30732== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Does that make sense? Are you running valgrind a different way?

Thanks for your reply @hartwork.

I was of the understanding that the pointers first and afterLast pointed into the original string. Is this not the case? If not, how come my first test case in the original post passes? Or should you assume nothing about them except that they can be used for a length calculation?

What I need to know is what is the best way of getting the complete URI path.
My previous code was calculating the full length with uri->pathTail->text.afterLast - uri->pathHead->text.first. So if pathHead and pathTail are pointing into different strings, is this not a safe calculation to do?

I can see from your documentation that path segments are defined as a linked list. So is traversing the linked list the only way to get the full path, or is there a simpler way?

Also, your guess about the leaked memory was correct. My apologies. No issue there.

Hi @scurtis142,

> I was of the understanding that the pointers first and afterLast pointed into the original string. Is this not the case? If not, how come my first test case in the original post passes? Or should you assume nothing about them except that they can be used for a length calculation?

One core principle in uriparser is to keep RAM requirements low, e.g. by not copying strings around more than needed. As a result, the URI path segments re-use existing characters from the original input string, i.e. some of the path text ranges point into that string. Empty segments use a "magic" non-NULL string location so that you can do the usual length := afterLast - first math without issues. You couldn't do that with NULL pointers, because subtracting them is undefined behavior according to the C standard.
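
The "magic location" idea can be sketched roughly as follows. This is an illustration under assumptions, not uriparser's actual source: EMPTY_ANCHOR, TextRange, make_empty_range and range_length are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a text range (same field names as in the thread). */
typedef struct {
    const char *first;
    const char *afterLast;
} TextRange;

/* Hypothetical shared location that every empty range points at. */
static const char EMPTY_ANCHOR[] = "X";

/* An empty range: both ends are the same valid, non-NULL address. */
static TextRange make_empty_range(void) {
    TextRange range;
    range.first = EMPTY_ANCHOR;
    range.afterLast = EMPTY_ANCHOR;
    return range;
}

/* Well-defined for empty and non-empty ranges alike, because both
 * pointers always refer to the same underlying array. */
static ptrdiff_t range_length(TextRange range) {
    assert(range.first != NULL && range.afterLast != NULL);
    return range.afterLast - range.first;
}
```

Because an empty range points at the anchor rather than into the input, treating afterLast as a C string, as the original test did, reads the anchor text instead of the remainder of the URI.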

> What I need to know is what is the best way of getting the complete URI path.
> My previous code was calculating the full length with uri->pathTail->text.afterLast - uri->pathHead->text.first. So if pathHead and pathTail are pointing into different strings, is this not a safe calculation to do?

I can see from your documentation that path segments are defined as a linked list. So is traversing the linked list the only way to get the full path, or is there a simpler way?

You'll need a loop I'm afraid, yes. You could check how uriToStringA is doing the same thing under the hood.
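
A minimal sketch of such a loop, assuming a singly linked list of segments shaped like the one discussed in this thread (a text range plus a next pointer). TextRange, PathSegment and join_path are hypothetical local names so that the example compiles without the library; it is not how uriToStringA is implemented. Whether a leading '/' belongs in front of the first segment depends on the URI (for example, on whether it has a host), so the sketch leaves that decision to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins mirroring the range/segment layout from the thread. */
typedef struct {
    const char *first;
    const char *afterLast;
} TextRange;

typedef struct PathSegment {
    TextRange text;
    struct PathSegment *next;
} PathSegment;

/* Join all segments with '/' into a malloc'd, NUL-terminated string.
 * Each segment is measured on its own, so it does not matter whether
 * the segments point into the same buffer. Returns NULL if allocation fails. */
static char *join_path(const PathSegment *head, int leading_slash) {
    const PathSegment *seg;
    size_t total = 0;
    char *out;
    char *p;

    /* Pass 1: room for one separator per segment plus the segment itself. */
    for (seg = head; seg != NULL; seg = seg->next) {
        assert(seg->text.first != NULL && seg->text.afterLast != NULL);
        total += 1 + (size_t)(seg->text.afterLast - seg->text.first);
    }

    out = malloc(total + 1);
    if (out == NULL) {
        return NULL;
    }

    /* Pass 2: copy the segments, writing '/' between them. */
    p = out;
    for (seg = head; seg != NULL; seg = seg->next) {
        size_t len = (size_t)(seg->text.afterLast - seg->text.first);
        if (seg != head || leading_slash) {
            *p++ = '/';
        }
        memcpy(p, seg->text.first, len);
        p += len;
    }
    *p = '\0';
    return out;
}
```

For the URI from this issue, the list would hold the segment "path" followed by an empty segment, and the sketch joins them into "/path/" when a leading slash is requested.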

> Also, your guess about the leaked memory was correct. My apologies. No issue there.

No worries, good to be sure.

Ok, thanks for the help :)