fix8mt / uri

Lightweight header-only C++20 URI parser

uri

A lightweight C++20 URI parser



Quick links

Link        Description
Here        For implementation
Examples    For examples
Building    How to build or include
API         API details
Testing     Testing apps
Benchmark   Benchmark info
Discussion  Discussion

Tip

Use the built-in table of contents to navigate this guide.


1. Introduction

This is a lightweight URI parser implementation featuring zero-copy, minimal storage and high performance.

Motivation

  • header-only
  • zero-copy where possible (base class uses views only)
  • no external dependencies
  • simplicity, lightweight
  • make use of C++20 features
  • entirely constexpr
  • high performance

Features

  • single header-only
  • fast, very lightweight, predictive non-brute-force parser: avg 52ns to decode a URI with basic_uri
  • base class is zero-copy, using std::string_view
  • derived class moves (or copies) source string once
  • all methods constexpr; no virtual methods
  • extracts all components scheme, authority, userinfo, user, password, host, port, path, query, fragment
  • query components returned as std::string_view
  • query decode and search; segment decode; no copying, all results point to uri source
  • small memory footprint - base class object is only 64 bytes
  • support for dynamic or static uri storage
  • built-in unit test cases with exhaustive test URI cases; simple test case addition
  • normalization (RFC 3986)
  • cmake integration with FetchContent

2. Examples

i. Use basic_uri as a view

This example parses a list of URI strings and prints the host component of each. basic_uri creates a no-copy view of the source. get_component() returns a std::string_view.

source

#include <iostream>
#include <array>
#include <fix8/uri.hpp>
using namespace FIX8;

int main(int argc, char *argv[])
{
   using namespace std::literals;
   static constexpr std::array uris
   {
      "https://www.blah.com:3000/test"sv,
      "https://dakka@www.staylate.net:3000/"sv,
      "https://www.buyexample.com/over/there?name=ferret&time=any#afrag"sv,
   };
   for (const auto& pp : uris)
      std::cout << basic_uri(pp).get_component<uri::host>() << '\n';
   return 0;
}

output

$ ./example1
www.blah.com
www.staylate.net
www.buyexample.com
$

ii. Using a uri

This example parses a URI string and prints out all the contained elements. Then individual components are queried and printed if present.

source

#include <iostream>
#include <fix8/uri.hpp>
using namespace FIX8;

int main(int argc, char *argv[])
{
   const uri u1 {"http://nodejs.org:89/docs/latest/api/foo/bar/qua/13949281/0f28b/5d49/b3020/url.html"
      "?payload1=true&payload2=false&test=1&benchmark=3&foo=38.38.011.293"
      "&bar=1234834910480&test=19299&3992&key=f5c65e1e98fe07e648249ad41e1cfdb0#test"};

   std::cout << u1 << '\n';

   std::cout << u1.get_authority() << '\n'
      << u1.get_host() << '\n'
      << u1.get_port() << '\n'
      << u1.get_query() << '\n'
      << u1.get_fragment() << '\n';
   if (u1.has_user()) // should be no user
      std::cout << u1.get_user() << '\n';
   auto result{u1.decode_query(true)}; // sort result
   std::cout << "key = " << uri::find_query("key", result) << '\n';
   return 0;
}

output

$ ./example2
uri         http://nodejs.org:89/docs/latest/api/foo/bar/qua/13949281/0f28b/5d49/b3020/url.html?payload1=true&payload2=false&test=1&benchmark=3&foo=38.38.011.293&bar=1234834910480&test=19299&3992&key=f5c65e1e98fe07e648249ad41e1cfdb0#test
scheme      http
authority   nodejs.org:89
host        nodejs.org
port        89
path        /docs/latest/api/foo/bar/qua/13949281/0f28b/5d49/b3020/url.html
   docs
   latest
   api
   foo
   bar
   qua
   13949281
   0f28b
   5d49
   b3020
   url.html
query       payload1=true&payload2=false&test=1&benchmark=3&foo=38.38.011.293&bar=1234834910480&test=19299&3992&key=f5c65e1e98fe07e648249ad41e1cfdb0
   payload1    true
   payload2    false
   test        1
   benchmark   3
   foo         38.38.011.293
   bar         1234834910480
   test        19299
   3992        (empty)
   key         f5c65e1e98fe07e648249ad41e1cfdb0
fragment    test

nodejs.org:89
nodejs.org
89
payload1=true&payload2=false&test=1&benchmark=3&foo=38.38.011.293&bar=1234834910480&test=19299&3992&key=f5c65e1e98fe07e648249ad41e1cfdb0
test
key = f5c65e1e98fe07e648249ad41e1cfdb0
$

iii. Using a uri_static

Create a static URI with a maximum storage of 256 bytes, from the supplied string. Print out the result and max storage.

source

#include <iostream>
#include <fix8/uri.hpp>
using namespace FIX8;

int main(int argc, char *argv[])
{
   uri_static<256> u1{"mailto:John.Smith@example.com"};
   std::cout << u1 << '\n';
   std::cout << "max storage " << u1.max_storage() << '\n';
   return 0;
}

output

$ ./example3
uri         mailto:John.Smith@example.com
scheme      mailto
path        John.Smith@example.com
max storage 256
$

iv. Use factory or format

Create a URI from an initializer list or using format specification. Note we chose to percent-encode a part of the path. Print out the results. Both the factory and format versions produce the same uri.

source (factory)

#include <iostream>
#include <fix8/uri.hpp>
using namespace FIX8;
using enum uri::component;

int main(int argc, char *argv[])
{
   const auto u1 { uri::factory({{scheme, "https"}, {user, "dakka"}, {host, "www.blah.com"}, {port, "3000"},
      {path, "/foo/" + basic_uri::encode_hex("this path has embedded spaces") + "/test"}}) };
   std::cout << u1 << '\n';
   return 0;
}

output (factory)

$ ./example4
uri         https://dakka@www.blah.com:3000/foo/this%20path%20has%20embedded%20spaces/test
scheme      https
authority   dakka@www.blah.com:3000
userinfo    dakka
user        dakka
host        www.blah.com
port        3000
path        /foo/this%20path%20has%20embedded%20spaces/test
   foo
   this%20path%20has%20embedded%20spaces
   test
$

source (format)

#include <iostream>
#include <fix8/uri.hpp>
using namespace FIX8;

int main(int argc, char *argv[])
{
   const auto u1 { uri::format("{}://{}@{}:{}/{}/{}/{}", "https", "dakka", "www.blah.com", "3000", "foo",
      basic_uri::encode_hex("this path has embedded spaces"), "test") };
   std::cout << u1 << '\n';
   return 0;
}

output (format)

$ ./example4
uri         https://dakka@www.blah.com:3000/foo/this%20path%20has%20embedded%20spaces/test
scheme      https
authority   dakka@www.blah.com:3000
userinfo    dakka
user        dakka
host        www.blah.com
port        3000
path        /foo/this%20path%20has%20embedded%20spaces/test
   foo
   this%20path%20has%20embedded%20spaces
   test
$

v. Using edit

Create a URI and then edit it.

source

#include <iostream>
#include <fix8/uri.hpp>
using namespace FIX8;
using enum uri::component;

int main(int argc, char *argv[])
{
   uri u1 { "https://dakka@www.blah.com:3000" };
   std::cout << u1 << '\n';
   u1.edit({{port, "80"}, {user, ""}, {path, "/newpath"}});
   std::cout << '\n' << u1 << '\n';
   return 0;
}

output

$ ./example5
uri         https://dakka@www.blah.com:3000
scheme      https
authority   dakka@www.blah.com:3000
userinfo    dakka
user        dakka
host        www.blah.com
port        3000
path        (empty)

uri         https://www.blah.com:80/newpath
scheme      https
authority   www.blah.com:80
host        www.blah.com
port        80
path        /newpath
$

3. Building

This implementation is header only. Apart from standard C++20 includes there are no external dependencies needed in your application. Catch2 is used for the built-in unit tests. Criterion is used for benchmarking.

i. Obtaining the source, building the examples

To clone and default build the test app, unit tests and the benchmark:

git clone https://github.com/fix8mt/uri.git
cd uri
mkdir build
cd build
cmake ..
make -j4
make test (or ctest)

To disable building the benchmarks, pass the following switch to cmake:

# cmake -DBUILD_BENCHMARKS=false ..

ii. Using in your application with cmake

In CMakeLists.txt set your include path to:

include_directories([uri directory]/include)
# e.g.
set(uridir /home/dd/prog/uri)
include_directories(${uridir}/include)

and just include:

#include <fix8/uri.hpp>

in your application. Everything in this class is within the namespace FIX8, so you can add:

using namespace FIX8;

This is C++20, so you can also add in your local code:

using enum uri::component;

iii. Integrating uri in your project with cmake FetchContent

You can use cmake FetchContent to integrate uri with your project. If your project was called myproj with the sourcefile myproj.cpp then...

project(myproj)
add_executable (myproj myproj.cpp)
set_target_properties(myproj PROPERTIES CXX_STANDARD 20 CXX_STANDARD_REQUIRED true)
message(STATUS "Downloading uri...")
include(FetchContent)
FetchContent_Declare(uri GIT_REPOSITORY https://github.com/fix8mt/uri.git)
FetchContent_MakeAvailable(uri)
target_include_directories(myproj PRIVATE ${uri_SOURCE_DIR}/include)

4. API

i. Class hierarchy

basic_uri

The base class basic_uri performs the bulk of the work, holding a std::string_view of the source uri string. If you wish to manage the scope of the source uri yourself then this class is the most efficient way to use uri functionality.

basic_uri u1{"https://www.example.com:8080/path1"};

uri_base

This class is aliased by uri and uri_static. You can inherit from this class if you wish to specialise further.


uri

The derived class uri stores the source string and then builds a basic_uri using that string as its reference. uri derives from basic_uri and a private dynamic storage class uri_storage. The supplied string is moved or copied and stored by the object. If your application needs the uri to hold and persist the source uri, this class is suitable. The storage class used is a specialisation of uri_storage which specifies 0 as the non-type parameter sz, selecting dynamic storage.

std::string myuri;
.
.
.
uri u1{myuri};

class diagram


uri_static

The derived class uri_static stores the source string and then builds a basic_uri using that string as its reference. uri_static derives from basic_uri and a private static storage class uri_storage. The supplied string is moved or copied and stored by the object. The class is templated by the non-type parameter sz which sets the static size and maximum storage capacity for the uri. sz defaults to 1024. Storage is allocated once with the object in a std::array. No dynamic memory is used. If your application needs the uri to hold and persist the source uri statically (for example in another container), this class is suitable.

std::string myuri;
.
.
.
uri_static<256> u1{myuri};
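
The difference between dynamic and static storage can be pictured with a small sketch. This is not the library's uri_storage; the class name storage_sketch and the truncation behaviour are assumptions made for illustration (a real implementation would more likely report an error such as too_long). It only shows how a non-type parameter of 0 can select a std::string while any other value selects a fixed std::array.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical static storage: sz bytes live inside the object, no heap use.
template<std::size_t sz>
class storage_sketch
{
   std::array<char, sz> _buffer{};
   std::size_t _len{};

public:
   // Assumption: an over-long source is truncated to sz bytes.
   explicit storage_sketch(std::string_view src) : _len(std::min(src.size(), sz))
   {
      std::copy_n(src.begin(), _len, _buffer.begin());
   }
   std::string_view buffer() const { return { _buffer.data(), _len }; }
   static constexpr std::size_t max_storage() { return sz; }
};

// Hypothetical dynamic storage, selected by sz == 0: the string is moved in.
template<>
class storage_sketch<0>
{
   std::string _buffer;

public:
   explicit storage_sketch(std::string src) : _buffer(std::move(src)) {}
   std::string_view buffer() const { return _buffer; }
};
```

With this sketch, storage_sketch<256> holds its bytes inline, so it can be placed in another container without any allocation of its own, while storage_sketch<0> owns a heap-allocated std::string.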

class diagram (static)

ii. Types

component

enum component { scheme, authority, userinfo, user, password, host, port, path, query, fragment, countof };

Components are named by a public enum called component. Note that when user or password is present, userinfo is populated as well.

other types

Type          Typedef of                                    Description
uri_len_t     std::uint16_t                                 the integral type used to store offsets and lengths
value_pair    std::pair<std::string_view,std::string_view>  used to return tag value pairs
query_result  std::vector<value_pair>                       used to return a collection of query pairs
range_pair    std::pair<uri_len_t,uri_len_t>                used to store offset and length
comp_pair     std::pair<component, std::string_view>        used by factory to pass individual component pairs
comp_list     std::vector<std::string_view>                 used by factory, edit and make_source to pass individual component values; each position in the vector corresponds to the component index
segments      same as comp_list                             used by decode_segments
port_pair     same as value_pair                            used by find_port
error         enum class error : uri_len_t { no_error, too_long, illegal_chars, empty_src, countof };  error types

consts

Const Description
uri_max_len the maximum length of a supplied uri

iii. Construction and destruction

ctor

class basic_uri;
constexpr basic_uri(std::string_view src);                           // (1)
constexpr basic_uri(int bits);                                       // (2)
constexpr basic_uri();                                               // (3)

class uri;
constexpr uri(std::string src);                                      // (4)
constexpr uri(std::string_view src);                                 // (5)
constexpr uri(const char *src);                                      // (6)
constexpr uri();                                                     // (7)

template<size_t sz>
class uri_static;
constexpr uri_static(std::string src);                               // (8)
constexpr uri_static(std::string_view src);                          // (9)
constexpr uri_static(const char *src);                               // (10)
constexpr uri_static();                                              // (11)
  1. Construct a basic_uri from a std::string_view. This base class does not store the string. Calls parse(). The source string must outlive this object. If parsing fails, you can check for error using operator bool or count() and then get_error() for more info. Since this constructor takes a std::string_view you can declare objects constexpr. Note that std::string converts implicitly to std::string_view.
  2. Construct a basic_uri that has the corresponding bitset passed in bits. No components are present. Permits object to be used as a component bitset.
  3. Construct an empty basic_uri. It can be populated using assign().
  4. Construct a uri from a std::string. Calls parse(). The supplied string is moved or copied and stored by the object. You can check for error using operator bool or count() and then get_error() for more info.
  5. Construct a uri from a std::string_view. Creates a std::string from src and delegates to (4).
  6. Construct a uri from a null terminated const char *. Creates a std::string from src and delegates to (4).
  7. Construct an empty uri. It can be populated using replace().
  8. Construct a uri_static from a std::string. The class is templated by the non-type parameter sz which sets the static size and maximum storage capacity for the uri. Calls parse().
  9. Construct a uri_static from a std::string_view. Creates a std::string from src and delegates to (8).
  10. Construct a uri_static from a null terminated const char *. Creates a std::string from src and delegates to (8).
  11. Construct an empty uri_static. The class is templated by the non-type parameter sz which sets the static size and maximum storage capacity for the uri.

All of uri is within the namespace FIX8.

dtor

constexpr ~basic_uri();
constexpr ~uri();
constexpr ~uri_static();

Destroy the uri or basic_uri. The uri and uri_static objects will release the stored string.

iv. Accessors

test

constexpr bool test(uri::component what) const;
template<uri::component what>
constexpr bool test() const;
template<component... comp>
constexpr int test_any() const;
template<component... comp>
constexpr int test_all() const;

Return true if the specified component is present in the uri. Passing countof returns true if any component is present. Use the template version if you know the component ahead of time. test_any and test_all test for multiple components in a single statement: test_any succeeds if any of the listed components is present, test_all only if all of them are. See the test case "test any/all range" for example use.
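
The presence tests can be pictured as plain bit operations on a small integer, one bit per component at its enum position. The following is a minimal sketch under that assumption; presence_sketch and bit_of are hypothetical names, and the bool return types are a simplification of the int returned by the library's test_any and test_all.

```cpp
#include <cstdint>

// Mirrors the component enum documented in the Types section.
enum component { scheme, authority, userinfo, user, password, host, port, path, query, fragment, countof };

// One bit per component, at the component's enum position.
constexpr std::uint16_t bit_of(component c) { return static_cast<std::uint16_t>(1u << c); }

// Hypothetical stand-in for the presence bitset held by a parsed uri.
struct presence_sketch
{
   std::uint16_t bits{};

   constexpr bool test(component c) const { return (bits & bit_of(c)) != 0; }

   // True if at least one of the listed components is present.
   template<component... comp>
   constexpr bool test_any() const { return (bits & (0u | ... | bit_of(comp))) != 0; }

   // True only if every listed component is present.
   template<component... comp>
   constexpr bool test_all() const
   {
      constexpr unsigned mask = (0u | ... | bit_of(comp));
      return (bits & mask) == mask;
   }
};
```

A uri with all ten components present has the bitset 0x3ff, which is the value shown by the uritest debug output in the Testing section.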

has_any

constexpr bool has_any() const;

Return true if any component is present.

has_[?]

constexpr bool has_[?component]() const;

Return true if the specified component (scheme, authority, userinfo, user, password, host, port, path, query, fragment) is present in the uri.

const uri u1{"https://www.hello.com:8080/"};
if (u1.has_port())
   .
   .
   .

get_component

constexpr std::string_view get_component(component what) const;
template<component what>
constexpr std::string_view get_component() const;

Return a std::string_view of the specified component. Returns an empty std::string_view if the component is not found or is not a legal component. Use the template version if you know the component ahead of time.

const uri u1{"https://www.hello.com:8080/"};
uri::component what{uri::host};
std::cout << u1.get_component(what) << '\n';
std::cout << u1.get_component<uri::host>() << '\n';

get_[?]

constexpr std::string_view get_[?component]() const;

Return a std::string_view of the specified component (scheme, authority, userinfo, user, password, host, port, path, query, fragment). Returns an empty std::string_view if not found or not a legal component.

const uri u1{"https://www.hello.com:8080/"};
std::cout << u1.get_host() << '\n';

get_present

constexpr uri_len_t get_present() const;

Return the present bitset as uri_len_t which has bits set corresponding to the component's enum position.

operator bool

constexpr operator bool() const;

Returns true if parsing was successful, false on fail.

get_error

constexpr error get_error() const;

Return the last uri::error error enum. If no error returns error::no_error. Use it to obtain the reason a uri failed to parse.

const operator[component]

constexpr const range_pair& operator[](component idx) const;

Return a const range_pair& which is a std::pair<uri_len_t, uri_len_t>& to the specified component at the index given in the ranges table. This provides read-only access to the offset and length of the specified component and is used to create a std::string_view.

Warning

This is not range checked.
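
To make the offset and length pair concrete, here is a minimal sketch of how a range_pair can be turned back into a std::string_view over the source. The helper view_of is a hypothetical name, not part of the library; the type aliases follow the Types section.

```cpp
#include <cstdint>
#include <string_view>
#include <utility>

using uri_len_t = std::uint16_t;
using range_pair = std::pair<uri_len_t, uri_len_t>; // {offset, length}

// Hypothetical helper: build a view of one component without copying.
constexpr std::string_view view_of(std::string_view source, const range_pair& rp)
{
   return source.substr(rp.first, rp.second);
}
```

For the uri used in the adhoc test of the Testing section, the host is reported at offset 22 with length 11, so view_of(source, {22, 11}) yields example.com. Storing two 16-bit values per component is what keeps the base object small.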

const at

template<component what>
constexpr const range_pair& at() const;

Return a const range_pair& which is a std::pair<uri_len_t, uri_len_t>& to the specified component at the component given as a template parameter. This provides read-only access to the offset and length of the specified component and is used to create a std::string_view. No copying, results point to uri source. Use this template version if you know the component ahead of time, otherwise use the subscript operator.

Warning

This is not range checked.

in_range

constexpr int in_range(std::string_view::size_type pos) const;

Return a bitset of all components that the given position in the uri lies within. You can use bitsum to test results. See the "in range" test case for example use.
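
As an illustration of the idea, the sketch below walks a table of offset and length pairs and sets one bit for every component whose range contains the position. in_range_sketch and example_ranges are hypothetical names; the table values are copied from the uritest debug output in the Testing section.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

enum component { scheme, authority, userinfo, user, password, host, port, path, query, fragment, countof };
using uri_len_t = std::uint16_t;
using range_pair = std::pair<uri_len_t, uri_len_t>; // {offset, length}

// Ranges as printed by "uritest -d" for
// https://user:password@example.com:3000/path?search=1&key=val&when=now#frag
constexpr std::array<range_pair, countof> example_ranges
{{
   {0, 5}, {8, 30}, {8, 13}, {8, 4}, {13, 8}, {22, 11}, {34, 4}, {38, 5}, {44, 25}, {70, 5},
}};

// Hypothetical sketch: one result bit per present component containing pos.
constexpr int in_range_sketch(const std::array<range_pair, countof>& ranges, int present, std::size_t pos)
{
   int result{};
   for (int ii{}; ii < countof; ++ii)
   {
      const range_pair& rp = ranges[ii];
      if ((present & (1 << ii)) && pos >= rp.first && pos < std::size_t{rp.first} + rp.second)
         result |= 1 << ii;
   }
   return result;
}
```

Position 22 is the first character of the host, which also lies inside the authority, so the result has exactly the authority and host bits set.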

decode_query

template<char separator='&',char tagequ='='>
constexpr query_result decode_query(bool sort=false) const;

Returns a std::vector of pairs of std::string_view of the query component if present. You can optionally override the value pair separator character using the first non-type template parameter - some queries use ;. You can also optionally override the value equality separator character using the second non-type template parameter (some queries use :). Pass true to optionally sort the query_result lexicographically by the key. No copying, results point to uri source. Returns an empty vector if no query was found. The query is assumed to be in the form:

&tag=value[&tag=value...]

Or if you override, say

;tag:value[;tag:value...]

If no value is present, just the tag will be populated with an empty value.
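
The splitting described above can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the library's implementation: decode_query_sketch is a hypothetical free function that takes the query text directly, and it skips empty items between separators.

```cpp
#include <algorithm>
#include <string_view>
#include <utility>
#include <vector>

using value_pair = std::pair<std::string_view, std::string_view>;
using query_result = std::vector<value_pair>;

// Hypothetical sketch: split "tag=value&tag=value" into views of the input.
template<char separator = '&', char tagequ = '='>
query_result decode_query_sketch(std::string_view query, bool sort = false)
{
   query_result result;
   while (!query.empty())
   {
      const auto end = query.find(separator); // npos when this is the last item
      const auto item = query.substr(0, end);
      if (!item.empty())
      {
         const auto eq = item.find(tagequ);
         if (eq == std::string_view::npos)
            result.emplace_back(item, std::string_view{}); // tag only, empty value
         else
            result.emplace_back(item.substr(0, eq), item.substr(eq + 1));
      }
      if (end == std::string_view::npos)
         break;
      query.remove_prefix(end + 1);
   }
   if (sort) // order by tag, keeping duplicate tags in their original order
      std::stable_sort(result.begin(), result.end(),
         [](const value_pair& a, const value_pair& b) { return a.first < b.first; });
   return result;
}
```

Every std::string_view in the result points into the caller's query text, so the result is only valid while that text is alive.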

find_query

static constexpr std::string_view find_query (std::string_view what, const query_result& from);

Find the specified query key and return its value from the given query_result. query_result must be sorted by key, as returned when passing true to decode_query or by calling sort_query first. If the key is not found, returns an empty std::string_view. No copying, results point to uri source. Complexity at most $2 \log_2(last - first) + O(1)$ comparisons.
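
The requirement for a sorted query_result follows from using a binary search. The sketch below assumes std::equal_range, whose comparison bound matches the logarithmic complexity quoted above; whether the library uses this exact algorithm is an assumption, and find_query_sketch is a hypothetical name.

```cpp
#include <algorithm>
#include <string_view>
#include <utility>
#include <vector>

using value_pair = std::pair<std::string_view, std::string_view>;
using query_result = std::vector<value_pair>;

// Hypothetical sketch: binary search by key; 'from' must be sorted by key.
inline std::string_view find_query_sketch(std::string_view what, const query_result& from)
{
   const auto by_key = [](const value_pair& a, const value_pair& b) { return a.first < b.first; };
   const auto rng = std::equal_range(from.begin(), from.end(), value_pair{what, {}}, by_key);
   return rng.first != rng.second ? rng.first->second : std::string_view{};
}
```

If the same key occurs more than once, this sketch returns the value of the first entry in the sorted range.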

decode_hex

static constexpr std::string decode_hex(std::string_view src, bool unreserved=false);
static constexpr std::string& decode_hex(std::string& result, bool unreserved=false); // in place decode

Decode any hex values present in the supplied string. Hex values are only recognised if they are in the form %XX where X is a hex digit (octet) [0-9a-fA-F]. By default all percent-encoded hex values are decoded. Return in a new string or in place. If unreserved is true only unreserved characters will be decoded (see is_unreserved()).
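
A minimal sketch of percent-decoding along these lines is shown below. It is an illustration, not the library's code: decode_hex_sketch, hex_value and is_unreserved_sketch are hypothetical names, and the unreserved set is taken from RFC 3986, sec 2.3 (letters, digits, '-', '.', '_' and '~').

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Value of one hex digit, or -1 if c is not a hex digit.
constexpr int hex_value(char c)
{
   if (c >= '0' && c <= '9') return c - '0';
   if (c >= 'a' && c <= 'f') return c - 'a' + 10;
   if (c >= 'A' && c <= 'F') return c - 'A' + 10;
   return -1;
}

// Unreserved set per RFC 3986, sec 2.3.
constexpr bool is_unreserved_sketch(char c)
{
   return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
      || c == '-' || c == '.' || c == '_' || c == '~';
}

// Hypothetical sketch: decode %XX; with unreserved=true only unreserved chars are decoded.
inline std::string decode_hex_sketch(std::string_view src, bool unreserved = false)
{
   std::string result;
   result.reserve(src.size());
   for (std::size_t pos{}; pos < src.size(); ++pos)
   {
      if (src[pos] == '%' && pos + 2 < src.size())
      {
         const int hi = hex_value(src[pos + 1]);
         const int lo = hex_value(src[pos + 2]);
         if (hi >= 0 && lo >= 0)
         {
            const char decoded = static_cast<char>((hi << 4) | lo);
            if (!unreserved || is_unreserved_sketch(decoded))
            {
               result += decoded;
               pos += 2; // skip the two hex digits just consumed
               continue;
            }
         }
      }
      result += src[pos]; // not a valid %XX sequence, or left encoded on purpose
   }
   return result;
}
```

Anything that is not a complete %XX sequence, such as a trailing % or %zz, is copied through unchanged.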

encode_hex

static constexpr std::string encode_hex(std::string_view src);

Percent-encode the supplied string, writing each encoded char in the form %XX where X is a hex digit (octet) [0-9a-fA-F]. Only chars that are reserved (see is_reserved()), whitespace or not printable are encoded. Returns a new encoded string.
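
The encoding rule can be sketched as follows. The names encode_hex_sketch and is_reserved_sketch are hypothetical, the reserved set is the gen-delims and sub-delims of RFC 3986, sec 2.2, and the choice of upper case hex digits is an assumption of this sketch.

```cpp
#include <string>
#include <string_view>

// Reserved set per RFC 3986, sec 2.2 (gen-delims and sub-delims).
constexpr bool is_reserved_sketch(char c)
{
   constexpr std::string_view reserved{":/?#[]@!$&'()*+,;="};
   return reserved.find(c) != std::string_view::npos;
}

// Hypothetical sketch: encode reserved, whitespace and non-printable chars as %XX.
inline std::string encode_hex_sketch(std::string_view src)
{
   constexpr char digits[] = "0123456789ABCDEF";
   std::string result;
   for (const char c : src)
   {
      const auto uc = static_cast<unsigned char>(c);
      if (is_reserved_sketch(c) || uc <= 0x20 || uc >= 0x7f)
      {
         result += '%';
         result += digits[uc >> 4];
         result += digits[uc & 0xf];
      }
      else
         result += c;
   }
   return result;
}
```

Applied to the string used in the factory example, "this path has embedded spaces", this sketch produces this%20path%20has%20embedded%20spaces. Because the slash is itself reserved, a path should be encoded one segment at a time, as that example does.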

is_unreserved

static constexpr bool is_unreserved(char c);

Return true if the given char is a member of the unreserved set as per RFC 3986, sec 2.3.

is_reserved

static constexpr bool is_reserved(char c);

Return true if the given char is a member of the reserved set as per RFC 3986, sec 2.2.

has_hex

static constexpr bool has_hex(std::string_view src);

Return true if any hex values are present in the supplied string. Hex values are only recognised if they are in the form %XX where X is a hex digit (octet) [0-9a-fA-F].

find_hex

static constexpr std::string_view::size_type find_hex(std::string_view src, std::string_view::size_type pos=0);

Return the position of the first hex value (if any) in the supplied string. Optionally supply the starting offset in pos. Hex values are only recognised if they are in the form %XX where X is a hex digit (octet) [0-9a-fA-F]. If not found returns std::string_view::npos.
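
A compact sketch of this search, with has_hex expressed in terms of it, is given below. Both function names carry a _sketch suffix to mark them as hypothetical illustrations rather than the library's code.

```cpp
#include <string_view>

constexpr bool is_hex_digit_sketch(char c)
{
   return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
}

// Hypothetical sketch: position of the first complete %XX at or after pos, else npos.
constexpr std::string_view::size_type find_hex_sketch(std::string_view src, std::string_view::size_type pos = 0)
{
   // A match needs three chars, so stop once fewer than three remain.
   for (; pos < src.size() && src.size() - pos >= 3; ++pos)
      if (src[pos] == '%' && is_hex_digit_sketch(src[pos + 1]) && is_hex_digit_sketch(src[pos + 2]))
         return pos;
   return std::string_view::npos;
}

constexpr bool has_hex_sketch(std::string_view src)
{
   return find_hex_sketch(src) != std::string_view::npos;
}
```

Since both functions are constexpr and work on views, they can be evaluated at compile time, for example find_hex_sketch("a%20b") evaluates to 1.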

find_port

static constexpr std::string_view find_port(std::string_view what);

Return the default port as a std::string_view for the given scheme. For example, will return 80 if given http. Uses private member _default_ports which contains pairs of scheme/ports.
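
A lookup of this kind can be sketched with a small constexpr table. The table below is a hypothetical four-entry subset chosen for illustration (the library's _default_ports contents are not reproduced here), and the linear search is an assumption about how such a table might be consulted.

```cpp
#include <array>
#include <string_view>
#include <utility>

using port_pair = std::pair<std::string_view, std::string_view>; // {scheme, port}

// Hypothetical subset of scheme/default-port pairs.
constexpr std::array<port_pair, 4> default_ports_sketch
{{
   { "ftp", "21" }, { "http", "80" }, { "https", "443" }, { "ssh", "22" },
}};

// Hypothetical sketch: return the default port for a scheme, or empty if unknown.
constexpr std::string_view find_port_sketch(std::string_view scheme)
{
   for (const auto& pp : default_ports_sketch)
      if (pp.first == scheme)
         return pp.second;
   return {};
}
```

Returning the port as text rather than as a number lets it be compared directly with the port component of a parsed uri.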

decode_segments

constexpr segments decode_segments(bool filter=true) const;

Returns a std::vector of segments as std::string_view of the path component if present. If filter is true (default) remove ./ segments if found. Returns an empty vector if no path was found.
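
Segment decoding can be sketched as a split on '/'. The function below is a hypothetical illustration that takes the path text directly; treating the filter as "drop segments equal to a single dot" and keeping empty segments between consecutive slashes are both assumptions of this sketch.

```cpp
#include <string_view>
#include <vector>

using segments = std::vector<std::string_view>;

// Hypothetical sketch: split a path into segment views, optionally dropping "." segments.
inline segments decode_segments_sketch(std::string_view path, bool filter = true)
{
   segments result;
   if (!path.empty() && path.front() == '/')
      path.remove_prefix(1); // the leading slash does not start a segment
   while (!path.empty())
   {
      const auto end = path.find('/'); // npos when this is the last segment
      const auto seg = path.substr(0, end);
      if (!(filter && seg == "."))
         result.push_back(seg);
      if (end == std::string_view::npos)
         break;
      path.remove_prefix(end + 1);
   }
   return result;
}
```

Segments are returned exactly as they appear in the source, so a percent-encoded segment stays encoded, as in the factory example output.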

normalize_str

static constexpr std::string normalize_str(std::string_view src);

Normalize the given string as per RFC 3986, sec 6. The normalizations done are only those that preserve the original semantics. These are:

  1. Convert scheme => lower case
  2. Convert host => lower case
  3. Convert %hex => upper case
  4. Decode unreserved hex
  5. Remove dot segments (sec 5.2.4)
  6. Convert empty path to "/"

Returns a std::string of the new normalized string or the same string if no normalizations possible.
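
Of the steps listed, removing dot segments is the least obvious. The sketch below follows the algorithm given in RFC 3986, sec 5.2.4 and applies it to a path only; remove_dot_segments_sketch and its helpers are hypothetical names and this is not the library's code.

```cpp
#include <string>
#include <string_view>

namespace sketch
{
   inline bool starts_with(std::string_view in, std::string_view prefix)
   {
      return in.substr(0, prefix.size()) == prefix;
   }

   // Remove the last segment of out together with its preceding '/', if any.
   inline void pop_last(std::string& out)
   {
      const auto pos = out.rfind('/');
      out.erase(pos == std::string::npos ? 0 : pos);
   }
}

// Hypothetical sketch of RFC 3986, sec 5.2.4 (steps 2A to 2E).
inline std::string remove_dot_segments_sketch(std::string_view in)
{
   using namespace sketch;
   std::string out;
   while (!in.empty())
   {
      if (starts_with(in, "../")) in.remove_prefix(3);                         // 2A
      else if (starts_with(in, "./")) in.remove_prefix(2);                     // 2A
      else if (starts_with(in, "/./")) in.remove_prefix(2);                    // 2B
      else if (in == "/.") in = "/";                                           // 2B
      else if (starts_with(in, "/../")) { in.remove_prefix(3); pop_last(out); } // 2C
      else if (in == "/..") { in = "/"; pop_last(out); }                       // 2C
      else if (in == "." || in == "..") in = {};                               // 2D
      else                                                                     // 2E
      {
         // Move the first segment, including its leading '/', to the output.
         const auto next = in.find('/', 1);
         out.append(in.substr(0, next));
         in.remove_prefix(next == std::string_view::npos ? in.size() : next);
      }
   }
   return out;
}
```

With the two examples from the RFC, "/a/b/c/./../../g" becomes "/a/g" and "mid/content=5/../6" becomes "mid/6".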

normalize_http_str

static constexpr std::string normalize_http_str(std::string_view src);

Normalize the given string as per RFC 3986, sec 6, as in normalize_str() above. In addition the following normalizations are done:

  1. Remove default port (http and https only)

Returns a std::string of the new normalized string or the same string if no normalizations possible.

normalize

constexpr std::string normalize();

Same as normalize_str above but operates on the source string in the uri object. Returns the original string and updates the current object with the new normalized string.

normalize_http

constexpr std::string normalize_http();

Same as normalize_http_str above but operates on the source string in the uri object. Returns the original string and updates the current object with the new normalized string. The following example demonstrates the use and results of normalize_http (no exception thrown):

#include <iostream>
#include <exception>
#include <stdexcept>
#include <array>
#include <fix8/uri.hpp>
using namespace FIX8;

int main(void)
{
   using namespace std::literals;
   static constexpr std::array uris
   {
      "https://www.test.com/"sv, // all should normalize to this one
      "https://www.test.com"sv,
      "https://www.test.com:/"sv,
      "https://www.test.com:443/"sv,
   };
   try
   {
      for (const auto pp : uris)
      {
         uri u1{pp};
         u1.normalize_http();
         if (u1.get_uri() != uris[0])
            throw std::logic_error("http normalization failure");
      }
   }
   catch(const std::exception& e)
   {
      std::cerr << e.what() << '\n';
   }
   return 0;
}

get_name

static constexpr std::string_view get_name(component what);
template<component what>
static constexpr std::string_view get_name();

Return a std::string_view of the specified component name. Returns an empty std::string_view if not found or not a legal component. Use the template version if you know the component ahead of time.

get_uri

constexpr std::string_view get_uri() const;

Return a std::string_view of the source uri. If not set return value will be empty.

count

constexpr int count() const;

Return the count of components in the uri.

operator<<

friend std::ostream& operator<<(std::ostream& os, const basic_uri& what);

Print the uri object to the specified stream. The source and individual components are printed. If a query is present, each tag value pair is printed; if a path is present, each segment value is also printed.

operator==

friend constexpr bool operator==(const basic_uri& lhs, const basic_uri& rhs);
friend constexpr bool operator==(const uri& lhs, const uri& rhs);
template<size_t sz>
friend static constexpr bool uri_static<sz>::operator==(const uri_static& lhs, const uri_static& rhs);

Equivalence operators for basic_uri, uri and uri_static. These are implemented as follows:

  1. basic_uri - return true if the source uri strings are identical
  2. uri, uri_static - return true if the normalized source uri strings are identical

operator%

friend constexpr bool operator%(const uri& lhs, const uri& rhs);
template<size_t sz>
friend static constexpr bool uri_static<sz>::operator%(const uri_static& lhs, const uri_static& rhs);

Equivalence operators for http protocol for uri and uri_static. Return true if the normalized_http uri strings are identical.

get_buffer

constexpr const std::string& get_buffer() const;

Return a const std::string& to the stored buffer. Only available from uri.

has_any_authority

constexpr bool has_any_authority() const;

Returns true if any authority components are present. This means any one of host, password, port, user or userinfo.

has_any_userinfo

constexpr bool has_any_userinfo() const;

Returns true if any userinfo components are present. This means any one of user or password.

buffer

constexpr std::string_view buffer() const;

Returns a std::string_view of the current buffer. Available on all uri objects except basic_uri.

max_storage

static constexpr uri_len_t max_storage();

Returns the maximum storage available for all uri objects except basic_uri. For uri will return uri_max_len, for uri_static<sz> will return the sz parameter.

v. Mutators

set

constexpr void set(uri::component what);
template<uri::component what>
constexpr void set();
template<component... comp>
constexpr void set_all();

Set the specified component bit as present in the uri. Passing uri::countof sets all bits. Use the template version if you know the bit ahead of time. Use carefully. set_all can be used to set multiple components in a single statement. As above, use the template version if you know the component ahead of time. See the test case "clear/set all range" for example use.

clear

constexpr void uri::clear(uri::component what);
template<uri::component what>
constexpr void clear();
template<component... comp>
constexpr void clear_all();

Clear the specified component bit in the uri. Passing uri::countof clears all bits. Use the template version if you know the bit ahead of time. Use carefully. clear_all can be used to clear multiple components in a single statement. As above, use the template version if you know the component ahead of time. See the test case "clear/set all range" for example use.

assign

constexpr int assign(std::string_view src);

Replace the current uri reference with the given reference. No storage is allocated. Return the number of components found.

replace

constexpr std::string replace(std::string&& src);

Replace the current uri with the given string. The storage is updated with a move (or copy) of the string. The old string is returned.

set_error

constexpr void set_error(error what);

Set the last uri::error error to the error given. Setting an error is destructive and renders the uri unusable.

operator[component]

constexpr range_pair& operator[](component idx);

Return a range_pair& which is a std::pair<uri_len_t, uri_len_t>& to the specified component at the index given in the ranges table. This provides direct access to the offset and length of the specified component and is used to create a std::string_view.

Warning

This is not range checked. Allows for modification of the string_view range. Use carefully.

at

template<component what>
constexpr range_pair& at();

Return a range_pair& which is a std::pair<uri_len_t, uri_len_t>& to the specified component at the component given as a template parameter. This provides direct access to the offset and length of the specified component and is used to create a std::string_view. Use this template version if you know the component ahead of time, otherwise use the subscript operator.

uri u1{"https://www.hello.com:8080/"};
auto& rp{u1.at<uri::host>()};
std::cout << rp.first << ' ' << rp.second << '\n';

Warning

This is not range checked. Allows for modification of the string_view range. Use carefully.

parse

constexpr int parse();

Parse the source string into components. Return the count of components found. Will reset a uri if already parsed. You can check for error using get_error() for more info.

sort_query

static constexpr void sort_query(query_result& query);

Sort the supplied query alphanumerically based on the tag in the query value pair. As a comparison sort, this takes on the order of $N \log N$ comparisons for $N$ pairs.

vi. Generation and editing

factory

static constexpr uri uri::factory(std::initializer_list<comp_pair> from);
template<size_t sz>
static constexpr uri_static<sz> uri_static<sz>::factory(std::initializer_list<comp_pair> from);

Create a uri from the supplied components. The initializer_list contains 1..n comp_pair objects. The following constraints apply:

  1. If authority is supplied and any of the following components are present host, password, port, user or userinfo then authority is ignored;
  2. If userinfo is supplied and any of the following components are present user or password then userinfo is ignored;
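
The two constraints can be illustrated with a sketch that assembles a uri string from component pairs. make_uri_sketch is a hypothetical name and this is not the library's factory; in particular, treating an empty value as "component absent" is an assumption borrowed from the edit example, where an empty user removes the user.

```cpp
#include <array>
#include <initializer_list>
#include <string>
#include <string_view>
#include <utility>

enum component { scheme, authority, userinfo, user, password, host, port, path, query, fragment, countof };
using comp_pair = std::pair<component, std::string_view>;

// Hypothetical sketch: build a uri string, applying the two constraints above.
inline std::string make_uri_sketch(std::initializer_list<comp_pair> from)
{
   std::array<std::string_view, countof> comp{};
   std::array<bool, countof> has{};
   for (const auto& [what, value] : from)
      if (what < countof)
      {
         comp[what] = value;
         has[what] = !value.empty(); // assumption: empty value means absent
      }

   // 1. authority is ignored when any of its parts is supplied separately
   if (has[authority] && (has[host] || has[password] || has[port] || has[user] || has[userinfo]))
      has[authority] = false;
   // 2. userinfo is ignored when user or password is supplied separately
   if (has[userinfo] && (has[user] || has[password]))
      has[userinfo] = false;

   std::string result;
   if (has[scheme])
      result.append(comp[scheme]).append(":");
   if (has[authority])
      result.append("//").append(comp[authority]);
   else if (has[host] || has[user] || has[password] || has[userinfo] || has[port])
   {
      result += "//";
      if (has[userinfo])
         result.append(comp[userinfo]).append("@");
      else if (has[user] || has[password])
      {
         if (has[user]) result += comp[user];
         if (has[password]) result.append(":").append(comp[password]);
         result += '@';
      }
      if (has[host]) result += comp[host];
      if (has[port]) result.append(":").append(comp[port]);
   }
   if (has[path]) result += comp[path];
   if (has[query]) result.append("?").append(comp[query]);
   if (has[fragment]) result.append("#").append(comp[fragment]);
   return result;
}
```

Given scheme https, user dakka, host www.blah.com, port 3000 and path /foo/test, the sketch yields https://dakka@www.blah.com:3000/foo/test, the same shape as the factory example output.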

format

template<typename... Args>
static constexpr uri format(std::format_string<Args...> fmt, Args&&... args);

Create a uri from the supplied format string and arguments. See std::format for more on how to use this function. A uri will be created from the resulting string. See above for example usage.

edit

constexpr int edit(std::initializer_list<comp_pair> from);

Modify an existing uri by replacing existing components with the supplied components. Components not specified are left unchanged. The initializer_list contains 1..n comp_pair objects. The same constraints as factory apply.

make_uri

static constexpr std::string make_uri(std::initializer_list<comp_pair> from);

Construct a std::string representation of a uri from the supplied components. The initializer_list contains 1..n comp_pair objects. The same constraints as factory apply.

5. Testing

Test cases

The header file uriexamples.hpp contains a data structure holding the test cases used by the Catch2 unit test app unittests and by the CLI test app uritest. You can add your own test cases to uriexamples.hpp - the structure is easy enough to follow.

sample

const std::vector<std::pair<const char *, std::vector<std::pair<uri::component, const char *>>>> tests
{
   { "https://www.blah.com/",
      {
         { scheme, "https" },
         { authority, "www.blah.com" },
         { host, "www.blah.com" },
         { path, "/" },
      }
   },
   { "https://www.blah.com",
      {
         { scheme, "https" },
         { authority, "www.blah.com" },
         { host, "www.blah.com" },
         { path, "" }, // empty path
      }
   },
   { "https://www.blah.com:3000/test",
      {
         { scheme, "https" },
         { authority, "www.blah.com:3000" },
         { host, "www.blah.com" },
         { port, "3000" },
         { path, "/test" },
      }
   },
   { "https://dakka@www.blah.com:3000/",
      {
         { scheme, "https" },
         { authority, "dakka@www.blah.com:3000" },
         { user, "dakka" },
         { host, "www.blah.com" },
         { port, "3000" },
         { path, "/" },
      }
   },
.
.
.

unittests

This application is run by default if you run make test or ctest. When running using ctest use the following command:

$ ctest --output-on-failure

Review of the test cases in unittests.cpp will provide more insight into using the API.

uritest

This is a simple CLI test app which allows you to run individual or all tests from uriexamples.hpp, or test a uri passed from the command line.

$ ./uritest -h
Usage: ./uritest [uri...] [-t:T:d:hlasxf:]
 -a parse all examples (default)
 -d [uri] parse uri from CLI, show debug output
 -h help
 -l list tests
 -s show sizes
 -f [file] read and dump from file
 -T [num] static test to run
 -t [num] test to run
 -x special tests
$

adhoc test

You can run an ad hoc test from the CLI as follows:

output

$ ./uritest -d "https://user:password@example.com:3000/path?search=1&key=val&when=now#frag"
uri         https://user:password@example.com:3000/path?search=1&key=val&when=now#frag
scheme      https
authority   user:password@example.com:3000
userinfo    user:password
user        user
password    password
host        example.com
port        3000
path        /path
query       search=1&key=val&when=now
   search      1
   key         val
   when        now
fragment    frag

bitset 1111111111 (0x3ff)
scheme 0 (5)
authority 8 (30)
userinfo 8 (13)
user 8 (4)
password 13 (8)
host 22 (11)
port 34 (4)
path 38 (5)
query 44 (25)
fragment 70 (5)
$

read from file

You can also read uris from a file and print out the parsed results. This example reads from the supplied basiclist.hpp:

output

$ ./uritest -f ../examples/basiclist.hpp
uri         https://telegraph.co.uk/index.html
scheme      https
authority   telegraph.co.uk
host        telegraph.co.uk
path        /index.html

uri         https://bp.blogspot.com/index.html
scheme      https
authority   bp.blogspot.com
host        bp.blogspot.com
path        /index.html
.
.
.
uri         https://android.com/index.html
scheme      https
authority   android.com
host        android.com
path        /index.html

uri         https://blog.me/index.html"
scheme      https
authority   blog.me
host        blog.me
path        /index.html"

1000 uri(s) read from ../examples/basiclist.hpp
$

benchmarks

We use the Criterion benchmarking library. The benchmark app is built by default. The file basiclist.hpp contains 1000 generic URIs. The benchmark creates 1000 basic_uri, uri and uri_static objects and measures the total time taken, from which we calculate the average time to decode each URI (see footnote 1).

[Image: benchmark results]

From the above results we can see the following average performance per URI:

Class        Decode (ns)
basic_uri    52
uri          90
uri_static   90

6. Discussion

i. Non-validating

This class is non-validating. The source URI is expected to be normalised or at least parsable. This library provides normalization functions which you can apply to your source uri strings before construction.

Validation is out of scope for this implementation. We decided against validating for a few reasons:

  1. Performance - validating is expensive; most URIs are generally parsable
  2. Complexity - validation rules are complicated; for most use cases, simple rejection for gross rule violation is sufficient. See URL Standard for complete validation rules.

ii. Low level access

There are methods that provide direct access to the range table and component bitsets. You must ensure that you don't pass an invalid component index when using these. Making changes to the range object with operator[] can have serious consequences. Use carefully.

  1. constexpr range_pair& operator[](component idx);
  2. constexpr void set(component what);
  3. template<component what> constexpr void set();
  4. constexpr void clear(component what);
  5. template<component what> constexpr void clear();

iii. Sanity checking

This class performs basic sanity checks on the source URI and refuses to continue parsing if any check fails. You can test for failure using operator bool. The checks are:

  1. Length - source must not exceed uri_max_len (UINT16_MAX)
  2. Illegal chars - source must not contain any whitespace characters

iv. Performance

This class performs well, with minimal latency. Since there is no copying of strings or sub-strings, the decoding functionality in basic_uri uses minimal cycles - especially for applications that can manage the storage of the source string themselves. The memory footprint of basic_uri is 64 bytes and will fit in a cache-line.

  • If storage of the source is needed, uri performs a single string copy (or move), and aside from that will have the same performance as basic_uri.

  • If you need to store the source URI but wish to avoid using dynamic memory, use uri_static (for example for including in another container). This ensures a single allocation for the entire object. For most purposes (and excluding edits) a statically stored URI is the most efficient storage option. Be aware that the template parameter sz must be large enough for any URI you wish to store and of course objects created with different templated sizes will be different types.

  • The factory and edit methods involve more copying, although even these still use std::string_view where possible, with actual copying of strings or sub-strings occurring at most once.

  • With all methods constexpr and noexcept, no virtual methods, header only and optimisation enabled, your compiler should be able to optimise your code most efficiently.

  • If you want to reduce the size of basic_uri further, you can change:

using uri_len_t = std::uint16_t;

to:

using uri_len_t = std::uint8_t;

This will limit the maximum length of a URI to 255 bytes, but reduce the overall storage needed for basic_uri from 64 to 40 bytes.

Footnotes

  1. Ubuntu 23.10, 12 core 4.7GHz Intel i7 Cometlake Processors, 15.3GB RAM; gcc-13.2.0