uuid-rs / uuid

Generate and parse UUIDs.

Home Page:https://www.crates.io/crates/uuid

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

UUIDv6–v8 Support

XAMPPRocky opened this issue · comments

Is your feature request related to a problem? Please describe.
There is a new draft RFC for UUID formats that are more database friendly, it would be nice if these were supported in the uuid crate, since the new formats share a lot of the same internals.

Describe the solution you'd like
An implementation of UUIDv6, v7, and v8 from the IETF RFC.

Is it blocking?
No, it would just be cool to have.


EDIT (@KodrAus): The RFC is being worked on here

Looks like I've got some reading to do 👀 These are basically ULIDs are they?

I hadn't seen ULID before, but looking at the project it seems very similar, UUID has more variety in what kind of timestamp you can use.

Today I ran across a project that collects all UUIDv6 prototypes. It lacks Rust implementation :) If you add UUIDv6 support to your library, it would be great

https://github.com/uuid6/prototypes/

GitHub
Draft Prototypes and Tests for UUIDv6 and beyond. Contribute to uuid6/prototypes development by creating an account on GitHub.

I hadn't seen ULID before, but looking at the project it seems very similar, UUID has more variety in what kind of timestamp you can use.

Looks like I've got some reading to do 👀 These are basically ULIDs are they?

As I understand, UUIDv6 can be sorted and the sorted list would be in the generated time order that make UUIDv6 much more desirable for using as primary key or alike in databases

Thanks @XAMPPRocky and @VladimirMarkelov. I'll spend some time unpicking the spec to see if it would make sense to share our v1 timestamp APIs with v6 or whether there are changes we should make to unify them.

Ok, based on a really quick reading it looks like our Timestamp and ClockSequence should already be suitable at least for v6 and v7 UUIDs. It already can convert between the original Gregorian format and a standard Unix epoch. I haven't looked at v8 just yet.

Do we have any experience so far from similar implementations on what a good ratio of subsecond precision / clock sequence / random data we should target for v7 UUIDs? We already have access to 32bits (possibly low precision though) of subsecond data and 16bits for a clock sequence. If we used all that I think it would leave us with 38bits for random data, which seems a little low.

This will be great to have!

Do we have any experience so far from similar implementations on what a good ratio of subsecond precision / clock sequence / random data we should target for v7 UUIDs?

Would it be possible to make it user-configurable?

Hey @coreh 👋 Are you currently looking at using these new UUID variants? If so do you have any constraints or experiences to share?

We could look at making this configuration for sure, but whether or not we do that unconditionally (in something like Uuid::new_v7) or as an advanced API (in the Builder) would depend on how important they are to tweak. It might also be possible to set a precision on the clock sequence too so it tells the UUID how many bits it wants.

Yeah, I'm currently looking at using them for a hobby project (very early, in concept stage). My use case is to have a bunch of (untrusted and sandboxed) WASM applications running in a mesh network of hosts. Resources are identified by UUIDs, "owned" by the different applications, and are shared and accessible (via message passing) throughout the entire network. Applications can also be "migrated" between hosts.

My idea is to use the the highest amount of bits possible from the subsec_seq_node portion to uniquely identify applications for message routing (so that nodes can query other nodes for who's running that application) and to avoid giving the ability for applications to "bruteforce" or "scan" for the presence of other applications in the mesh network. I suspect most apps will create very few resources (say < 100,000) so this arrangement will probably work fine. However, I also wanted to keep the door open for some resource-heavy apps (that might need to create millions or billions of resources, quickly) to request a different allocation scheme with a shorter node portion in subsec_seq_node and more time/pseudorandom bits.

I also suspect (but I'm not entirely sure) the big endian, lexicographical sorting aspect of UUIDv7 will help with compression over the wire, especially when serializing/deserializing resource state when applications are migrated between nodes.

Ah I see 🤔 I was hoping we could use the same clock infrastructure we have for v1 UUIDs for this too (we already expose APIs for working with Unix epoch dates). Maybe they could determine the precision somehow so we can fill the rest with random data.

I saw the RFC got pushed back to next year. I don’t think I’ve ever looked at an IETF RFC that was in draft before so am not sure how normal that is 🙂

I saw the RFC got pushed back to next year. I don’t think I’ve ever looked at an IETF RFC that was in draft before so am not sure how normal that is

FWIW I've not been involved with IETF, but just from my experience implementing the RFCs, there's plenty of tech in production that lives in the draft or proposed RFC space for a long time, it more represents how much change there could be. For example; HTTP/2 (RFC 7540) and QUIC (RFC 9000) are still considered proposed standards by the IETF despite everyone using them.

The new UUID formats are being discussed in this github repository: https://github.com/uuid6/uuid6-ietf-draft

GitHub
UUID version 6 IETF draft. Contribute to uuid6/uuid6-ietf-draft development by creating an account on GitHub.

HTTP/2 (RFC 7540) and QUIC (RFC 9000) are still considered proposed standards by the IETF despite everyone using them.

Wow! Ok there you go.

The new UUID formats are being discussed in this github repository

Ah thanks @fabiolimace! 👀

I'll sketch this out on a branch and we can decide whether or not we want to ship them under a v6-draft / v7-draft / v8-draft feature so people can use them, with the caveat that they may change based on the draft spec.

So we currently have access to 96bits of timestamp (64bits of timestamp and 32bits of nanos) Timestamp::from_unix. I'm thinking if it's possible we should flip the internal representation of Timestamp to keep all that information (so it'll become 16bits bigger), and then introduce a TimestampPrecision type that's added to ClockSequence and can be used to truncate a timestamp.

Maybe something like:

pub struct Precision(TimestampPrecision, SequencePrecision);

pub struct TimestampPrecision(u8);

impl TimestampPrecision {
    pub const SECONDS: Self = Self(0);
    pub const MILLIS: Self = Self(3);
    ..

    pub fn from_places(places: u8) -> Self { .. }
}

pub struct SequencePrecision(u8);

impl SequencePrecision {
    pub const V1: Self = Self(14);

    pub fn from_bits(bits: u8) -> Self { .. }
}

pub trait ClockSequence {
    fn precision(&self) -> Precision { Precision(TimestampPrecision::MILLIS, SequencePrecision::V1) }

    fn generate_sequence(&self, seconds: u64, subsec_nanos: u32) -> u16;
}

impl Uuid {
    // Without precision we'll assume 96(?)bits of timestamp and node are for the timestamp
    // This can produce some jittery results beyond the actual precision of the timestamp stored
    fn to_timestamp_with_precision(&self, precision: Precision) -> Timestamp { .. }
}

impl Uuid {
    // Any truncated portion of the timestamp to the end of the UUID will be filled with random data
    // The precision used to generate the timestamp will determine when it truncates
    pub fn new_v7(ts: Timestamp) -> Result<Self, crate::Error> { .. }
}

The idea being we can do this without any breakage to v1 but without also introducing a lot of new very similar infrastructure for v7 and v8.

What do you all think?

I see the reference Python implementation has these defaults for v7:

sec_bits = 36 # unixts at second precision
subsec_bits = 30 # Enough to represent NS
sequence_bits = 8 # Enough for 256 UUIDs per NS

We could probably use the same.

We're working towards a 1.0 release of uuid now. I'm confident we'll be able to support v6-v8 UUIDs without any disruptive breakage so I don't think this needs to block that.

I think the sketch I made above is a bit unnecessary. All we really need is a way to retain all the precision in Timestamp for a Unix timestamp, which we can do.

Is anyone working on this ? I could make a PR for v7.

Hi @malobre 👋 There isn't any active work on this right now. I haven't been following the RFC too closely, but was under the impression that things were changing in it so figured I'd check back in with it later.

If you'd like to work on some v7 support we can work out how to ship it, either in uuid as an unstable API, or externally, so we can keep up with the RFC as it evolves if necessary.

I have forked this repo and started to play with things on the v7 branch.

I will be putting any public facing additions behind the v7 feature and uuid_unstable cfg.

Working with v1::Timestamp is a little awkward for UUIDv7, i.e: we want milliseconds since unix epoch, but the public api (e.g: Timestamp::from_unix taking seconds & nanos instead of ms) and internal representation are tailored for UUIDv1 (number of 100-nanosecond intervals since gregorian epoch).

It makes sense to share the struct between v1 & v6, but for v7 not so much.
Should I define a separate v7::Timestamp struct or should I try to make it work with the current v1::Timestamp ?

Maybe this is something to take into account

Super excited to have uuid v7 support. @malobre does your fork currently work?

@malobre does your fork currently work?

No, I only made some groundwork as I intended to clarify some things beforehand.

In the meantime, here's a snippet for single-node, non-batched, UUIDv7 generation:

let uuid = {
    use std::time::{SystemTime, UNIX_EPOCH};

    let mut buf = rand::random::<u128>() & 0xFFF3FFFFFFFFFFFFFFF;

    // 48 bits unix timestamp in ms
    buf |= SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("SystemTime before UNIX Epoch")
        .as_millis() << 80;

    // version
    buf |= 0x7 << 76;

    // variant
    buf |= 0b10 << 62;
    
    uuid::Uuid::from_u128(buf)
};

Edit: I thought I should really emphasize that you should NOT use the above snippet if you intend to generate UUIDs in a distributed fashion or in batches.

FYI, I stumbled upon a rust implementation: https://github.com/LiosK/uuid7-rs

I believe the main challenge will be integrating UUIDv7 with the existing Timestamp struct rather than the implementation itself.

The Timestamp type is pretty opaque, so I think we should be able to come up with a scheme that can support the various shaped timestamps in the new formats. If need be we could also come up with an alternative type and provide conversions between them.

FYI, I am not currently able to spend much time on this (or open-source in general). If someone want to take a stab at this, please do :-)

@KodrAus - Regarding your sketch for *Precision, it seems to me that the only Uuid type that needs to support an arbitrary precision for time and sequence is v8, and I have doubts about that.

It seems to me that introducing Time and Sequence Precisions introduce a fair amount of complexity for only 2 choices for precision. V6 is already implemented, and V7 is always fixed at milliseconds, so there isn't much of a reason to supply a variable-precision timestamp. In addition, I don't think we can manage construction generically for the different uuid types, consdering that v1 and v6 use the funky ticks offset.

For v8, to be spec-compliant, we don't actually need to provide any functionality for v8 other than accept a 16 byte buffer that we distribute into Bytes at the correct offsets.

I started implementing V8 helper types (e.g. construct a new v8 Uuid from now()) but, frankly, there are an awful lot of variables that would be hard to account for:

  1. Distributed Differentiator: Most Uuids will still want a host designation and/or randomness. In order to ensure that a Uuid is truly unique across a distributed system, you need some differentiator other than clock + sequence.
  2. When to start the Epoch? Why start IDs at seconds since Unix Epoch, since we'd be dedicating 32 bits of information to a number range that we'll never use. We could start the epoch on Jan 1 of 2022, provide 32 bits, and have capacity for the next 120 years.
  3. How to distribute the rest of the things?

I guess my point is that V8 is left blank for a reason, and if we took a guess at what people wanted, we'd probably be overlapping quite a bit with the v7 use case.

So I'm proposing that we...

  1. Change Timestamp per your suggestion, make it store 64bit secs and 32bit nanos (sequence is irrelevent, yes?)
  2. (Re)implement v1, v6, and v7 to apply their Version-specific things to the supplied timestamp upon construction.
  3. Create a v8 whose only constructor is basically from_buffer([u8; 16]) and maybe from_fields(u64, u16, u64) and leave the rest up to the user.

If you're cool with this, I'm happy to implement it.

Then, in the future..

If we want to add a really fancy UuidV8Builder which can take all of the above into consideration (e.g.

   let foo = Builder::new()
   .set_random_bits()
   .set_seconds_bits()
   .set_frac_bits()
   .set_frac_divisor()
   .set_hostname()
   .ignore_unused_bits()
   .build();

Oh, also, the v7 spec sort of punts on the counter/sequent and how to do monotonicity. Its says that 1000 Uuids created in a batch should be sorted in order. It doesn't say how, but it does offer suggestions.
I think the best way would be to apply a 10 bit (or so) counter to most significant bits of the random number portion of the random bits. So instead of 74 bits of randomness, it'd be 64. (or to fit more neatly into the bit segments.. 12 and 62)

After implementing the changes I mentioned above. The API will change because I've separated Counter from Timestamp (which I think is the correct approach). A Timestamp shouldn't know anything about a Counter, but a counter could/should know about a Timestamp.

Also there are some legacy things in here, such as being edition 2018 instead of 2021. Also, it uses the atomic crate but the reason for it is no longer valid, since std::sync::atomic now has those types that we need.

Also, std::time is standardized, so we can add helper functions based on SystemTime (not that this would be a breaking change, we would just add new methods)

IMO, we should bump the version to 2.0 and modernize things a bit more.

Thoughts?

Another fun side-effect of the added versions/features:

the cargo hack powerset tests fail by running the device out of space. Too many tests now with the 3 added version features.

Thanks for all the work you put in to this @rrichardson! 🙇 I'm glad the RFC moved away from the original variable-precision timestamp they had for v7 to a much simpler milliseconds-since-unix-epoch. I've papered over some of the issues with Timestamp and ClockSequence for these new formats by adding short-hands for generating UUIDs from the current system time. So instead of having to call Uuid::new_v7(Timestamp::now(NullSequence)), you can call Uuid::now_v7(). The same goes for version 1 and 6 UUIDs.

Before we release this, I would like to gate it behind our RUSTFLAGS cfg for unstable things for the time being, just to be sure the formats are pretty well fixed now. As I understand it, the scope of the draft has now been extended to include rewriting RFC4122, so I don't expect things to change much, but don't know enough about that process to know when things reach their steady-state.

The first release on crates.io that supports v6-v8 UUIDs is published as 1.2.0. Until we’re confident the spec is fixed enough that our APIs won’t need to change you need to set the --cfg uuid_unstable RUSTFLAGS as well as their feature gates to enable them. Once we are confident the RUSTFLAGS requirement will go away. I’m expecting maybe the first draft of the new RFC with its broader scope to rewrite all of RFC4122 (I think that has a deadline around March of 2023?) could be that point if things don’t change much in the meantime.

I'll update the OP to include a link, but it looks like the working group responsible for the new UUID RFC is working here

cc @KodrAus

--cfg uuid_unstable is very much unusable for workspaces. Workspaces ignore config files from each project therefore invalidating configs like:

[build]
rustflags = ["--cfg uuid_unstable"]

Isn't there an alternative way to allow using those features other than manually setting this?

@ConsoleTVs You can put rustflags in a top-level .cargo/config.toml to have them respected by all crates in your workspace:

[build]
rustflags = ["--cfg", "uuid_unstable"]

As soon as we're confident that these new versions have a stable standard we'll be able to remove that required --cfg.

Ah, had tried it before but didn't work. What's the difference between those two?

rustflags = ["--cfg uuid_unstable"]
rustflags = ["--cfg", "uuid_unstable"]

I think —cfg uuid_unstable will be passed to rustc as a single commandline argument, but what it really wants is two arguments: the first being —cfg, the second being the config itself. That might have been why it wasn’t working before 🙂

Indeed the first one crashes. Interesting. Your solution works, thank you!

Would anybody happen to know how we can the status of the current IETF RFC? From memory it had some deadline around March, but I'm not familiar with how their processes work. Would asking on their mailing list be an appropriate thing to do?

@KodrAus between the issue tracker https://github.com/ietf-wg-uuidrev/rfc4122bis and the mailing list https://mailarchive.ietf.org/arch/browse/uuidrev/?q=draft-ietf-uuidrev-rfc4122bis it seems like they're just nitpicking the formatting and such. Maybe you could ping @kyzer-davis who seems to be heading it up.

Since they're currently focusing on serious issues like this

image

I'd have to imagine that the implementation details aren't going to change any further. Are you considering removing the uuid_unstable gate?

@tgross35 As soon as the standard is fixed I'd love to get rid of the uuid_unstable gate. It seems unlikely that it will at this point, but while there's still technically some chance I think we should play it safe.

@KodrAus today the RFC maintainer posted some updates for the new recently released draft, I'm not sure if there's anything relevant there: uuid6/prototypes#42

Do you have any idea which version of the draft this crate supports? I'll submit a PR to the readme for that repo to add this to the implementations table, unless you prefer to DIY https://github.com/uuid6/prototypes

@tgross35 Thanks for checking in to this. I think we based our implementation off the final version of the draft before its scope widened to rewriting RFC 4122 🤔 We should keep up-to-date with whatever the current version is though.

If you'd like to submit a PR that adds uuid to the table that would be great!

Thanks @tgross35 for following up on things! 🙇 This is where things are at right now. It seems like we're getting close to having these new formats stabilized in more or less their current form.

Any progress on the stabilization of UUID v6-v8?

@photino i think there will be a few more months before the RFC becomes officially accepted. Close, but not there yet - status at https://mailarchive.ietf.org/arch/browse/uuidrev/?q=draft-ietf-uuidrev-rfc4122bis

The RFC status has become Proposed Standard. Could we consider stabilizing the v6-v8 features?

@photino I think proposed standard is the maturity level we were blocking on, so once #717 is resolved I think we can finally remove the stability gate.

I think the test vectors (appendix A) changed for one of the new versions at some point. I'm not sure if related tests exist in this repo, but maybe they are worth pulling in?

I'm not sure where benchmarks are used, but one for v7 would be interesting to see how it compares to v4

I think the test vectors (appendix A) changed for one of the new versions at some point.

Oh well spotted. It looks like the v6 vector changed. I think I prefer our current test case where the timestamps in v1 and v6 are the same so you can see how the same timestamp encodes differently. As far as I can tell the way we encode v6 timestamps is still correct, but is the sort of thing that would be great to get a second opinion on before we ship 🙂

I'm not sure where benchmarks are used, but one for v7 would be interesting to see how it compares to v4

In my stabilization PR I added a benchmark you can run to evaluate the impact of switching from v4 to v7. For what it's worth, on my Windows desktop here I'm currently seeing results like:

test new_v4 ... bench:          28 ns/iter (+/- 1)
test now_v7 ... bench:          47 ns/iter (+/- 1)

I think the test vectors (appendix A) changed for one of the new versions at some point.

Oh well spotted. It looks like the v6 vector changed. I think I prefer our current test case where the timestamps in v1 and v6 are the same so you can see how the same timestamp encodes differently. As far as I can tell the way we encode v6 timestamps is still correct, but is the sort of thing that would be great to get a second opinion on before we ship 🙂

It looks correct to my eyes. But maybe it would be good to just add a test with the exact v6 and v7 examples from the RFC as a sanity check?

I'm not sure where benchmarks are used, but one for v7 would be interesting to see how it compares to v4

In my stabilization PR I added a benchmark you can run to evaluate the impact of switching from v4 to v7. For what it's worth, on my Windows desktop here I'm currently seeing results like:

test new_v4 ... bench:          28 ns/iter (+/- 1)
test now_v7 ... bench:          47 ns/iter (+/- 1)

🤦 I was thinking the existing benchmark was for v1 and not v4, couldn't figure out why the difference was so large. Thanks for the results!