tc39 / proposal-uuid

UUID proposal for ECMAScript (Stage 1)

API considerations (now & future)

ctavan opened this issue · comments

In order to kick off some discussion on the API I wanted to start collecting some thoughts:

  • The interface should be symmetric in the different UUID versions, i.e. none of the supported UUID versions should be treated with preference (like having uuid() generate a vX UUID by default).
  • For the uuid npm module it made sense to allow deep import of the different uuid version methods, e.g. to reduce bundle size when used in the browser and only a certain type of UUID was needed. Do we already have an idea of what the technical implementation of standard modules will look like?
  • Do we need some sort of class representation of UUIDs like it's done in Python or Java?
    • Do we need parsing of existing UUIDs?
    • Do we need validation of existing UUIDs?
  • Any arguments for offering an async api?

What am I missing?

@ctavan, thanks for getting this conversation going again. I think your questions are spot on, especially the bit about having a class-based API. The more I think about this, the more I like that idea, so I found myself taking a stab at what it might look like.

Main things to notice:

  • Default export is UUID class
  • Only the one class - no version-specific subclasses
  • Static methods for generating different versions of UUIDs. (Start with version 4 for now)
  • Built-in validation. Constructor does not allow non-RFC UUIDs to be instantiated.
  • Validation is liberal, allowing for any RFC-valid UUID (i.e. users can create their own UUIDs however they want, regardless of what versions std:uuid supports)
  • Getters for parsing out various fields on-demand.
  • Setting fields is not supported. (Future, though?)
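A minimal sketch of a class along these lines, for illustration only. It is not the code from the sketch referenced in this thread; the member names `fromString`, `bytes` and `version` and the exact validation rule (only the RFC 4122 variant bits are checked) are assumptions.

```javascript
// Hypothetical sketch of a single UUID class with built-in, liberal validation.
class UUID {
  #bytes;

  // Accepts 16 bytes and rejects anything that is not an RFC 4122 variant UUID.
  constructor(bytes) {
    if (!(bytes instanceof Uint8Array) || bytes.length !== 16) {
      throw new TypeError('UUID requires a 16-byte Uint8Array');
    }
    // RFC 4122 variant: the two most significant bits of byte 8 are "10".
    if ((bytes[8] & 0xc0) !== 0x80) {
      throw new TypeError('Not an RFC 4122 variant UUID');
    }
    this.#bytes = Uint8Array.from(bytes); // defensive copy, fields stay read-only
  }

  // Parses the canonical 8-4-4-4-12 hex form.
  static fromString(str) {
    const match = /^([0-9a-f]{8})-([0-9a-f]{4})-([0-9a-f]{4})-([0-9a-f]{4})-([0-9a-f]{12})$/i.exec(str);
    if (!match) throw new TypeError(`Malformed UUID string: ${str}`);
    const hex = match.slice(1).join('');
    const bytes = new Uint8Array(16);
    for (let i = 0; i < 16; i++) {
      bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
    }
    return new UUID(bytes);
  }

  // Getters parse fields on demand; there are no setters.
  get bytes() { return Uint8Array.from(this.#bytes); }
  get version() { return this.#bytes[6] >> 4; }

  toString() {
    const hex = Array.from(this.#bytes, (b) => b.toString(16).padStart(2, '0')).join('');
    return `${hex.slice(0, 8)}-${hex.slice(8, 12)}-${hex.slice(12, 16)}-${hex.slice(16, 20)}-${hex.slice(20)}`;
  }
}
```

Because validation only looks at the variant bits, a UUID of any version, including ones generated elsewhere, can be wrapped.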

To address your comments (and explain some of the above) ...

  • The interface should be symmetric in the different UUID versions, i.e. none of the supported UUID versions should be treated with preference (like having uuid() generate a vX UUID by default).

💯 agree. Hence, static factory methods.

For the uuid npm module it made sense to allow deep import of the different uuid version methods

This makes sense (kind of) for packaged modules, but if this is going to be built-in the community is probably better served by going with a less contentious practice. Hence, single, top-level export.

Do we need some sort of class representation of UUIDs

I think this makes sense. It allows users to work with both the binary and string forms of a UUID. E.g. in the CodePen above, you can create from either form (new UUID(bytes) or UUID.fromString()), and access in either form (uuid.bytes or uuid.toString())

Do we need parsing of existing UUIDs?
Do we need validation of existing UUIDs?

With a class-based API, validation in the constructor is a no-brainer, IMHO.

Any arguments for offering an async api?

No. UUID creation is generally a very fast process. It also tends to be CPU-bound rather than IO-bound, so I don't think an async API buys much. That said, there was one issue where the crypto API in Electron would block for an extended period on startup, but we resolved that and I don't see this as a significant design consideration.

What am I missing?

I'd like to figure out how to handle the wonky timestamps in version 1 UUIDs. uuid has the completely ad-hoc msecs and nsecs options to work around the issue there but ideally we wouldn't expose that as part of any future API. The BigInt type provides an elegant solution but is still experimental. Not really sure how that fits into our thinking on this, though.
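To illustrate why BigInt is a natural fit here: a version 1 UUID carries a 60-bit count of 100-nanosecond intervals since 1582-10-15, which does not fit into a JavaScript number without losing precision. The following is a rough sketch of reading that field; the function name is hypothetical and this is not part of the uuid module.

```javascript
// 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch (1970-01-01).
const GREGORIAN_TO_UNIX_OFFSET = 122192928000000000n;

// Hypothetical helper: extracts the v1 timestamp and converts it to Unix milliseconds.
function v1TimestampToUnixMs(uuidString) {
  const hex = uuidString.replace(/-/g, '');
  if (!/^[0-9a-f]{32}$/i.test(hex) || hex[12] !== '1') {
    throw new TypeError('Not a version 1 UUID');
  }
  const timeLow = BigInt('0x' + hex.slice(0, 8));    // low 32 bits
  const timeMid = BigInt('0x' + hex.slice(8, 12));   // middle 16 bits
  const timeHigh = BigInt('0x' + hex.slice(13, 16)); // high 12 bits, version nibble skipped
  const ticks = (timeHigh << 48n) | (timeMid << 32n) | timeLow;
  // BigInt division truncates, so sub-millisecond precision is dropped here.
  return Number((ticks - GREGORIAN_TO_UNIX_OFFSET) / 10000n);
}
```

The full 100-ns resolution is still available as a BigInt in `ticks` if a caller needs it, which avoids separate msecs/nsecs style options.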

👋 digging myself out from a giant mountain of work related to dropping Node 6 at Google, will do my best to pull this conversation into the README we're working on.

@littledan's advice was that we concentrate on fleshing things out in the README and don't worry about tests or the formal API definition; we can basically start with pseudo code I think.

Hmm, I'm wondering about some of the aspects here. I like how the uuid module is simple and easy to use, and I'm wondering if we could go even further in this direction, for the default API.

The interface should be symmetric in the different UUID versions, i.e. none of the supported UUID versions should be treated with preference (like having uuid() generate a vX UUID by default).

Why not have a default version which does UUIDv4? Seems like that's what people need most of the time, unless they have a particular need for repeatability (given that we all agree that we'll require a good source of randomness). It would be nice to save people the effort/mistakes of having to figure out which one to use.

For the uuid npm module it made sense to allow deep import of the different uuid version methods, e.g. to allow reducing bundle size when used in the browser and only a certain type of UUIDs was needed. Do we already have an idea of how the technical implementation of standard modules will look like?

For the native implementation, the idea would be that it's built into JS, so you don't have to worry about that. We could still make a decision based on implementation techniques for polyfills, though, if we need to.

Do we need some sort of class representation of UUIDs like it's done in Python or Java?

Are users asking for this? I don't think we should provide it just because. Unless developers are really clamoring for more features or messily implementing them themselves, I'd suggest using a function-based API.

The BigInt type provides an elegant solution but is still experimental. Not really sure how that fits into our thinking on this, though.

BigInt is at Stage 3, shipping in Chrome, and implementation is well underway in Firefox and Safari. I plan to propose it for Stage 4 in June. I think it's fine to depend on it.

Why not have a default version which does UUIDv4? Seems like that's what people need most of the time, unless they have a particular need for repeatability (given that we all agree that we'll require a good source of randomness). It would be nice to save people the effort/mistakes of having to figure out which one to use.

I think this question almost boils down to asking again whether the library should support anything else than v4 UUIDs at all.

First of all, I also assume that the heaviest use case for UUIDs is v4, most likely as entity identifiers in databases and APIs, and that most people just work with the string representation of these UUIDs. However, I don't have any actual data to support these assumptions apart from my own professional experience over the past 5 years.

Following my assumptions above I think that a class representation of v4 UUIDs is indeed of rather limited use, after all there's not more than the version/variant information plus randomness in it.

Having a class representation to me really starts making sense when working with v1 UUIDs (and likely v3/v5, although no personal experience here) where the timestamp which is included in these UUIDs is actually useful information that people want to parse and use (the namespace in v3/5 may be of similar interest). The use case that I had back when I contributed the initial implementation of v1 UUIDs (I believe it was in 2011) was primary keys (=unique timestamps) for time series stored in a Cassandra database. While this particular use case seems to be considered bad practice by now (at least according to https://stackoverflow.com/a/17946236) it was at least for me the use case that made me implement this stuff in javascript.

So to approach this question I would ask again: Should this library support all versions of UUID or should we go with just v4? Do we have any data on whether users would miss v1/3/5? How could we gather such data?

My suggestion is that we start with an API that has a single default export of a function (called uuid?) that takes no arguments and outputs a UUID v4 string, and consider in a future proposal a more elaborate class-based API with further versions, more detailed options, parsing support, etc. We can use standards to nudge people towards the design that makes sense in most situations.

start with an API that has a single default export of a function... in a future proposal

I agree that v4 string uuids are the 80-90% use case. But we would be remiss to not consider how other cases (v1, v3, v5, and binary uuids) would dovetail into whatever API we start with. If we start with a default v4 string-uuid function, can we pencil out what a future version of that API that supports v1/v3/v5, parsing, and binary uuids might look like?

BTW, I just had some fun with the BigQuery GitHub Dataset.

Here's what I did:

  1. Find all repos that have a toplevel or lerna-style package.json where uuid or node-uuid is defined as a dependency.
  2. Find all js|ts|jsx files from these repos that are not vendored dependencies (i.e. they don't contain node_modules in their path).
  3. Extract all lines where uuid appears.
  4. Check how often v1/v3/v4/v5 appears in these lines.

Here's the result:

Row  version  count  ratio
1    v4       18318  0.797
2    v1       4399   0.191
3    v5       231    0.010
4    v3       29     0.001
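Step 4 and the ratio column could be reproduced roughly like this. This is only a sketch: the actual BigQuery queries are not shown in this thread, and the plain substring matching and the helper names are assumptions.

```javascript
// Counts, per version, how many of the given source lines mention it.
function countVersionMentions(lines) {
  const counts = { v1: 0, v3: 0, v4: 0, v5: 0 };
  for (const line of lines) {
    for (const version of Object.keys(counts)) {
      if (line.includes(version)) counts[version] += 1;
    }
  }
  return counts;
}

// Converts absolute counts into ratios of the total, rounded to three decimals.
function toRatios(counts) {
  const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
  return Object.fromEntries(
    Object.entries(counts).map(([version, n]) => [version, (n / total).toFixed(3)])
  );
}

// Made-up sample lines, only to show the shape of the input.
const sample = [
  "const uuidv4 = require('uuid/v4');",
  "import v1 from 'uuid/v1';",
  "const id = uuid.v4();",
  "const uuid = require('uuid');",
];
console.log(countVersionMentions(sample));
console.log(toRatios({ v4: 18318, v1: 4399, v5: 231, v3: 29 }));
```

Feeding the counts from the table into `toRatios` gives the same ratios as listed above.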

So you were spot-on with your 80%-guess @broofa 😉

If you want I can share the gcloud project and/or queries with you if you want to dig deeper.

@ctavan this is amazing, sorry I didn't respond earlier (ramping up at Google has been more of an avalanche of work than I expected).

I'm attending TC39 in Berlin right now, and am going to float the initial work we've done on this specification with some of the delegates.

@ctavan @broofa @littledan why don't we start with an API that looks something like:

.uuid()

and we can, in a separate section, point out that this could be extended on to:

.uuid([options])

My thinking is we shouldn't propose an options object out of the gate though, since it could lead to feature creep.

We have two audiences: The naive user who just wants to get to uuid (v4) strings as quickly as possible, and the more advanced user who's going to want "more".

For the first user, I agree that experience should look something like this:

import uuid from 'std:uuid';

uuid(); // => 'b3dd0e96-fb6f-40b3-8fb5-bc0006415712'

I'm fine with that, as long as it doesn't interfere with providing the "more" part for the latter user. And I don't think it does. E.g. Is there any reason not to expose something like the UUID class I sketched out above as a non-default export (at some future date), thusly:

import {UUID} from 'std:uuid';

That's reasonable, right? Not saying this needs to be the future API, only that users wouldn't find this sort of incantation objectionable.

Regarding the options object, what if we just say, "uuid() doesn't take arguments, and never will". Keep it simple. Users that want more than that should use the advanced api (whatever that is).

That will increase demand for the advanced API, but that's not necessarily a bad thing.

I don't think it's a good idea to bake in the assumption that v4 will be the obvious choice for the rest of time just because it happens to be the option in most common use in the available data sets right now. It seems like

import { uuidv4 } from 'std:uuid';

is not significantly harder to use and is both more explicit and more future-proof.

(I am not suggesting that other algorithms be supported in the initial proposal, just that the proposal avoid picking one as the default forever.)

@bakkot Did you see the text in the readme and supporting documents explaining the default? Did you see flaws with that reasoning? Even if we support additional UUID types in the future, this seems like a strong default to recommend.

@littledan I saw the analysis directory talking about general background and usage statistics, and this readme entry which talks about usage statistics. If there's other docs, I didn't see them.

Those seem like they make a compelling case for providing v4 and no other things in the initial version of this proposal. They do not seem like they make a compelling case for assuming uuidv4 will always be the correct default (nor do they even appear to try to make that case), especially since explicitly naming the version provided does not seem to me to add much overhead.

I think the evidence there gives good reasoning to have v4 be an opinionated default: the uses of v1 tended to be in error, which is probably encouraged by the API shape of the npm uuid module. I agree that we shouldn't rule out these extensions for the future, though. However, I'm fine to be flexible on this aspect of the API shape.

@bakkot what my analysis of open source projects has shown and what I was trying to summarize in the faq entry is that in fact among the most popular open source projects that were using v1 UUIDs there was only one single project that had an inevitable reason to do so, see this section of the analysis.

In all other cases that I investigated it turned out that developers had chosen v1 rather by accident, mostly because v1 just happens to sound somewhat like a "default" and because v1 uuids are the first ones to be discussed in the npm uuid module documentation.

Now we could of course argue that it is not our duty to ensure that developers read the UUID spec and choose the right algorithm for their purpose. However evidence from the open source project analysis shows that this simply does not happen in practice and that by nudging people into using v4 UUIDs unless they have really compelling reasons may prevent a lot of "wrong" UUID usage in the future. As has been discussed earlier this won't prevent us from offering other UUID algorithms in the future for those developers who really need them.

@ctavan I am convinced that we should not make v1 seem like the correct default, even by just exposing it as one algorithm among several with equal prominence. I am also convinced that v4 in particular should be considered the best practice right now. It doesn't follow that we should assume v4 will be the correct default forever. Maybe there is a good reason to do so, but the README doesn't present one. Historically assumptions that the current best choice of algorithm will remain so forever have not tended to hold up very well, so making that case involves more than just saying that v4 is currently the most popular choice.

@bakkot I agree with your point and I'm confident that we'll be able to further improve our reasoning in the README.

In this FAQ entry we tried to argue that the v1 algorithm has considerable flaws given that nowadays hardware MAC addresses can no longer be considered reasonably unique. It's in fact a great example for your argument of how assumptions about the status quo at the time of writing that RFC in 2005 are simply no longer valid 15 years later.

Assuming that this standard library will be restricted to UUIDs as defined per RFC 4122 and will not be extended to support things like flake-id, nanoid, cuid or ulid, and given the arguments about the irreparable flaws of v1 UUIDs would you then follow the argument that v4 could be presented as a reasonable default?

Or would you suggest opening up the discussion to keep this API open to extensions for unique identifiers even beyond RFC 4122 (which would run counter to our current assumptions but may be worth discussing)?

It seems like folks' points are being missed. It sounds to me like @bakkot is saying that the very existence of a version FOUR means that there will inevitably be a version FIVE, and that it will be strongly recommended over v4 at that time. That suggests that even if v4 is the only good choice right now, it might be better to have no default at all rather than risking future migrations from N to N + 1 being harder.

@ljharb So far I had indeed not read @bakkot's argument the way you rephrased it.

I was under the assumption that RFC 4122 won't change or be extended but I have to admit that I'm not very familiar with the lifecycle of IETF RFC's and whether that is something to be expected.

Apart from that I believe that your answer also shows why there is so much confusion around UUID version numbering and what these "version" numbers actually mean.

[...] that the very existence of a version FOUR means that there will inevitably be a version FIVE, and that it will be strongly recommended over v4 at that time.

In fact there already are v5 UUIDs, and they have been in RFC 4122 since its publication. The crucial point is that these "version" numbers were not assigned in the sense that v2 is the successor of v1 and v3 the successor of v2. Instead, the RFC simply contains different categories of UUID generation algorithms, and these happen to have been numbered the way they are now for no obvious reason. The RFC itself calls this numbering "version", which is why we have followed it here. The term is misleading if "version" is understood as something ever-increasing where N+1 is better/newer/more recommended than N. See the first paragraph of the analysis README for a quick overview of the UUID algorithms from the RFC.
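To make this concrete: the "version" is just a 4-bit field in the UUID that names the algorithm used to generate it. A small sketch that reads it from the canonical string form follows; the function and table names are hypothetical, while the algorithm descriptions follow RFC 4122.

```javascript
// The version values assigned by RFC 4122; they name algorithms, not releases.
const ALGORITHMS = {
  1: 'time-based (timestamp + node ID)',
  2: 'DCE Security',
  3: 'name-based, MD5',
  4: 'random',
  5: 'name-based, SHA-1',
};

// Hypothetical helper: assumes a canonical 8-4-4-4-12 string,
// where the version nibble is the first hex digit of the third group.
function describeVersion(uuidString) {
  const version = parseInt(uuidString[14], 16);
  return ALGORITHMS[version] ?? `not assigned by RFC 4122 (${version})`;
}

console.log(describeVersion('b3dd0e96-fb6f-40b3-8fb5-bc0006415712'));
```

Since the field is four bits wide, values beyond 5 are representable but not assigned by RFC 4122.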

If you take a look back at my initial post in this thread you can see that my original assumption was that the API should indeed be symmetric in the different UUID algorithms. Further discussion and the analysis of Open Source repositories convinced me that the UUID RFC is apparently not widely understood and that enough people tend to not dive deep enough into it to pick the right algorithm. This has led us to the idea of promoting v4 UUIDs as a default while leaving the API open for adding other algorithms in the future for those who really need them.

Assuming that this standard library will be restricted to UUIDs as defined per RFC 4122

I'd hope that, if some later RFC replaces or extends that one with additional subtypes of the same variant of UUIDs defined in 4122 (which there is explicitly room to do), then we would consider adding those new subtypes to this library. But yes, I agree it makes sense to scope it to just that RFC (and any future extensions), and that we do not necessarily need to provide all of the variants in the RFC (in particular, I agree that we really should not expose v1).

would you then follow the argument that v4 could be presented as a reasonable default

I think it is a reasonable default now, but the problem with designing things for standard libraries for languages like JS is that they can never be changed. If, 15 years from now, there is a v7 which is considered to be the best practice, it would be unfortunate if import uuid from 'lib:uuid'; uuid() gave you a v4 UUID instead of a v7 UUID. But if we decide that this is what that line should do today, then that's what it will do 15 years from now.

I would like to avoid that situation. One way to avoid it is to say that you have to write import { uuidv4 } from 'lib:uuid'. (Again, I'm not suggesting that we should provide any other algorithms right now, just that we avoid using the default export.) Another is to argue that there is some convincing reason that v4 will always be the correct default, not just that it is the correct default today; if this is the case we should add that to the readme (keeping in mind that today's usage statistics are not enough). But if we are not totally convinced v4 will always be the correct default, then we should not make it the default now, because we will not be able to change the default in the future.

I'm not very familiar with the lifecycle of IETF RFC's and whether that is something to be expected

RFCs being updated or obsoleted is a normal thing to happen. (See, for example, the TLS 1.3 RFC.)

I don't know if there's a particular reason to expect it to happen in this case (other than the fact that the RFC explicitly reserves four bits for describing the subtype, despite only needing three), but I also don't know of a particular reason to expect it not to happen (though this may, of course, just be ignorance on my part).

If, 15 years from now, there is a v7 ...

Sounds like a good argument for versioning in the standard library. tc39/proposal-built-in-modules#17

That issue to me is a good argument to avoid versioning like the plague, and instead strive to design APIs so they never need breaking changes :-)

I've been mulling this issue over:

  • I strongly feel we should use v4 as the default UUID algorithm
  • However, I agree with @bakkot that we would back ourselves into a corner when a paper comes along that describes an inarguably better UUID algorithm (this will happen).
  • I don't know the answer just yet, but I think we need to figure out an API that's elegant today, but doesn't paint us into a corner in the future:
    • I'm not, however, advocating we implement version 1, version 3, and version 5 of RFC 4122; let's concentrate on the algorithm folks currently use.

I'm agreeing with @bakkot's summary of the problem, but am hoping we potentially figure out a better API surface.

My proposed solution to this problem is to make this library expose only a function for generating v4 uuids, and have that function be named something with v4 in the name.

For example,

import { uuidv4 } from 'std:uuid';

would accomplish this (assuming std:uuid had no other exports).

This makes v4 be the default because it's the only thing exposed, but we are in a position to later expand the API to support future best-practice versions as being at least equal citizens to v4. (I don't think we can have a clear default now and a different clear default later, but we can at least have a clear default now and have two equally-surfaced things later.)

This makes v4 be the default because it's the only thing exposed

That doesn't make it the default, that just makes it the only option initially. As soon as there is support for other versions (v1/v3/v5, for example) then it's no longer the default, it's just one of many, and we're back to the same problem we have currently where people use v1 because it's... well... "1".

For the record, I'm not concerned about a newer/better version coming along. It's been 14 years since 4122 was finalized and I'm not aware of any interest or activity going into developing a new version. IMHO, we're at least 10 years out from a compelling alternative. (I know, I know... "famous last words"). If/when something does emerge it's debatable whether it would even fall under the purview of 4122. My money would be on it featuring more bits, making it unsuitable as an extension to the current RFC.

As soon as there is support for other versions (v1/v3/v5, for example) then it's no longer the default

Right, so don't do that.

Sometimes you're looking for a quick and easy random id that you don't care too much about, where UUIDv4 just happens to be a good answer.

A single function API like uuid() that just spits out a new UUIDv4 string every time is perfect for that.

And sometimes you specifically need UUIDv4. You've got some system that specifies it so you go looking for it.

As someone who tends to know/research which UUID type I want, not having v4 in the name would leave me wondering whether I'm getting the version I need.

It would also feel weird if I needed something other than v4 and nothing from the built-in UUID implementation could be leveraged. I realize that there's almost nothing reusable between the implementations, it's more of an itch than a practical implication.

Maybe randomUUID() or generateUUID() would give enough context to make it clear which type of UUID is being generated while also avoiding setting things up to look like namespace is/was being reserved for future versions.

I personally like the idea of a UUID class with all the trimmings, but I have to admit that I've never had a use for more than to/from bytes (storage savings in bulk quantities). If other flavours of UUID get added later they could be added along with a class that could include a v4 subclass (or whatever) for completeness.

Maybe randomUUID() or generateUUID() would give enough context to make it clear which type of UUID is being generated.

@bakkot @broofa @ctavan, I like @rmg's suggestion that we default to version 4, but call the export randomUUID (we could then have a UUID class that is more general purpose, and could be extended to other versions).

@waldemarhorwat and another peer have raised the point to me that, by the IETF definition of a UUID, a UUID cannot have more entropy than version 4, and still be considered a UUID (it uses the minimal 6 bits for meta-information, the rest is entropy). I think that, as long as we draw attention to the fact that this is an API for a random UUID, there's no danger that a UUID will come along with more randomness.
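The 6 bits of meta-information are the 4 version bits and the 2 variant bits; the remaining 122 bits of a version 4 UUID are random. A minimal sketch of that layout follows. `formatV4` is a hypothetical helper name, and the use of a Web Crypto style global `crypto` is an assumption about the host environment, not something the proposal specifies.

```javascript
// Hypothetical helper: turns 16 random bytes into a version 4 UUID string.
function formatV4(randomBytes) {
  if (randomBytes.length !== 16) throw new RangeError('Expected 16 bytes');
  const bytes = Uint8Array.from(randomBytes);
  bytes[6] = (bytes[6] & 0x0f) | 0x40; // 4 version bits: 0100
  bytes[8] = (bytes[8] & 0x3f) | 0x80; // 2 variant bits: 10
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('');
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join('-');
}

// Assumes a global `crypto` with getRandomValues (browsers, recent Node.js).
function randomUUID() {
  return formatV4(crypto.getRandomValues(new Uint8Array(16)));
}
```

Every bit not overwritten in `formatV4` comes straight from the random source, which is why no other RFC 4122 layout can carry more entropy.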

If at some point a better specification for creating identifiers emerges, it would not be an IETF UUID; I think it would be an outside context problem.

I would be fine with naming the v4 export randomUUID(). Other languages/libraries seem to be using the term random to describe v4 as well (go, Java, C++ Boost).

I also like the argument, that RFC4122 doesn't allow for more entropy than in v4 UUIDs, which makes it seem like randomUUID is a future-proof name.

BTW while looking at other languages I found a few interesting things:

  • Java provides methods for generating v3 and v4 UUIDs but not v1 or v5 (it is odd that it offers v3 rather than v5, given that the RFC already recommends v5 over v3).
  • C++ Boost defaults to v5 over v3 for name-based UUIDs, but its implementation anticipates that v5 (which uses SHA-1 for hashing) will be followed by a newer name-based UUID version that uses a different hashing algorithm ("In anticipation of a new RFC for uuid arriving…").
  • Google's implementation for go has chosen v1 to be the "default" export whose generator method is called NewUUID(), whereas the other versions have less defaulty-sounding names: NewRandom() for v4, NewMD5() for v3, NewSHA1() for v5.

randomUUID works for me.

I like @rmg's suggestion that we default to version 4, but call the export randomUUID

To be clear, is the suggestion that the export statement would be as follows:

export default function randomUUID() {...}

... such that:

import uuid from 'std:uuid';  // uuid === randomUUID above
import {randomUUID, ...} from 'std:uuid'; // randomUUID === randomUUID above

@broofa, yes your summary agrees with what I was thinking:

If we go the route of the global namespace, something like:

randomUUID() // as a super-simple interface, that provides version 4 UUIDs.

// and potentially something like this, as an advanced interface:

const uuid = new UUID({version});
uuid();

If we ended up going the module route, something like:

export class UUID {...}
export default function randomUUID() {...}

// such that:

import uuid from 'std:uuid';  // uuid === randomUUID above
import {randomUUID, UUID} from 'std:uuid'; // randomUUID === randomUUID above

@rmg ☝️ does this fit what you were thinking?


I think we could then flesh out the UUID class into a more general API, with randomUUID as a super-simple option, or potentially we start simple with just randomUUID? (similar to @littledan's initial recommendation).

@bcoe @broofa yes to both as interpretations of my suggestion, but over time.

export default function randomUUID() {...}
// TODO: export class UUID {...}

I think the UUID class could be deferred until a later point when there is demand for different flavours of UUID.

The only thing I've actually done with UUIDs that isn't directly satisfied by a randomUUID(): String interface is re-formatting the underlying bits to non-canonical formats to optimize for lower level storage/transmission (e.g. raw bytes for binary, base64 for text).
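That kind of re-formatting can be done on top of a string-only API. A rough sketch with hypothetical helper names follows; `btoa` is assumed to be available as a global, as it is in browsers and recent Node.js.

```javascript
// Parses a canonical UUID string into its 16 raw bytes.
function uuidToBytes(uuidString) {
  if (!/^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(uuidString)) {
    throw new TypeError('Malformed UUID string');
  }
  const hexPairs = uuidString.replace(/-/g, '').match(/../g);
  return Uint8Array.from(hexPairs, (pair) => parseInt(pair, 16));
}

// Encodes the 16 bytes as base64 (24 characters including padding).
function bytesToBase64(bytes) {
  return btoa(String.fromCharCode(...bytes));
}

const bytes = uuidToBytes('b3dd0e96-fb6f-40b3-8fb5-bc0006415712');
console.log(bytes.length, bytesToBase64(bytes).length);
```

The canonical string is 36 characters, the raw form is 16 bytes, and the base64 form is 24 characters.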

Having switched to ULIDs from UUIDs really helped during development because of the ability to quickly sort data by id.