tc39 / proposal-uuid

UUID proposal for ECMAScript (Stage 1)

Investigate non-v4 UUID usage

ctavan opened this issue · comments

The initial discussion in #3 brought up the question whether the UUID standard module should support UUID versions other than v4 from the beginning.

A rough analysis of the BigQuery GitHub dataset revealed the following usage:

Row  version  cnt    ratio
1    v4       18318  0.797
2    v1       4399   0.191
3    v5       231    0.010
4    v3       29     0.001
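
As a quick sanity check, the ratio column can be recomputed from the counts. This is only a sketch: the counts are copied from the table, and the variable names are made up for illustration.

```javascript
// Counts copied from the usage table above.
const counts = { v4: 18318, v1: 4399, v5: 231, v3: 29 };

// Total across all four versions.
const total = Object.values(counts).reduce((sum, n) => sum + n, 0);

// Share of each version, rounded to three decimals like the table.
const ratios = Object.fromEntries(
  Object.entries(counts).map(([version, n]) => [version, Number((n / total).toFixed(3))])
);

console.log(total, ratios);
```

The recomputed shares agree with the table, so the ratio column is simply each count divided by the sum of all four counts.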

@littledan hypothesized that part of the v1 UUID usage could be caused by developers accidentally choosing the wrong UUID version, since v1 sounds much more like the "default" UUID type than v4 does.

To verify this hypothesis I suggest looking a bit deeper into the BigQuery GitHub dataset and:

  • Extract the repos that make use of v1
  • Sort them by github stars or watch count
  • Pick the most prominent modules and check how and why they use v1 UUIDs
  • Potentially get in touch with the maintainers of these modules to clarify why v1 is being used instead of v4

We could do the same for v3/5.
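
The extract-and-sort steps above could be sketched roughly as follows. This is not the actual query used for the analysis: the record shape (`repo`, `watchCount`, `content`), the sample data, and the function names are all assumptions for illustration, and a plain text match like this will produce false positives.

```javascript
// Hypothetical input: one record per source file pulled from the dataset.
const files = [
  { repo: 'example/alpha', watchCount: 120, content: 'const id = uuid.v1();' },
  { repo: 'example/beta', watchCount: 900, content: 'const id = uuid.v4();' },
  { repo: 'example/gamma', watchCount: 450, content: 'save(uuid.v1 ({ node }));' },
  { repo: 'example/gamma', watchCount: 450, content: 'log(uuid.v1());' },
  { repo: 'example/delta', watchCount: 30, content: 'uuid.v5(name, NAMESPACE);' },
];

// Builds a pattern matching call sites such as `uuid.v1(` for the given versions.
function callPattern(versions) {
  return new RegExp(`\\buuid\\.v(?:${versions.join('|')})\\s*\\(`);
}

// Returns the repos with at least one matching call, most watched first.
function topReposUsing(records, versions, limit = 100) {
  const pattern = callPattern(versions);
  const byRepo = new Map();
  for (const { repo, watchCount, content } of records) {
    if (pattern.test(content)) byRepo.set(repo, watchCount); // dedupe per repo
  }
  return [...byRepo.entries()]
    .map(([repo, watchCount]) => ({ repo, watchCount }))
    .sort((a, b) => b.watchCount - a.watchCount)
    .slice(0, limit);
}

console.log(topReposUsing(files, [1]));
console.log(topReposUsing(files, [3, 5]));
```

The resulting shortlist is only a starting point; each hit still needs a manual look at the source to see how and why the version was chosen.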

Does this make sense to you? Anything you would add or do differently? Would be great to get some feedback before I actually start working on this 🙂.

It's worth noting that the README for uuid has the v1 incantation shown first, with the v4 incantation shown "below the fold", so that may be a factor as well. That said, I'd be surprised if a statistically significant # of projects were making this decision by mistake. They'll have a reason for the v1/v4 choice, even if that reason may not be very defensible (e.g. "v1 ids must be more unique than v4 because they're time-based").

Question: How much of the v1 usage would have to be a mistake for it to have an impact on our decision making? (E.g. even if half of it was erroneous and we adjusted for that with a 90% v4 to 10% v1 distribution... does that change anything? My initial thought is, "Nope. We still need to design an API that allows for other versions, 'cause we're going to have to support them eventually.")
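
To put numbers on that thought experiment, here is a small sketch. The counts come from the usage table earlier in the thread; the 50% mistake rate is the hypothetical from the question, not a measured value, and the names are illustrative.

```javascript
// Counts copied from the usage table earlier in this thread.
const observed = { v4: 18318, v1: 4399, v5: 231, v3: 29 };
const totalUses = Object.values(observed).reduce((sum, n) => sum + n, 0);

// Assumption: this fraction of v1 usage was a mistake and should have been v4.
const mistakenFraction = 0.5;
const moved = Math.round(observed.v1 * mistakenFraction);

// Reassign the mistaken v1 uses to v4 and leave v3/v5 untouched.
const adjusted = { ...observed, v4: observed.v4 + moved, v1: observed.v1 - moved };
const share = (n) => n / totalUses;

console.log(share(adjusted.v4).toFixed(3), share(adjusted.v1).toFixed(3));
```

Under that assumption v4 lands at roughly 89% and v1 just under 10%, which is where the 90/10 figure comes from.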

It's worth noting that the README for uuid has the v1 incantation shown first, with the v4 incantation shown "below the fold", so that may be a factor as well.

Yeah, this is where the idea comes from: Developer gets told "use UUIDs", finds uuid module, stops reading at v1, uses it.

That said, I'd be surprised if a statistically significant # of projects were making this decision by mistake.

I do hope so, but given how broadly JavaScript is adopted these days, including in many business areas where there may be less quality control than we are used to from our own work environments, I don't exclude the possibility that a significant number of projects did in fact choose the wrong UUID version.

They'll have a reason for the v1/v4 choice, even if that reason may not be very defensible (e.g. "v1 ids must be more unique than v4 because they're time-based").

Should we identify such occurrences I'd like to get in touch with the maintainers of these modules to hear their arguments.

Question: How much of the v1 usage would have to be a mistake for it to have an impact on our decision making? (E.g. even if half of it was erroneous and we adjusted for that with a 90% v4 to 10% v1 distribution... does that change anything? My initial thought is, "Nope. We still need to design an API that allows for other versions, 'cause we're going to have to support them eventually.")

I had the chance to meet some TC39 folks in Berlin last week and my major lesson learned there was that we'll have to underpin every single design decision with rock solid arguments in order to eventually reach consensus with our proposal.

So first of all, no matter the outcome of my research, I still believe that our API should be open to allow for v1/3/5. However, should we figure out that a considerable amount of v1 usage was actually by mistake, I believe that this would be a really strong argument for emphasizing v4 UUIDs as the default case in our API.

In #3 the very first assumption I suggested was that the API should be symmetric in the different versions, but given the thoughts from this thread I think it might actually be beneficial to nudge users towards v4 and leave v1/3/5 to those users who really need those UUID variants and know what they are doing.

I think thoughts like these ("what did we learn from the userland module and can do better in a standard library") might ultimately be of big importance for the success of this proposal.

The method @ctavan describes above sounds amazingly good. I'm really excited to hear what you find!

Here are the top 100 github repos (by watch count) that make use of uuid.v1():

repo_name								watch_count
https://github.com/TryGhost/Ghost					2045
https://github.com/sequelize/sequelize					1445
https://github.com/getguesstimate/guesstimate-app			1442
https://github.com/gatsbyjs/gatsby					992
https://github.com/pouchdb/pouchdb					922
https://github.com/aws/aws-sdk-js					355
https://github.com/nteract/hydrogen					238
https://github.com/NoRedInk/take-home					202
https://github.com/azukiapp/azk						196
https://github.com/cowbell/sharedrop					161
https://github.com/volumio/Volumio2					153
https://github.com/lipp/doclets						149
https://github.com/forsigner/web-fontmin				144
https://github.com/jitsi/jitsi-meet					139
https://github.com/watson-developer-cloud/visual-recognition-nodejs	134
https://github.com/madrobby/microjs.com					127
https://github.com/tststs/atom-ternjs					123
https://github.com/skevy/graphiql-app					121
https://github.com/ryanfitz/vogels					111
https://github.com/rgbkrk/atom-script					109
https://github.com/biggora/caminte					107
https://github.com/ddsol/redux-schema					105
https://github.com/naomiaro/waveform-playlist				88
https://github.com/o2team/athena					78
https://github.com/yishn/Sabaki						72
https://github.com/CloudBoost/cloudboost				68
https://github.com/svenanders/react-breadcrumbs				65
https://github.com/jfhbrook/wzrd.in					56
https://github.com/michaelgrosner/tribeca				56
https://github.com/Microsoft/tfs-cli					56
https://github.com/mhart/kinesalite					54
https://github.com/firebase/firebase-tools				52
https://github.com/skale-me/skale-engine				50
https://github.com/apache/cordova-windows				47
https://github.com/tomatau/breko-hub					47
https://github.com/fisch0920/snapchat					46
https://github.com/LeanKit-Labs/wascally				46
https://github.com/waterlock/waterlock					43
https://github.com/fluuuid/codedoodl.es					42
https://github.com/joeferraro/MavensMate				41
https://github.com/heroku/heroku-cli					41
https://github.com/Microsoft/vso-agent					40
https://github.com/almin/almin						36
https://github.com/zubairq/AppShare					35
https://github.com/01org/intel-iot-services-orchestration-layer		33
https://github.com/botwiki/detective					33
https://github.com/strathausen/dracula					33
https://github.com/Wyliodrin/WyliodrinSTUDIO				32
https://github.com/kn9ts/project-mulla					31
https://github.com/buggerjs/bugger					30
https://github.com/XadillaX/aliyun-ons					30
https://github.com/hexparrot/mineos-node				30
https://github.com/cfsghost/lantern					28
https://github.com/ekristen/node-module-registry			27
https://github.com/maritz/nohm						26
https://github.com/arcseldon/react-babel-webpack-starter-app		25
https://github.com/mateodelnorte/servicebus				24
https://github.com/ethereumjs/keythereum				22
https://github.com/avoidwork/rozu					22
https://github.com/willwhitney/hydrogen					21
https://github.com/buildkite/frontend					20
https://github.com/codeforamerica/streetmix				20
https://github.com/medihack/redux-pouchdb-plus				20
https://github.com/mostafab/neg5					19
https://github.com/bjrmatos/electron-workers				18
https://github.com/BreeeZe/rpos						17
https://github.com/JackGit/material-ui-vue				17
https://github.com/crubier/react-graph-vis				16
https://github.com/cgmartin/ReadingBuddies				15
https://github.com/flatsheet/flatsheet					14
https://github.com/webofthings/webofthings.js				13
https://github.com/lookify/cassandrom					13
https://github.com/artsy/metaphysics					13
https://github.com/chrisxclash/play-midnight				13
https://github.com/matveyco/8biticon					12
https://github.com/dhawalhshah/class-central				12
https://github.com/marklogic/marklogic-samplestack			11
https://github.com/linkeddata/ldnode					11
https://github.com/stormpath/stormpath-sdk-node				11
https://github.com/webcredits/webcredits				10
https://github.com/slidewiki/slidewiki-platform				10
https://github.com/dloa/alexandria-librarian				10
https://github.com/danielstjules/redislock				10
https://github.com/mhzed/wstunnel					9
https://github.com/cozy/cozy-emails					9
https://github.com/JaredHawkins/TweetGeoViz				9
https://github.com/HongjianLi/istar					8
https://github.com/kisonecat/ximera					8
https://github.com/CiroArtigot/ababool					8
https://github.com/crosswalk-project/crosswalk-app-tools		8
https://github.com/NetEase/pomelo-rpc					7
https://github.com/AuspeXeu/openvpn-status				7
https://github.com/heroku/cli						7
https://github.com/codecapers/mongodb-rest				7
https://github.com/eduardoboucas/jekyll-discuss				7
https://github.com/ISBX/isbx-loopback-cms				7
https://github.com/fruum/fruum						7
https://github.com/SamVerschueren/aws-lambda-mock-context		7
https://github.com/roman01la/RTVideo					6
https://github.com/GabiGrin/ElmFiddle.io				6

And these are the ones that make use of uuid.v3() or uuid.v5() and have a watch count of at least 2:

repo_name						watch_count
https://github.com/gatsbyjs/gatsby			992
https://github.com/nteract/hydrogen			238
https://github.com/jitsi/jitsi-meet			139
https://github.com/e14n/pump.io				112
https://github.com/foxhound87/rfx-stack			93
https://github.com/fisch0920/snapchat			46
https://github.com/Wyliodrin/WyliodrinSTUDIO		32
https://github.com/willwhitney/hydrogen			21
https://github.com/buildkite/frontend			20
https://github.com/marklogic/marklogic-samplestack	11
https://github.com/NuSkooler/enigma-bbs			10
https://github.com/c3subtitles/L2S2			7
https://github.com/fnogatz/CHR.js			6
https://github.com/actionnick/exposure			4
https://github.com/laundree/laundree			3
https://github.com/koding/kite.js			2

BTW there may be still quite some false positives in these results. I'll pick some that sound interesting to me and check the source manually.

Just did a quick check on gatsby and ghost and they indeed use v1 UUIDs; however, at first glance it looked like they could use v4 UUIDs equally well…

I'll dig deeper over the next days and also make sure to provide all my queries for review in #7.

The initial analysis has been performed in #16 and more in-depth investigations for v1 UUIDs will follow in #19 so I think we can just continue there.

@ctavan thank you for doing this. As per @codehag's out-of-band feedback, perhaps a good next step, before we dig too deep into the data you've collected, is coming up with a set of hypotheses that we're trying to prove or disprove.

If we take on the qualitative approach of reaching out to a few folks using UUID v1, etc., we also need to be careful not to come in with a biased point of view.

@bcoe agree.

So let's first try to characterize the 3 classes of UUIDs that are described in the RFC:

  • v4 completely random, very simple algorithm, 122 random bits
  • v1 time-based, complex algorithm, if not used carefully, collisions are much more likely than with v4. UUIDs are time-ordered.
  • v3/5 namespace-based, special use case. constant input leads to constant output.
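
To make these properties concrete, here is a minimal Node.js sketch based on my reading of RFC 4122, not the `uuid` module's actual code. The helper names (`formatUuid`, `uuidV4`, `uuidV5`, `uuidVersion`) are made up for illustration; the DNS namespace value is the one listed in the RFC.

```javascript
const { randomBytes, createHash } = require('crypto');

// Formats 16 bytes as the usual 8-4-4-4-12 hex string.
function formatUuid(bytes) {
  const hex = Buffer.from(bytes).toString('hex');
  return [hex.slice(0, 8), hex.slice(8, 12), hex.slice(12, 16), hex.slice(16, 20), hex.slice(20)].join('-');
}

// Stamps the version nibble and the RFC 4122 variant bits onto 16 bytes.
function stamp(bytes, version) {
  bytes[6] = (bytes[6] & 0x0f) | (version << 4);
  bytes[8] = (bytes[8] & 0x3f) | 0x80;
  return bytes;
}

// v4: 128 random bits, of which 6 are overwritten, leaving 122 random bits.
function uuidV4() {
  return formatUuid(stamp(randomBytes(16), 4));
}

// v5: SHA-1 over namespace bytes plus name, truncated to 16 bytes.
function uuidV5(name, namespace) {
  const nsBytes = Buffer.from(namespace.replace(/-/g, ''), 'hex');
  const hash = createHash('sha1').update(nsBytes).update(name, 'utf8').digest();
  return formatUuid(stamp(hash.subarray(0, 16), 5));
}

// The version is the first hex digit of the third group.
function uuidVersion(uuid) {
  return parseInt(uuid[14], 16);
}

const DNS_NAMESPACE = '6ba7b810-9dad-11d1-80b4-00c04fd430c8';
console.log(uuidV4(), uuidVersion(uuidV4()));
console.log(uuidV5('example.org', DNS_NAMESPACE) === uuidV5('example.org', DNS_NAMESPACE));
```

Two calls to `uuidV4()` give unrelated values, while `uuidV5()` returns the same value for the same namespace and name, which is the "constant input leads to constant output" property.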

Hypothesis: Following the principle of least surprise the hypothesis is that you should always use the simplest UUID version that fulfills your use case because this reduces the risk of unexpected problems. So if all you need is a unique identifier, you should always use v4 UUIDs. Only if you need time-ordering you should use v1. And only if you need namespacing, you should use v3/5.

In particular, accidentally using v1 instead of v4 UUIDs can have very negative consequences when the developer simply expects a random value and is not aware that the generated IDs are time-ordered.
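
As one illustration of why the time-ordering matters (this is a sketch based on the RFC 4122 field layout, not something taken from the analysis), anyone holding a v1 UUID can read its creation time back out of it. The function name `v1Timestamp` is hypothetical.

```javascript
// Number of 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
const GREGORIAN_OFFSET = 0x01b21dd213814000n;

// Recovers the creation time embedded in a v1 UUID string.
function v1Timestamp(uuid) {
  const [timeLow, timeMid, timeHiAndVersion] = uuid.split('-');
  if (timeHiAndVersion[0] !== '1') {
    throw new Error('not a v1 UUID');
  }
  // Drop the version nibble, then reassemble the 60-bit timestamp: hi, mid, low.
  const timeHi = timeHiAndVersion.slice(1);
  const ticks = BigInt(`0x${timeHi}${timeMid}${timeLow}`);
  // Convert 100-ns ticks since 1582 to milliseconds since 1970.
  const unixMs = (ticks - GREGORIAN_OFFSET) / 10000n;
  return new Date(Number(unixMs));
}

console.log(v1Timestamp('13814000-1dd2-11b2-8000-000000000000').toISOString());
```

So a v1 ID used where an unguessable random value was expected (a token, say) reveals when it was created and makes neighboring IDs much easier to guess than with v4.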

Given these assumptions we want to understand whether:

  1. the various versions of UUIDs are being used according to their properties, which would favor an API that is symmetric in the UUID versions, like the current uuid npm module
  2. or if we should correct for accidental misuse by designing an API that favors v4 UUIDs over the other UUIDs.

Do you agree with these ideas?

Non-v4 UUID usage has been analyzed in detail in https://github.com/bcoe/proposal-standard-library-uuid/blob/master/analysis/README.md

Let's close this one and wait for explicit feedback on the analysis to figure out if additional research is necessary.

The analysis looks great. I'm very happy to see the responsible approach taken here, including following up with the inappropriate users of UUID v1. Seems like we can conclude that the UUID standard library should only support v4.