Investigate non-v4 UUID usage
ctavan opened this issue · comments
The initial discussion in #3 brought up the question whether the UUID standard module should support UUID versions other than v4 from the beginning.
A rough analysis of the BigQuery GitHub dataset revealed the following usage:
| Row | version | cnt | ratio |
| --- | --- | --- | --- |
| 1 | v4 | 18318 | 0.797 |
| 2 | v1 | 4399 | 0.191 |
| 3 | v5 | 231 | 0.010 |
| 4 | v3 | 29 | 0.001 |
@littledan made the hypothesis that some part of v1 UUID usage could actually be caused by developers having accidentally chosen the wrong UUID version, since v1 sounds much more like the "default" UUID type than v4.
To verify this hypothesis I suggest looking a bit deeper into the BigQuery GitHub dataset and:
- Extract the repos that make use of v1
- Sort them by github stars or watch count
- Pick the most prominent modules and check how and why they use v1 UUIDs
- Potentially get in touch with the maintainers of these modules to clarify why v1 is being used instead of v4
We could do the same for v3/5.
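The first two steps could be sketched roughly as follows. This is only an illustration under assumptions: the actual BigQuery queries are not shown in this thread, the function names `countUuidVersionCalls` and `rankByWatchCount` are hypothetical, and a plain pattern match like this would miss aliased imports (for example a deep `require('uuid/v1')` bound to another name) and would also produce false positives.

```javascript
// Count direct calls such as uuid.v1() or uuid.v4() in a source string.
// Hypothetical helper: a crude textual match, not a real parser.
function countUuidVersionCalls(source) {
  const counts = { v1: 0, v3: 0, v4: 0, v5: 0 };
  const pattern = /\buuid\.(v[1345])\s*\(/g;
  for (const match of source.matchAll(pattern)) {
    counts[match[1]] += 1;
  }
  return counts;
}

// Sort repos by watch count, most watched first, without mutating the input.
function rankByWatchCount(repos) {
  return [...repos].sort((a, b) => b.watchCount - a.watchCount);
}

// Usage with made-up file contents and two watch counts from the dataset.
const sample = "const id = uuid.v1();\nconst other = uuid.v4();\nuuid.v1 ();";
console.log(countUuidVersionCalls(sample));
console.log(
  rankByWatchCount([
    { name: "sequelize/sequelize", watchCount: 1445 },
    { name: "TryGhost/Ghost", watchCount: 2045 },
  ]).map((repo) => repo.name)
);
```

Hits from a match like this would still need the manual check described in the third step.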
Does this make sense to you? Anything you would add or do differently? Would be great to get some feedback before I actually start working on this 🙂.
It's worth noting that the README for uuid has the v1 incantation shown first, with the v4 incantation shown "below the fold", so that may be a factor as well. That said, I'd be surprised if a statistically significant # of projects were making this decision by mistake. They'll have a reason for the v1/v4 choice, even if that reason may not be very defensible (e.g. "v1 ids must be more unique than v4 because they're time-based").
Question: How much of the v1 usage would have to be a mistake for it to have an impact on our decision making? (E.g. even if half of it was erroneous and we adjusted for that with a 90% v4 to 10% v1 distribution... does that change anything? My initial thought is, "Nope. We still need to design an API that allows for other versions, 'cause we're going to have to support them eventually.")
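As a quick check of that thought experiment, the counts from the table at the top of the thread can be re-weighted. This is only arithmetic on the reported numbers; the helper names `shares` and `reassignV1` are made up for illustration, and the "half of v1 is a mistake" fraction is the hypothetical from the question, not a measured value.

```javascript
// Counts reported from the BigQuery GitHub dataset analysis.
const counts = { v4: 18318, v1: 4399, v5: 231, v3: 29 };

// Convert raw counts into shares of the total.
function shares(c) {
  const total = Object.values(c).reduce((sum, n) => sum + n, 0);
  return Object.fromEntries(
    Object.entries(c).map(([version, n]) => [version, n / total])
  );
}

// Hypothetically treat a fraction of v1 usage as "should have been v4".
function reassignV1(c, fraction) {
  const moved = c.v1 * fraction;
  return { ...c, v4: c.v4 + moved, v1: c.v1 - moved };
}

const adjusted = shares(reassignV1(counts, 0.5));
console.log(adjusted.v4.toFixed(2), adjusted.v1.toFixed(2)); // 0.89 0.10
```

So the "90% v4 to 10% v1" figure is about what the data gives if half of the v1 usage were accidental.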
> It's worth noting that the README for uuid has the v1 incantation shown first, with the v4 incantation shown "below the fold", so that may be a factor as well.

Yeah, this is where the idea comes from: Developer gets told "use UUIDs", finds uuid module, stops reading at v1, uses it.

> That said, I'd be surprised if a statistically significant # of projects were making this decision by mistake.

I hope so too, but JavaScript is now broadly adopted in many business areas where there may be less quality control than we are used to from our own work environments, so I don't exclude the possibility that a significant number of projects actually chose the wrong UUID version.

> They'll have a reason for the v1/v4 choice, even if that reason may not be very defensible (e.g. "v1 ids must be more unique than v4 because they're time-based").

Should we identify such occurrences, I'd like to get in touch with the maintainers of these modules to hear their arguments.

> Question: How much of the v1 usage would have to be a mistake for it to have an impact on our decision making? (E.g. even if half of it was erroneous and we adjusted for that with a 90% v4 to 10% v1 distribution... does that change anything? My initial thought is, "Nope. We still need to design an API that allows for other versions, 'cause we're going to have to support them eventually.")
I had the chance to meet some TC39 folks in Berlin last week and my major lesson learned there was that we'll have to underpin every single design decision with rock solid arguments in order to eventually reach consensus with our proposal.
So first of all, no matter the outcome of my research, I still believe that our API should be open to allow for v1/3/5. However, should we figure out that a considerable amount of v1 usage was actually by mistake, I believe that this would be a really strong argument for emphasizing v4 UUIDs as the default case in our API.
In #3 the very first assumption I suggested was that the API should be symmetric in the different versions, but given the thoughts from this thread I think it might actually be beneficial to nudge users towards v4 and leave v1/3/5 to those users who really need those UUID variants and know what they are doing.
I think thoughts like these ("what did we learn from the userland module and can do better in a standard library") might ultimately be of big importance for the success of this proposal.
The method @ctavan describes above sounds amazingly good. I'm really excited to hear what you find!
Here are the top 100 GitHub repos (by watch count) that make use of uuid.v1():
repo_name watch_count
https://github.com/TryGhost/Ghost 2045
https://github.com/sequelize/sequelize 1445
https://github.com/getguesstimate/guesstimate-app 1442
https://github.com/gatsbyjs/gatsby 992
https://github.com/pouchdb/pouchdb 922
https://github.com/aws/aws-sdk-js 355
https://github.com/nteract/hydrogen 238
https://github.com/NoRedInk/take-home 202
https://github.com/azukiapp/azk 196
https://github.com/cowbell/sharedrop 161
https://github.com/volumio/Volumio2 153
https://github.com/lipp/doclets 149
https://github.com/forsigner/web-fontmin 144
https://github.com/jitsi/jitsi-meet 139
https://github.com/watson-developer-cloud/visual-recognition-nodejs 134
https://github.com/madrobby/microjs.com 127
https://github.com/tststs/atom-ternjs 123
https://github.com/skevy/graphiql-app 121
https://github.com/ryanfitz/vogels 111
https://github.com/rgbkrk/atom-script 109
https://github.com/biggora/caminte 107
https://github.com/ddsol/redux-schema 105
https://github.com/naomiaro/waveform-playlist 88
https://github.com/o2team/athena 78
https://github.com/yishn/Sabaki 72
https://github.com/CloudBoost/cloudboost 68
https://github.com/svenanders/react-breadcrumbs 65
https://github.com/jfhbrook/wzrd.in 56
https://github.com/michaelgrosner/tribeca 56
https://github.com/Microsoft/tfs-cli 56
https://github.com/mhart/kinesalite 54
https://github.com/firebase/firebase-tools 52
https://github.com/skale-me/skale-engine 50
https://github.com/apache/cordova-windows 47
https://github.com/tomatau/breko-hub 47
https://github.com/fisch0920/snapchat 46
https://github.com/LeanKit-Labs/wascally 46
https://github.com/waterlock/waterlock 43
https://github.com/fluuuid/codedoodl.es 42
https://github.com/joeferraro/MavensMate 41
https://github.com/heroku/heroku-cli 41
https://github.com/Microsoft/vso-agent 40
https://github.com/almin/almin 36
https://github.com/zubairq/AppShare 35
https://github.com/01org/intel-iot-services-orchestration-layer 33
https://github.com/botwiki/detective 33
https://github.com/strathausen/dracula 33
https://github.com/Wyliodrin/WyliodrinSTUDIO 32
https://github.com/kn9ts/project-mulla 31
https://github.com/buggerjs/bugger 30
https://github.com/XadillaX/aliyun-ons 30
https://github.com/hexparrot/mineos-node 30
https://github.com/cfsghost/lantern 28
https://github.com/ekristen/node-module-registry 27
https://github.com/maritz/nohm 26
https://github.com/arcseldon/react-babel-webpack-starter-app 25
https://github.com/mateodelnorte/servicebus 24
https://github.com/ethereumjs/keythereum 22
https://github.com/avoidwork/rozu 22
https://github.com/willwhitney/hydrogen 21
https://github.com/buildkite/frontend 20
https://github.com/codeforamerica/streetmix 20
https://github.com/medihack/redux-pouchdb-plus 20
https://github.com/mostafab/neg5 19
https://github.com/bjrmatos/electron-workers 18
https://github.com/BreeeZe/rpos 17
https://github.com/JackGit/material-ui-vue 17
https://github.com/crubier/react-graph-vis 16
https://github.com/cgmartin/ReadingBuddies 15
https://github.com/flatsheet/flatsheet 14
https://github.com/webofthings/webofthings.js 13
https://github.com/lookify/cassandrom 13
https://github.com/artsy/metaphysics 13
https://github.com/chrisxclash/play-midnight 13
https://github.com/matveyco/8biticon 12
https://github.com/dhawalhshah/class-central 12
https://github.com/marklogic/marklogic-samplestack 11
https://github.com/linkeddata/ldnode 11
https://github.com/stormpath/stormpath-sdk-node 11
https://github.com/webcredits/webcredits 10
https://github.com/slidewiki/slidewiki-platform 10
https://github.com/dloa/alexandria-librarian 10
https://github.com/danielstjules/redislock 10
https://github.com/mhzed/wstunnel 9
https://github.com/cozy/cozy-emails 9
https://github.com/JaredHawkins/TweetGeoViz 9
https://github.com/HongjianLi/istar 8
https://github.com/kisonecat/ximera 8
https://github.com/CiroArtigot/ababool 8
https://github.com/crosswalk-project/crosswalk-app-tools 8
https://github.com/NetEase/pomelo-rpc 7
https://github.com/AuspeXeu/openvpn-status 7
https://github.com/heroku/cli 7
https://github.com/codecapers/mongodb-rest 7
https://github.com/eduardoboucas/jekyll-discuss 7
https://github.com/ISBX/isbx-loopback-cms 7
https://github.com/fruum/fruum 7
https://github.com/SamVerschueren/aws-lambda-mock-context 7
https://github.com/roman01la/RTVideo 6
https://github.com/GabiGrin/ElmFiddle.io 6
And these are the ones that make use of uuid.v3() or uuid.v5() and have a watch count of at least 2:
repo_name watch_count
https://github.com/gatsbyjs/gatsby 992
https://github.com/nteract/hydrogen 238
https://github.com/jitsi/jitsi-meet 139
https://github.com/e14n/pump.io 112
https://github.com/foxhound87/rfx-stack 93
https://github.com/fisch0920/snapchat 46
https://github.com/Wyliodrin/WyliodrinSTUDIO 32
https://github.com/willwhitney/hydrogen 21
https://github.com/buildkite/frontend 20
https://github.com/marklogic/marklogic-samplestack 11
https://github.com/NuSkooler/enigma-bbs 10
https://github.com/c3subtitles/L2S2 7
https://github.com/fnogatz/CHR.js 6
https://github.com/actionnick/exposure 4
https://github.com/laundree/laundree 3
https://github.com/koding/kite.js 2
BTW there may still be quite a few false positives in these results. I'll pick some that sound interesting to me and check the source manually.
Just did a quick check on gatsby and ghost and they indeed use v1 UUIDs, however at first glance it looked like they could use v4 UUIDs equally well…
I'll dig deeper over the next days and also make sure to provide all my queries for review in #7.
@ctavan thank you for doing this. As per @codehag's out-of-band feedback, perhaps a good next step, before we dig too deep into the data you've collected, is coming up with a set of hypotheses that we're trying to prove or disprove.
If we take on the qualitative approach of reaching out to a few folks using UUID v1, etc., we also need to be careful not to come in with a biased point of view.
@bcoe agree.
So let's first try to characterize the 3 classes of UUIDs that are described in the RFC:
- v4 completely random, very simple algorithm, 122 random bits
- v1 time-based, complex algorithm, if not used carefully, collisions are much more likely than with v4. UUIDs are time-ordered.
- v3/5 namespace-based, special use case. constant input leads to constant output.
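To make the difference between the random and the namespace-based class concrete, here is a minimal Node.js sketch. It is not the `uuid` module's implementation; the names `sketchV4`, `sketchV5`, `stamp`, `formatUuid` and `uuidVersion` are invented for this example, and it assumes the layout from the RFC: the version lives in the high nibble of byte 6 and the variant in the top two bits of byte 8, which is why 6 of v4's 128 bits are fixed and 122 remain random.

```javascript
const crypto = require("crypto");

// Format 16 bytes as the canonical 8-4-4-4-12 hex string.
function formatUuid(bytes) {
  const hex = Buffer.from(bytes).toString("hex");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20, 32),
  ].join("-");
}

// Overwrite the version nibble (byte 6) and the variant bits (byte 8).
function stamp(bytes, version) {
  bytes[6] = (bytes[6] & 0x0f) | (version << 4);
  bytes[8] = (bytes[8] & 0x3f) | 0x80;
  return bytes;
}

// v4 sketch: 16 random bytes, then version and variant stamped in.
function sketchV4() {
  return formatUuid(stamp(crypto.randomBytes(16), 4));
}

// v5 sketch: SHA-1 over namespace bytes followed by the name,
// truncated to 16 bytes, so constant input gives constant output.
function sketchV5(name, namespaceUuid) {
  const ns = Buffer.from(namespaceUuid.replace(/-/g, ""), "hex");
  const digest = crypto.createHash("sha1").update(ns).update(name, "utf8").digest();
  return formatUuid(stamp(digest.subarray(0, 16), 5));
}

// Read the version nibble back out of a canonical UUID string.
function uuidVersion(uuid) {
  return parseInt(uuid[14], 16);
}

// The DNS namespace UUID defined in the RFC (itself a v1 UUID).
const DNS_NAMESPACE = "6ba7b810-9dad-11d1-80b4-00c04fd430c8";
console.log(uuidVersion(sketchV4()));
console.log(sketchV5("example.com", DNS_NAMESPACE) === sketchV5("example.com", DNS_NAMESPACE));
```

v1 is left out of the sketch on purpose: it additionally needs a clock, a clock sequence and a node identifier, which is the "complex algorithm" part of the list above.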
Hypothesis: Following the principle of least surprise, you should always use the simplest UUID version that fulfills your use case, because this reduces the risk of unexpected problems. So if all you need is a unique identifier, you should always use v4 UUIDs. Only if you need time-ordering should you use v1, and only if you need namespacing should you use v3/5.
In particular, accidentally using v1 instead of v4 UUIDs in cases where the developer is simply expecting a random value but is not aware of the fact that the generated IDs are time-ordered can have very negative consequences:
- If these IDs are used as database keys and a database/cache does ID-based sharding, it can lead to "hot shards".
- Developers who unintentionally use v1 UUIDs in public datasets may not be aware of the fact that the creation timestamp and MAC address are leaked. See Wikipedia: “This privacy hole was used when locating the creator of the Melissa virus.”
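The second point can be illustrated by decoding a v1 UUID. The sketch below assumes the RFC field layout (time_low, time_mid, time_hi_and_version, clock_seq, node) and a timestamp counted in 100-nanosecond intervals since 1582-10-15; `inspectV1` is a hypothetical helper name. Whether the node field holds a real MAC address depends on the generator, since some implementations substitute a random node id.

```javascript
// 100-ns intervals between 1582-10-15 (UUID epoch) and 1970-01-01 (Unix epoch).
const GREGORIAN_OFFSET = 122192928000000000n;

const UUID_SHAPE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/;

// Recover the creation time and node field embedded in a v1 UUID.
function inspectV1(uuid) {
  const lower = uuid.toLowerCase();
  if (!UUID_SHAPE.test(lower)) {
    throw new Error("not a canonical UUID string");
  }
  const [timeLow, timeMid, timeHiAndVersion, , node] = lower.split("-");
  if (timeHiAndVersion[0] !== "1") {
    throw new Error("not a v1 UUID");
  }
  // Reassemble the 60-bit timestamp: time_hi (12 bits), time_mid, time_low.
  const ticks = BigInt("0x" + timeHiAndVersion.slice(1) + timeMid + timeLow);
  const unixMs = Number((ticks - GREGORIAN_OFFSET) / 10000n);
  return {
    createdAt: new Date(unixMs).toISOString(),
    node: node.match(/../g).join(":"),
  };
}

// A sample v1 UUID; anyone who sees the ID can read these values.
console.log(inspectV1("c232ab00-9414-11ec-b3c8-9f6bdeced846"));
// { createdAt: '2022-02-22T19:22:22.000Z', node: '9f:6b:de:ce:d8:46' }
```

A v4 UUID carries no such fields, so there is nothing comparable to decode.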
Given these assumptions we want to understand whether:
- the various versions of UUIDs are being used according to their properties, which would favor an API that is symmetric in the UUID versions, like the current uuid npm module
- or whether we should correct for accidental misuse by designing an API that favors v4 UUIDs over the other versions.
Do you agree with these ideas?
Non-v4 UUID usage has been analyzed in detail in https://github.com/bcoe/proposal-standard-library-uuid/blob/master/analysis/README.md
Let's close this one and wait for explicit feedback on the analysis to figure out if additional research is necessary.
The analysis looks great. I'm very happy to see the responsible approach taken here, including following up with the inappropriate users of UUID v1. Seems like we can conclude that the UUID standard library should only support v4.