s-u / uuid

R package for handling UUIDs (based on libuuid)

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Creating multiple UUIDs with single function call

minimenchmuncher opened this issue · comments

I'm trying to create lots of UUIDs at the same time (outputting a character vector), preferably with a minimal amount of code. I had noted this while assigning ids to rows in a data frame using dplyr:

dplyr::mutate(df, id = uuid::UUIDgenerate)

which worked great until I noticed that all of the ids I got were identical. I then thought it might work to wrap this in an sapply:

dplyr::mutate(df, id = sapply(1:nrow(df), uuid::UUIDgenerate))

which I think works, but I got a bit concerned about having each of them use the time to generate the uuids considering how close together they would be generated. Is this a real concern?

My one last idea was to do something like this

id <- c(); for (i in 1:nrow(df)) id[i] <- uuid::UUIDgenerate()

in order to not use the clock; however, this may work poorly if my data frame is quite long.
Which of these might be best?

I know it's been a while since you posted this, but since I'm researching using this package as well, I thought I'd pipe in.

Up Front: my initial reading of libuuid1(3) and the OSF DCE 1.1 spec suggests that there are intentional efforts (in the spec) to ensure robust/unique performance even in the presence of sub-100ns resolution UUID generation. This suggests that "time" should be sufficient for most needs. YMMV.

Your first code example is incorrect for two reasons.

### win10
library(dplyr)
mutate(mtcars, id=uuid::UUIDgenerate)
# Error in mutate_impl(.data, dots) : 
#   Unsupported type CLOSXP for column "id"
mutate(mtcars, id=uuid::UUIDgenerate()) %>% head(n=3)
#    mpg cyl disp  hp drat    wt  qsec vs am gear carb                                   id
# 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0e445fee-4234-11e7-a686-87e63d2b049f
# 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0e445fee-4234-11e7-a686-87e63d2b049f
# 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 0e445fee-4234-11e7-a686-87e63d2b049f

The first call (your example) is incorrect because you are assigning a function, not its return value. (If a previous version of dplyr supported the first form, my bad, but the premise of that syntax is questionable and not currently supported.) However, even if you add () to the code, the function always returns a single GUID. It is analogous to mutate(mtcars, id=1), obviously not the desired results. You need to call UUIDgenerate() in a way that returns as many GUIDs as you need.

Your third form is essentially the same as your second form (result-wise), but it is far less efficient. I'll stick with the second form.

Your second form is (IMHO) the best for a starting point. Further example code:

### win10
mutate(mtcars, id=sapply(seq_along(mpg), uuid::UUIDgenerate)) %>% head(n=3)
#    mpg cyl disp  hp drat    wt  qsec vs am gear carb                                   id
# 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 988cd97e-4234-11e7-a686-87e63d2b049f
# 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 988cd97f-4234-11e7-a686-87e63d2b049f
# 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 988ced06-4234-11e7-a686-87e63d2b049f
mutate(mtcars, id=sapply(seq_along(mpg), function(a) uuid::UUIDgenerate())) %>% head(n=3)
#    mpg cyl disp  hp drat    wt  qsec vs am gear carb                                   id
# 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 9dd7c718-4234-11e7-a686-87e63d2b049f
# 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 9dd7dab4-4234-11e7-a686-87e63d2b049f
# 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 9dd7dab5-4234-11e7-a686-87e63d2b049f
mutate(mtcars, id = replicate(n(), uuid::UUIDgenerate())) %>% head(n=3)
#    mpg cyl disp  hp drat    wt  qsec vs am gear carb                                   id
# 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 1be738b6-4238-11e7-a686-87e63d2b049f
# 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 1be738b7-4238-11e7-a686-87e63d2b049f
# 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 1be738b8-4238-11e7-a686-87e63d2b049f

The first two seem somewhat similar, except that in the first form you are passing an integer as the first argument to the UUIDgenerate function. The first and only argument to the function is use.time, accepting a logical or NA. Providing an non-zero integer effectively says TRUE (since your integer vector will never be 0, which is equivalent to FALSE), so you will always be using TRUE. This may not be a problem if you don't have a good source of randomness. The first form is therefore not advisable. More the point, if you agree that explicitness in code is generally preferred, you should not be inadvertently passing the integer to the function.

Since we really don't care about the integer being passed, the third call is clearer on the intention (although the time-to-process is nearly identical on my system, so no speed improvement).

Using "time" to generate GUIDs may not be a problem. I want/need to look at the source some more, but take a look at this:

### win10
replicate(3, uuid::UUIDgenerate())
# [1] "300b6180-4235-11e7-a686-87e63d2b049f" "300b6181-4235-11e7-a686-87e63d2b049f" "300b6182-4235-11e7-a686-87e63d2b049f"
length(unique(replicate(1000, uuid::UUIDgenerate())))
# [1] 1000
length(unique(replicate(1000, uuid::UUIDgenerate(TRUE))))
# [1] 1000
length(unique(replicate(1000, uuid::UUIDgenerate(FALSE))))
# [1] 29

It seems from this that by forcing "random" (by saying use.time=FALSE), I have determined that uuid does not have access to a sufficiently-random source, so it uses time. (To me this is a curious development: what constitutes "random-enough"? What about my win10/R-3.3.3 dev system does not have good-enough random? Furthermore, it does not use R's PRNG ... try set.seed(1);uuid::UUIDgenerate(); repeatedly to see.)

Is there concern about using time instead of random? Two points in my mind:

  • Notice that I did 1000 calls to UUIDgenerate(), and it came up with 1000 unique GUIDs. This suggests that time alone will suffice. In a crude benchmark, the calls were on the order of 46μs (win10) or 14μs (ubuntu) each, and since the clock is around 1μs resolution (one ref), I think we're good (for now?).
  • More importantly, the OSF DCE 1.1 spec goes to great pains to ensure time-based UUIDs will not collide. This package relies on the underlying libuuid1 library which adheres to that spec.

In my opinion: using time should provide reasonable assurance of uniqueness. The preferred method is perhaps more opinion-based and context-sensitive. (NB: if this is used in a DBMS, there are allegedly negative ramifications to using GUIDs based on randomness in indexed and/or joinable keys.)

I'm currently having this same issue; I'm trying to replicate(1000, uuid::UUIDgenerate()) but they're coming back non-unique. Is there any workaround?

I didn't like how this package created bulk GUIDs... the GUIDs were technically unique, so fulfilled the point, but they were only different by one character or 2... so it didn't suit my purposes... so I wrote another function instead. Technically 2 functions, but one is internal.


letterCombo <- function(n){
  abc <- "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
  a <- character(n)
  for(i in 1:n){
    b <- sample(1:62,1)
    a[i] <- substr(abc, b, b)
  }
  return(paste(a, collapse = ""))
}


bulkGUID <- function(n = 1000){
  a <- character(n)
  for (i in 1:n){
    a[i] <- paste0(letterCombo(8), "-", letterCombo(4), "-", letterCombo(4), "-", letterCombo(4),
                   "-", letterCombo(12))
  }

  returnie <- a %>% unique

  while (length(returnie) - n != 0){
    returnie[length(returnie) + 1] <- paste0(letterCombo(8), "-", letterCombo(4), "-", letterCombo(4), "-", letterCombo(4),
                   "-", letterCombo(12))
    returnie <- returnie %>% unique
  }
  beepr::beep(9)
  return(returnie)
}

Call it by calling bulkGUID(78) or however many GUIDs you'd like. Feel free to include in your package

@DataStrategist, if you really need a difference of more than 1-2 characters, than it sounds like you need a "V4" UUID, which is completely random. Your approach seems to do that, though I'll say that its performance in the big picture is a bit slow; you can use this function that seems to perform about twice as fast:

bulkUUID <- function(n) {
  nc <- 8L
  fmt <- "%04x%04x%04x%04x%04x%04x%04x%04x"
  stops <- c(1L, 9L, 13L, 17L, 21L, 33L)
  ns <- 6L
  m <- matrix(sapply(seq_len(n), function(ign) sample(65535, size = nc)),
              ncol = nc)
  ret1 <- apply(m, 1,
                function(x) do.call(sprintf, c(fmt, as.list(x))))
  ret2 <- mapply(function(a,b) substr(ret1, a, b), stops[-ns], stops[-1L] - 1L)
  if (n == 1) ret2 <- matrix(ret2, nrow = 1)
  ret <- apply(ret2, 1, paste, collapse = "-")
  ret
}

This can likely be sped up some more, perhaps by controlling how much random data is pulled to generate the matrix. (I also suggest that the use of magrittr::%>% is unnecessary overhead.)

Regardless, this package and repo has not seen any updates in over five years, so I think it unlikely that any changes will be made. These days, I often rely on SQL methods for acquiring large numbers of UUIDs, as DBMSs often have much more efficient engines for this. For instance, postgres has uuid_generate_v1mc() (and uuid_generate_v4(), ref: https://www.postgresql.org/docs/11/uuid-ossp.html).

Thanks @r2evans , it's true, my function isn't performant... especially the "in case of duplicates" part :) And it's true that this is better applied at the db level... unfortunately my db doesn't have this functionality, so I'm actually walking through a stupid pilot to figure out how to get it done the long way. Anywway, thanks, I didn't know about different versions of UUID.

Some thoughts, and you may not have control over them:

  • UUIDs have various types, and the debate about which is best is far-and-wide. Some believe that there is some "data leaking" in V1 (due to the embedded literal MAC address), easily mitigated; others mention that using V4 (truly random) can have an adverse effect on index page tables in the underlying DBMS architecture. I cannot speak to the impact of the second.

  • Many still prefer integer keys, for storage and speed. One issue is that in a clustered (multi-node) setup, it takes more effort to track them (though I believe is not insurmountable). An issue that some raise is that they prefer these keys to be globally-unique, so that one cannot (accidentally?) join on incorrect keys (something that cannot happen with global keys, including but not exclusively UUIDs).

I did some benchmarking on generating UUIDs (a V1 variant) between libuuid, native R, and DBMS solutions:

      n libuuid native postgres sqlite sqlserver
    100       0   0.02     1.23   1.13      0.84
   1000    0.05   0.07     1.13   1.21      1.08
  10000    0.47   3.27     1.35   1.45      1.17
 100000    5.39  39.41     3.10   3.50      2.68
1000000   63.48 338.66    16.61  17.47     16.31

(Units are "seconds".) n is the number of keys generated at once (along the lines of sample(n). While I did not repeat hundreds of times (that would have given better sample data), this gives a reasonable comparison of performance based on the expected use-case.

The postgres and sqlserver are native methods. I have not tested mariadb, oracle, or mysql, but they all have native functions as well. I saw somewhere that DB2 does not support them, but it was an old reference, so I don't know (and have no need to spend time on it).

In the end, even if your database does not have a native function or reasonable stored procedure you can write, you could still use RSQLite for generating your own.

#' @param n 'integer', number of UUIDs to generate
#' @param random1 'logical', whether to use a previously-defined MAC address (defined elsewhere) or generate a truly-random MAC address (V4) with each UUID
#' @param MAC 'character', the left 24 characters (with hyphens) or 20 characters (without hyphens), e.g., "1526BE96-ED5E-7284-3F4F-" or "1526BE96ED5E72843F4F", will be prepended to the rest of the random portion generated by SQLite
#' @param conn database connection object, perhaps from `DBI::dbConnect(RSQLite::SQLite(), ":memory:")`
#' @md
make_UUID_sqlite <- function(n, ..., random1 = FALSE, MAC, conn = NULL) {
  if (!requireNamespace("RSQLite", quietly = TRUE)) {
    stop("'RSQLite' package not available", call. = FALSE)
  }
  withqry <- sprintf("WITH RECURSIVE cnt(x) as (select 1 union all select x+1 from cnt limit %d)", n)
  if (random1) {
    mainqry <- "select (hex(randomblob(6))) as uuid from cnt"
    force(MAC)
  } else {
    mainqry <- "select (hex(randomblob(4))||'-'||hex(randomblob(2))||'-'||hex(randomblob(2))||'-'||hex(randomblob(2))||'-'||hex(randomblob(6))) as uuid from cnt"
    MAC <- ""
  }
  uuids <- DBI::dbGetQuery(conn, paste(withqry, mainqry))$uuid
  return(paste0(MAC, uuids))
}

(Note: if you do not require the hyphens, then you can compact the second mainqry with hex(randomblob(16)) instead of multiple hyphen-separated components. Ever-so-slightly faster.)

A flaw in this approach is that it is completely random, so there is no guarantee of uniqueness. With random1=FALSE, the chances are rather slim (2^128), but with random1=TRUE then this is increased to 2^48. While one can argue that even that is unlikely, it is still feasible. (I have run multiple tests of length(unique(adssql:::make_UUID_sqlite(1000000, random1=T, MAC="", conn=mycon))) and have never found indication of a duplicate. But that does not prove that it cannot happen.

What dbms are you using?

We are using Microsoft Access Web Apps. It's kinda like a kiddy front end for Azure SQL dbs. It's not a great system, but works great for passing authority down. Do you know about those?

Sorry, Access (and similar) are specifically not in my scan for ready solutions (in my experience, SQL in Access databases is too non-standard for my needs).

Please note that this all depends on the capabilities of the OS. This package just wraps libuuid which tries to use whichever facilities are available and the results depend on both the time and randomness sources provided by the OS. So if you see duplicates, please specify exactly the OS, kernel version etc. On my machine (x86_64, macOS 10.14.6, R 3.6.2) I get

> length(unique(sapply(1:100000, function(...) UUIDgenerate(F))))
[1] 100000
> length(unique(sapply(1:100000, function(...) UUIDgenerate(T))))
[1] 100000
> length(unique(sapply(1:100000, function(...) UUIDgenerate())))
[1] 100000

I'd like to be faithful to the libuuid implementation, but if there are modes that can be improved, I'm happy to add them.

@s-u, try it on Windows:

> length(unique(replicate(100, uuid::UUIDgenerate(T))))
[1] 100
> length(unique(replicate(100, uuid::UUIDgenerate(F))))
[1] 6

R.version
#                _                           
# platform       x86_64-w64-mingw32          
# arch           x86_64                      
# os             mingw32                     
# system         x86_64, mingw32             
# status                                     
# major          3                           
# minor          5.3                         
# year           2019                        
# month          03                          
# day            11                          
# svn rev        76217                       
# language       R                           
# version.string R version 3.5.3 (2019-03-11)
# nickname       Great Truth                 

This is a "known" thing: Windows' premise of sufficiently-random is not the same as other OSes. I'm not arguing that uuid needs to fix this bug. My hope is that we might effect something like uuid::UUIDgenerate(n=100) that would do the iteration efficiently (within C/C++) instead of duplicating all of the wrapper stuff for all 100 iterations. Thanks!

@r2evans thanks, I haven't been around anything running Windows for 20+ years so I wouldn't know and, admittedly, Windows is not a priority for libuuid (it's now part of util-linux). On unix a good source of randomness like /dev/urandom is used, but on Windows there is nothing - not even jrand, so the generated random numbers really just depend on the time precision, thus they are bad for repeated calls. I'll have a look if we can use any of the Windows Crypto API to get better randomness, but note that it will most likely slow things down considerably.

As for your original request, I can add the n option, that makes sense.

If you can address the crypto api, that's good, but I would expect it to be a lot more effort than a low-level allowance for n= ... and some more immediate payback and performance gains. Perhaps not just on windows, even.

I envy your successful avoidance of windows. Unfortunately, the integration of Excel for some of my clients is far higher than LibreOffice or similar have successfully/fully mimicked.

Thanks!