HenrikBengtsson / Wishlist-for-R

Features and tweaks to R that I and others would love to see - feel free to add yours!

Home Page: https://github.com/HenrikBengtsson/Wishlist-for-R/issues

WISH: Built-in R session-specific universally unique identifier (UUID)

HenrikBengtsson opened this issue

Proposal

Provide a built-in mechanism for obtaining an identifier for the current R session, e.g.

> Sys.info()[["session_uuid"]]
[1] "4258db4d-d4fb-46b3-a214-8c762b99a443"

The identifier should be "unique" in the sense that the probability of two R sessions(*) having the same identifier should be extremely small. There's no need for reproducibility, i.e. the algorithm for producing the identifier may be changed at any time.

(*) Two R sessions running at different times (seconds, minutes, days, years, ...) or on different machines (locally or anywhere in the world).

Use cases

In parallel-processing workflows, R objects may be "exported" (serialized) to background R processes ("workers") for further processing. In other workflows, objects may be saved to file to be reloaded in a future R session. However, certain types of objects in R may only be relevant, or valid, in the R session that created them. Attempts to use them in other R processes may give an obscure error or, in the worst case, produce garbage results.

Having an identifier that is unique to each R process will make it possible to detect when an object is used in the wrong context. This can be done by attaching the session identifier to the object. For example,

obj <- 42L
attr(obj, "owner") <- Sys.info()[["session_uuid"]]

With this, it is easy to validate the "ownership" later:

stopifnot(identical(attr(obj, "owner"), Sys.info()[["session_uuid"]]))

I argue that such an identifier should be part of base R, both for easy access and to spare each developer from having to roll their own.
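The tag-and-check pattern above could be wrapped in two small helpers. This is only a sketch: Sys.info()[["session_uuid"]] does not exist today, so the hypothetical session_uuid() below stands in for it with a random per-session string, and tag_owner() and is_owner() are illustrative names, not an existing API.

```r
# Hypothetical stand-in for the proposed Sys.info()[["session_uuid"]]:
# a random hex string generated once and then cached for the session.
session_uuid <- local({
  id <- NULL
  function() {
    if (is.null(id)) {
      hex <- c(0:9, letters[1:6])
      id <<- paste(sample(hex, size = 32L, replace = TRUE), collapse = "")
    }
    id
  }
})

# Attach the current session's identifier to an object (illustrative helper)
tag_owner <- function(obj) {
  attr(obj, "owner") <- session_uuid()
  obj
}

# TRUE only if the object was tagged by the current session
is_owner <- function(obj) {
  identical(attr(obj, "owner"), session_uuid())
}

obj <- tag_owner(42L)

# The attribute survives serialization, so it travels with the object
# to a worker or into a saved file
copy <- unserialize(serialize(obj, connection = NULL))
stopifnot(is_owner(copy))  # same session: the check passes

# Simulate an object that arrived from another session
foreign <- obj
attr(foreign, "owner") <- "some-other-session"
stopifnot(!is_owner(foreign))
```

Because the stand-in draws from R's own random number generator, two sessions started with the same seed would get the same string. That is one more reason a proper built-in identifier would be preferable.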

Possible implementation

One proposal would be to bring in Simon Urbanek's 'uuid' package (https://cran.r-project.org/package=uuid) into base R. This package provides:

> uuid::UUIDgenerate()
[1] "b7de6182-c9c1-47a8-b5cd-e5c8307a8efb"

based on Theodore Ts'o's libuuid (https://mirrors.edge.kernel.org/pub/linux/utils/util-linux/). From man uuid_generate:

"The uuid_generate function creates a new universally unique identifier (UUID). The uuid will be generated based on high-quality randomness from /dev/urandom, if available. If it is not available, then uuid_generate will use an alternative algorithm which uses the current time, the local ethernet MAC address (if available), and random data generated using a pseudo-random generator.
[...]
The UUID is 16 bytes (128 bits) long, which gives approximately 3.4x10^38 unique values (there are approximately 10^80 elementary particles in the universe according to Carl Sagan's Cosmos). The new UUID can reasonably be considered unique among all UUIDs created on the local system, and among UUIDs created on other systems in the past and in the future."

An alternative that does not require adding a dependency on the libuuid library would be to roll a poor man's version based on a set of semi-unique attributes, e.g.

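make_id <- function(...) {
  args <- list(...)
  # tools::md5sum() only operates on files, so serialize the
  # arguments to a temporary file first
  f <- tempfile()
  on.exit(file.remove(f))
  saveRDS(args, file = f)
  # Drop the file name that md5sum() attaches to the checksum
  unname(tools::md5sum(f))
}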

session_id <- local({
  id <- NULL
  function() {
    if (is.null(id)) {
      id <<- make_id(
        info    = Sys.info(),
        pid     = Sys.getpid(),
        tempdir = tempdir(),
        time    = Sys.time(),
        random  = sample.int(.Machine$integer.max, size = 1L)
      )
    }
    id
  }
})

Example:

> session_id()
[1] "8d00b17384e69e7c9ecee47e0426b2a5"

> session_id()
[1] "8d00b17384e69e7c9ecee47e0426b2a5"
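If the familiar 8-4-4-4-12 layout is wanted without libuuid, a random (version 4) UUID can be approximated in base R. The sketch below is an assumption-laden illustration, not how the uuid package or libuuid works internally: uuid_v4() is a hypothetical name, it reads 16 bytes from /dev/urandom when that device exists, and otherwise falls back to R's pseudo-random generator, which is much weaker.

```r
uuid_v4 <- function() {
  if (file.exists("/dev/urandom")) {
    # Preferred source: 16 bytes from the operating system
    con <- file("/dev/urandom", open = "rb")
    on.exit(close(con))
    bytes <- as.integer(readBin(con, what = "raw", n = 16L))
  } else {
    # Fallback (assumption: acceptable for a sketch): R's own RNG
    bytes <- sample.int(256L, size = 16L, replace = TRUE) - 1L
  }

  # Set the version field (high nibble of byte 7) to 4
  bytes[7] <- bitwOr(bitwAnd(bytes[7], 15L), 64L)
  # Set the variant field (top two bits of byte 9) to binary 10
  bytes[9] <- bitwOr(bitwAnd(bytes[9], 63L), 128L)

  hex <- sprintf("%02x", bytes)
  paste(
    paste(hex[1:4],   collapse = ""),
    paste(hex[5:6],   collapse = ""),
    paste(hex[7:8],   collapse = ""),
    paste(hex[9:10],  collapse = ""),
    paste(hex[11:16], collapse = ""),
    sep = "-"
  )
}

id <- uuid_v4()
```

Six of the 128 bits are fixed by the version and variant fields, so such an identifier carries 122 random bits.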

PS. Having a built-in make_id() function would be handy too, e.g. when creating object-specific identifiers for other purposes.

PPS. It would be neat if there was an object, or connection, interface for tools::md5sum(), which currently only operates on files sitting on the file system. The digest package provides this functionality.

See also

FWIW, the startup time and the pid pretty well identify a process. The ps package has code to get the startup time. If you want to use this across machines, then add the hostname or IP address, maybe.

> ps::ps_create_time(ps::ps_handle())
[1] "2019-05-20 23:31:52 GMT"

Good point. Didn't think of the process start time(*).

"... maybe", yeah, hostname/IP number may not be very unique - I wonder how many systems end up with the same values there, e.g. ("pi", 192.168.0.2).

(*) In the 'startup' package, I record the "onLoad" time when the startup package is loaded to serve as a proxy for base R not providing such an attribute itself. But, asking the OS about the process startup time is definitely nicer.
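For illustration, the hostname + pid + start-time idea from this thread could be sketched in base R roughly as follows. The name session_key() is hypothetical, and the timestamp is recorded when the code is first evaluated, so it is only a proxy for the true process start time (similar in spirit to the "onLoad" proxy described above), not what ps::ps_create_time() reports.

```r
session_key <- local({
  # Recorded once, when this code is first evaluated. Base R does not
  # expose the real process start time, so this is only a proxy.
  loaded_at <- Sys.time()
  function() {
    paste(
      Sys.info()[["nodename"]],                          # hostname
      Sys.getpid(),                                      # process ID
      format(loaded_at, "%Y%m%dT%H%M%OS6", tz = "UTC"),  # proxy start time
      sep = "_"
    )
  }
})

key <- session_key()
```

As noted above, hostnames are often not unique across machines, so a key like this mostly helps distinguish sessions on the same host or on a well-managed cluster.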