kassambara / ggpubr

'ggplot2' Based Publication Ready Plots

Home Page: https://rpkgs.datanovia.com/ggpubr/

Could ggpubr not set the random seed?

biobee opened this issue · comments

Hi Alboukadel,

ggboxplot sets a seed even when we do not add any points or jitter. Seeds are set in:
ggmaplot.R
ggadd.R
utilities.R

For ggboxplot, the seed is set 'globally' in ggadd. Could the jitter-related settings only be executed when jitter is asked for? The seed is set irrespective of the setting for "add".

In detail, in the ggadd function:
lines 87-91 set up jitter irrespective of the setting for "add".
Could lines 87-91 be placed after:
if ( "jitter" %in% add ){ on line 115?

For transparency, I would propose that setting the seeds in the different functions is done by the user, via the function call at the user end.

Something like this could be used to set the seed specified by the user:

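  if (!is.na(seed)) {
    # draw a replacement seed from the caller's current RNG state
    new_seed <- sample(.Machine$integer.max, 1L)
    set.seed(seed)
    # reseed on exit so the fixed seed does not leak to the caller
    on.exit(set.seed(new_seed))
  }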
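To make the proposal concrete, here is a minimal sketch of how that snippet might sit inside a function that takes a user-supplied seed. The function name jitter_values and its arguments are hypothetical and only illustrate the pattern; they are not part of ggpubr.

```r
# Hypothetical function, for illustration only (not ggpubr's API).
# With seed = NA (the default) the global RNG is used as-is and never reseeded.
jitter_values <- function(x, amount = 0.1, seed = NA) {
  if (!is.na(seed)) {
    # Drawn BEFORE set.seed(seed), so it depends on the caller's RNG state
    new_seed <- sample(.Machine$integer.max, 1L)
    set.seed(seed)
    # Reseed on exit so later draws do not follow the fixed seed
    on.exit(set.seed(new_seed))
  }
  x + runif(length(x), -amount, amount)
}

x <- c(1, 2, 3)
same_1 <- jitter_values(x, seed = 42)
same_2 <- jitter_values(x, seed = 42)
identical(same_1, same_2)  # TRUE: the jitter is reproducible on request
```

Because new_seed is drawn before set.seed(seed) is called, the stream the caller sees afterwards still varies from run to run. Note that this approach reseeds the generator rather than restoring its exact previous state.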

Hi all. This would be very useful. I just spent an entire day trying to figure out why some code was producing identical samples from rnorm. In the end I narrowed it down to a single line, a call to ggboxplot, so I came looking in the source.

I'm not at all familiar with R (this is the first code I ever wrote with it, actually - a Shiny app) so the following may be wrong. But I'd suggest:

  • This should at least be documented (maybe it is)
  • The ggplot functions could simply not call set.seed or only do so when being run under a test suite
  • The ggplot functions could save the value of .Random.seed, set their own seed, then restore the original setting.
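The last suggestion (save .Random.seed, set a seed, then restore) could look roughly like the sketch below. The helper name with_local_seed is made up for illustration and is not ggpubr code; it assumes the default setup where .Random.seed lives in the global environment and does not exist until the RNG has been used once in the session.

```r
# Hypothetical helper (not part of ggpubr): evaluate `expr` under a fixed
# seed, then put the caller's RNG state back exactly as it was.
with_local_seed <- function(seed, expr) {
  had_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  if (had_seed) {
    old_seed <- get(".Random.seed", envir = globalenv(), inherits = FALSE)
  }
  on.exit({
    if (had_seed) {
      assign(".Random.seed", old_seed, envir = globalenv())
    } else {
      # The RNG had never been used: remove the state we created
      rm(".Random.seed", envir = globalenv())
    }
  }, add = TRUE)
  set.seed(seed)
  force(expr)  # lazily evaluated argument, forced here after set.seed()
}

set.seed(1)
a <- runif(1)
jittered <- with_local_seed(123, runif(3))  # stands in for a jitter computation
b <- runif(1)

set.seed(1)
expected <- runif(2)
identical(c(a, b), expected)  # TRUE: the caller's stream was not disturbed
```

The seeded computation stays reproducible, while the caller's random stream continues as if the helper had never been called.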

That's all I can think of for now. Thanks very much for all the efforts and the open source code! :-) I'm of course happy to help test things. Or if you think one of these suggestions is worth implementing, I could have a go and send a pr.

I just wrote and deployed a small Shiny app to illustrate the issue:
https://terrycojones.shinyapps.io/ggboxplot-random-seed-demo/

Also note that calling set.seed(sample(.Machine$integer.max, 1L)) afterwards does not solve the issue, because the value returned by sample depends on R's global RNG, whose state has just been reset by the call to ggboxplot. So calling set.seed in that way after a call to ggboxplot also results in a duplicated stream of random numbers.
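A small self-contained sketch of why that workaround fails. The function plot_with_fixed_seed is a stand-in that only mimics the reported behaviour (a hard-coded set.seed inside a plotting call); it is not ggpubr code and the seed value 123 is arbitrary.

```r
# Stand-in for a plotting function that hard-codes a seed (hypothetical).
plot_with_fixed_seed <- function() {
  set.seed(123)
  invisible(NULL)
}

# Attempted workaround: reseed from sample() right after the plotting call.
draw_after_plot <- function() {
  plot_with_fixed_seed()
  # sample() reads the RNG state that was just fixed above, so it returns
  # the same "random" seed every time
  set.seed(sample(.Machine$integer.max, 1L))
  rnorm(3)
}

first  <- draw_after_plot()
second <- draw_after_plot()
identical(first, second)  # TRUE: the stream is still duplicated
```

Any seed derived from the global RNG after the reset is itself deterministic, so the only remedies are to avoid the reset or to restore the earlier state.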

Sorry to send so many messages, but this is a really serious problem. I just realized that running Bayesian (Stan) sampling is also affected by this. The sampling function takes a seed argument that defaults to sample.int(.Machine$integer.max, 1). That means that any code calling ggboxplot and then running Bayesian sampling will just repeat the exact same analysis. See https://mc-stan.org/rstan/reference/stanmodel-method-sampling.html

This issue will affect any code that's using R's regular RNG methods. It's in general not easy to know if you might be calling such a function (or one that calls such a function). You only find out if you're lucky and happen to notice identical behavior on runs that should have different results.

I don't understand why this hasn't received any attention. Setting the global R random number generator seed to a constant value has enormous implications for anyone doing any kind of stochastic processing. Anyone doing that who happens to be using this code is going to have silently invalidated results. How can this just be ignored?

TODO:

fixed now, thanks