stan-dev / loo

loo R package for approximate leave-one-out cross-validation (LOO-CV) and Pareto smoothed importance sampling (PSIS)

Home Page:https://mc-stan.org/loo

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Inconsistent Pareto k-values for SIS?

fweber144 opened this issue · comments

I noticed that there might be an inconsistency in the Pareto k-values between different approaches for standard importance sampling (SIS):

library(loo)
log_ratios <- -1 * example_loglik_array()
log_ratios <- log_ratios[1:3, , ]
r_eff <- relative_eff(exp(-log_ratios))

# Call psis():
psis_result <- psis(log_ratios, r_eff = r_eff)
# In fact, SIS was used (due to the small number of draws):
lw_sis <- apply(log_ratios, 3, as.vector)
lw_sis <- sweep(lw_sis, 2, apply(lw_sis, 2, matrixStats::logSumExp))
stopifnot(all.equal(weights(psis_result), lw_sis,
                    tolerance = .Machine$double.eps))

# Now request SIS explicitly:
sis_result <- sis(log_ratios, r_eff = r_eff)
# The (log) weights are as expected:
stopifnot(all.equal(weights(sis_result), lw_sis,
                    tolerance = .Machine$double.eps))

# However:
table(pareto_k_values(psis_result))
## Inf
##  32
table(pareto_k_values(sis_result))
##  0
## 32

The point is that calling psis() with a small number of draws will cause the Pareto smoothing not to take place. Instead, SIS is used, as demonstrated above. In that case, the Pareto k-values are Inf. When using sis() explicitly, the Pareto k-values are 0.

Background: In projpred, it is possible (although not encouraged and in particular, this is not the default behavior) to use PSIS-LOO CV with the search being excluded from the CV (validate_search = FALSE) and a small number of thinned draws. In principle, projpred could use sis() explicitly in such a case (and then either continue with the Pareto k-values which are all 0 or even skip the Pareto k checks), but that requires to catch the "small S" case manually (which is not a problem, but if loo changes anything in its "small S" decision rule in the future, this would require adapting projpred's decision rule analogously). Using psis() would be more straightforward, but then we have Pareto k-values which are Inf, which would trigger warnings in the Pareto k checks.

Hmm yeah that inconsistency is unfortunate. I’m not sure what to do about it. Any ideas?

One thought: in hindsight it would actually probably make more sense for the Pareto k values to be NA and not 0 or Inf in all cases when Pareto smoothing isn’t done. But doing that (or changing between 0 and Inf) at this point wouldn’t be backwards compatible for users who were extracting and using the stored k values in the psis object (that’s certainly a minority of users though).

Yeah, the backwards compatibility is definitely an issue. Perhaps the new behavior could be implemented conditional on a global option which needs to be set to TRUE explicitly. When doing this, the current behavior could also be deprecated, so that in some later release, only the new behavior is available. For implementing the new behavior, #137 might be helpful.

In any case, since I wouldn't want projpred to require the most recent loo version where that new behavior is optionally available, I will try to find some other solution for projpred (either catching the "small S" case manually or perhaps using capture.output() to get the warnings from psis() and then checking for the "small S" case there). So this issue here isn't too urgent from my side.

For projpred, I guess checking whether all Pareto k-values are Inf (and regarding them as something like NA) is unsafe as well, right? (I don't know if Pareto smoothing could result in all-Inf Pareto k-values as well.)

For projpred, I guess checking whether all Pareto k-values are Inf (and regarding them as something like NA) is unsafe as well, right? (I don't know if Pareto smoothing could result in all-Inf Pareto k-values as well.)

I think that would almost always be ok, but there could be annoying edge cases where that would be unsafe.

Yes, I see. Thank you!

(btw yesterday I added you to the r-packages team in the stan-dev GitHub organization. I should have done that a while ago!)

Perhaps the new behavior could be implemented conditional on a global option which needs to be set to TRUE explicitly. When doing this, the current behavior could also be deprecated, so that in some later release, only the new behavior is available. For implementing the new behavior, #137 might be helpful.

Yeah that’s definitely a possibility.

Another possibility is to just make the change (breaking backwards compatibility) and call it loo v3.0. I’d be more ok with that if we do a major version number increase. But usually a major version release has major new functionality or other big changes that we don’t really have at the moment.

(btw yesterday I added you to the r-packages team in the stan-dev GitHub organization. I should have done that a while ago!)

Yes, I saw that. Thank you very much!

Another possibility is to just make the change (breaking backwards compatibility) and call it loo v3.0. I’d be more ok with that if we do a major version number increase. But usually a major version release has major new functionality or other big changes that we don’t really have at the moment.

Yes, I agree that this would not be worth a new major version (3.x.x).

Btw, I just realized that the capture.output() solution is probably not a good one, because testthat::test_that() for example somehow manages to redirect warning messages (probably so that it can throw them itself) so they are not captured correctly by capture.output(). And if testthat::test_that() is able to do that, I guess other functions/packages can do that, too.

The point is that calling psis() with a small number of draws will cause the Pareto smoothing not to take place. Instead, SIS is used, as demonstrated above. In that case, the Pareto k-values are Inf. When using sis() explicitly, the Pareto k-values are 0.

I don't understand how they can be different as the diagnostic is done before smoothing and thus whether smoothing is done or not should not affect the diagnostic values.