Is it possible to define initial values for the parameter NamedTuple?
goedman opened this issue
Hi Tamas,
The very last example in StatisticalRethinkingJulia/DynamicHMCModels.jl never gave problems in DynamicHMC up to v1.0.6 (for reference, the old version used with DynamicHMC v1.0.6 is in m12.6d.jl).
With the 2.0 pre-release, though, it seems to have difficulty finding a reasonable stepsize (typically it sets ϵ < 1e-12, often far smaller, and never recovers). The updated version of m12.6d.jl is included below in the self-contained example section. Sometimes Optim succeeds, but often it can't find an optimum (if no Random.seed!(...), for example, is used).
I've looked at the new initialization options for q and ϵ and tried a few settings for ϵ, without luck. I tried initialization = (q = (...), ), but that is not correct. I did see a remark that both data and initial values can be included in the model function, but I haven't figured out how to do that (I wanted to reuse the initial values from DynamicHMC v1.0.6). That remark was in a comment and might be a left-over from the v1.0.6 examples.
The log density function in both versions is identical.
The data for this example is in the attached Kline.txt.
Please make sure you are using the latest tagged versions and the last stable release of Julia before proceeding. Support for other versions is very limited.
This is using DynamicHMC#master as of today (8/27/2019) on Julia v1.4 or v1.3-RC1.
Self-contained example that demonstrates the problem
using DynamicHMC, LogDensityProblems, TransformVariables, Flux, Random
using Distributions, Parameters, CSV, DataFrames, LinearAlgebra
Random.seed!(012451)
# Read attached Kline.txt file
df = DataFrame(CSV.read(joinpath(@__DIR__, "Kline.txt"), delim=';'));
# Add col logpop, set log() for population data
df.logpop = log.(df.population);
# Add id for societies
df.society = 1:10;
Base.@kwdef mutable struct KlineModel{Ty <: AbstractVector,
Tx <: AbstractMatrix, Ts <: AbstractVector}
"Observations (total_tools)."
y::Ty
"Covariates (logpop)"
x::Tx
"Society"
s::Ts
"Number of observations (10)"
N::Int
"Number of societies (also 10)"
N_societies::Int
end
function make_transformation(model::KlineModel)
as( (β = as(Array, size(model.x, 2)), α = as(Array, model.N_societies), σ = asℝ₊) )
end
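For context (a sketch added here, not part of the original issue): a TransformVariables transformation like the one above maps a flat unconstrained vector to the `(β, α, σ)` NamedTuple, and `inverse` maps a NamedTuple back. The dimensions below assume the Kline data (2 covariates, 10 societies):

```julia
using TransformVariables

# Same shape as make_transformation(model) for the Kline data.
t = as((β = as(Array, 2), α = as(Array, 10), σ = asℝ₊))

TransformVariables.dimension(t)   # 2 + 10 + 1 = 13 unconstrained dimensions
θ = transform(t, zeros(13))       # NamedTuple; σ = exp(0) = 1 via the asℝ₊ map
q = inverse(t, θ)                 # back to the flat vector, here zeros(13)
```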
# Instantiate the model with data and inits.
N = size(df, 1)
N_societies = length(unique(df.society))
x = hcat(ones(Int64, N), df.logpop);
s = df.society
y = df.total_tools
model = KlineModel(; y=y, x=x, s=s, N=N, N_societies=N_societies)
# Make the type callable with the parameters *as a single argument*.
function (model::KlineModel)(θ)
@unpack y, x, s, N, N_societies = model # data
@unpack β, α, σ = θ # parameters
ll = 0.0
ll += logpdf(Cauchy(0, 1), σ)
ll += sum(logpdf.(Normal(0, σ), α)) # α[1:10]
ll += logpdf(Normal(0, 10), β[1]) # a (intercept)
ll += logpdf(Normal(0, 1), β[2]) # bp (slope)
ll += sum(logpdf(Poisson(exp(α[s[i]] + dot(x[i, :], β))), y[i]) for i in 1:N)
ll
end
println()
θ = (β = [1.0, 0.25], α = rand(Normal(0, 1), N_societies), σ = 0.2)
model(θ) |> display
println()
# Wrap the problem with a transformation, then use Flux for the gradient.
P = TransformedLogDensity(make_transformation(model), model)
∇P = ADgradient(:Flux, P);
# Can we initialize? E.g.:
results = mcmc_with_warmup(Random.GLOBAL_RNG, ∇P, 1000;
# initialization = (ϵ = 0.001, ),
# warmup_stages = fixed_stepsize_warmup_stages()
)
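Regarding the title question, and if I read the v2.0 API right (please check the mcmc_with_warmup docstring): `initialization` expects `q` as a point in the *unconstrained* space, not a NamedTuple, so a NamedTuple of initial values would first be mapped through `TransformVariables.inverse`. A sketch, continuing the script above:

```julia
# Hypothetical sketch: convert a NamedTuple of initial values to the
# unconstrained vector q that `initialization` expects.
θ₀ = (β = [1.0, 0.25], α = zeros(N_societies), σ = 0.2)
q₀ = inverse(make_transformation(model), θ₀)
results = mcmc_with_warmup(Random.GLOBAL_RNG, ∇P, 1000;
                           initialization = (q = q₀, ϵ = 0.1))
```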
posterior = P.transformation.(results.chain)
println()
DynamicHMC.Diagnostics.EBFMI(results.tree_statistics) |> display
println()
DynamicHMC.Diagnostics.summarize_tree_statistics(results.tree_statistics) |> display
println()
posterior_β = mean(posterior[i].β for i in 1:length(posterior))
posterior_α = mean(posterior[i].α for i in 1:length(posterior))
posterior_σ = mean(posterior[i].σ for i in 1:length(posterior))
println()
posterior_β |> display
println()
posterior_α |> display
println()
posterior_σ |> display
println()
Output, expected outcome, comparison to other samplers
stan_result = "
Iterations = 1:1000
Thinning interval = 1
Chains = 1,2,3,4
Samples per chain = 1000
Empirical Posterior Estimates:
Mean SD Naive SE MCSE ESS
a 1.076167468 0.7704872560 0.01218247319 0.0210530022 1000.000000
bp 0.263056273 0.0823415805 0.00130193470 0.0022645077 1000.000000
a_society.1 -0.191723568 0.2421382537 0.00382854195 0.0060563054 1000.000000
a_society.2 0.054569029 0.2278506876 0.00360263570 0.0051693148 1000.000000
a_society.3 -0.035935050 0.1926364647 0.00304584994 0.0039948433 1000.000000
a_society.4 0.334355037 0.1929971201 0.00305155241 0.0063871707 913.029080
a_society.5 0.049747513 0.1801287716 0.00284808595 0.0043631095 1000.000000
a_society.6 -0.311903245 0.2096126337 0.00331426674 0.0053000536 1000.000000
a_society.7 0.148637507 0.1744680594 0.00275858223 0.0047660246 1000.000000
a_society.8 -0.164567976 0.1821341074 0.00287979309 0.0034297298 1000.000000
a_society.9 0.277066965 0.1758237250 0.00278001719 0.0055844175 991.286501
a_society.10 -0.094149204 0.2846206232 0.00450024719 0.0080735022 1000.000000
sigma_society 0.310352849 0.1374834682 0.00217380450 0.0057325226 575.187461
";
Contributing code for tests
Kline.txt is data from StatisticalRethinking and, as such, is not MIT licensed.
Thanks for the report. I can replicate the problem, and will investigate it soon (today or tomorrow). There seems to be something going on with the acceptance rate calculation, which drives the epsilons to a very small value.
@goedman, thank you for your patience. I investigated the issue, and the problem was that the initial optimization was going down the "funnel" common in multilevel models (zero variance, zero country-level effects). This has infinitely large density, but is hardly relevant (on a related note: I believe a reparametrization could help in general, but did not investigate this). Adapting to it gives you a tiny stepsize (for the narrow part of the funnel).
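(Added for reference, not from the original thread: the reparametrization hinted at above is typically the non-centered one, sampling standardized society effects and scaling by σ. A hypothetical sketch of the changed log density; the transformation would correspondingly use α_raw in place of α:)

```julia
# Non-centered sketch: α = σ .* α_raw with α_raw ~ Normal(0, 1),
# which removes the σ-dependent geometry (the "funnel") from α's prior.
function (model::KlineModel)(θ)
    @unpack y, x, s, N, N_societies = model
    @unpack β, α_raw, σ = θ               # α_raw replaces α in the transformation
    α = σ .* α_raw                        # implied society effects
    ll = logpdf(Cauchy(0, 1), σ)
    ll += sum(logpdf.(Normal(0, 1), α_raw))
    ll += logpdf(Normal(0, 10), β[1]) + logpdf(Normal(0, 1), β[2])
    ll += sum(logpdf(Poisson(exp(α[s[i]] + dot(x[i, :], β))), y[i]) for i in 1:N)
    ll
end
```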
I included a heuristic to avoid it and added more examples of how to customize the initial warmup steps (see #82), but it now works without any of those, just
results = mcmc_with_warmup(Random.GLOBAL_RNG, ∇P, 1000)
The docstring of mcmc_with_warmup now has an example of disabling optimization.
Incidentally, I find ForwardDiff to be faster for this model (and models of a similar size) than Flux. You may want to try and benchmark both options in StatisticalRethinking for small models. Flux and reverse mode AD in general have a comparative advantage in higher dimensions.
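(A sketch, not from the original thread: switching backends only changes the ADgradient call, so the two are easy to benchmark against each other on the same TransformedLogDensity `P`.)

```julia
# ForwardDiff (forward mode): often faster for low-dimensional models.
import ForwardDiff
∇P = ADgradient(:ForwardDiff, P)

# Flux (reverse mode): comparative advantage in higher dimensions.
# ∇P = ADgradient(:Flux, P)
```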
Finally, I would appreciate if you could rerun the StatisticalRethinking models after I merge #82 to see if the heuristic is doing anything bad to any of the models (generally, it should not).