Is it possible to define initial values for the parameter NamedTuple?
goedman opened this issue
Hi Tamas,
The very last example in StatisticalRethinkingJulia/DynamicHMCModels.jl never gave problems in DynamicHMC up to v1.0.6 (for reference, the old version used with DynamicHMC v1.0.6 is in m12.6d.jl).
With the 2.0 pre-release, though, it seems to have difficulty finding a reasonable stepsize (typically it sets ϵ < 1e-12, often far smaller, and never recovers). The updated version of m12.6d.jl is included below in the self-contained example section. Sometimes Optim succeeds, but often it can't find an optimum (if no Random.seed!(...), for example, is used).
I've looked at the new initialization options for q and ϵ and tried a few settings for ϵ, without luck. I tried initialization = (q = (...), ), but that is not correct. I did see a remark that both data and initial values can be included in the model function, but I haven't figured out how to do that (I wanted to reuse the initial values from DynamicHMC v1.0.6). That remark was in a comment and might be a left-over from the v1.0.6 examples.
The log density function in both versions is identical.
The data for this example is in the attached Kline.txt.
Please make sure you are using the latest tagged versions and the last stable release of Julia before proceeding. Support for other versions is very limited.
This is using DynamicHMC#master as of today (8/27/2019) on Julia v1.4 or v1.3-RC1.
Self-contained example that demonstrates the problem
using DynamicHMC, LogDensityProblems, TransformVariables, Flux, Random
using Distributions, Parameters, CSV, DataFrames, LinearAlgebra
Random.seed!(012451)
# Read attached Kline.txt file
df = DataFrame(CSV.read(joinpath(@__DIR__, "Kline.txt"), delim=';'));
# Add col logpop, set log() for population data
df.logpop = log.(df.population);
# Add id for societies
df.society = 1:10;
Base.@kwdef mutable struct KlineModel{Ty <: AbstractVector,
Tx <: AbstractMatrix, Ts <: AbstractVector}
"Observations (total_tools)."
y::Ty
"Covariates (logpop)"
x::Tx
"Society"
s::Ts
"Number of observations (10)"
N::Int
"Number of societies (also 10)"
N_societies::Int
end
function make_transformation(model::KlineModel)
as( (β = as(Array, size(model.x, 2)), α = as(Array, model.N_societies), σ = asℝ₊) )
end
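For context (a sketch added here, not part of the original issue): a TransformVariables transformation like the one above maps a flat unconstrained vector to the `(β, α, σ)` NamedTuple, and `inverse` maps a NamedTuple back. The dimensions below assume the Kline data (2 covariates, 10 societies):

```julia
using TransformVariables

# Same shape as make_transformation(model) for the Kline data.
t = as((β = as(Array, 2), α = as(Array, 10), σ = asℝ₊))

TransformVariables.dimension(t)   # 2 + 10 + 1 = 13 unconstrained dimensions
θ = transform(t, zeros(13))       # NamedTuple; σ = exp(0) = 1 via the asℝ₊ map
q = inverse(t, θ)                 # back to the flat vector, here zeros(13)
```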
# Instantiate the model with data and inits.
N = size(df, 1)
N_societies = length(unique(df.society))
x = hcat(ones(Int64, N), df.logpop);
s = df.society
y = df.total_tools
model = KlineModel(; y=y, x=x, s=s, N=N, N_societies=N_societies)
# Make the type callable with the parameters *as a single argument*.
function (model::KlineModel)(θ)
@unpack y, x, s, N, N_societies = model # data
@unpack β, α, σ = θ # parameters
ll = 0.0
ll += logpdf(Cauchy(0, 1), σ)
ll += sum(logpdf.(Normal(0, σ), α)) # α[1:10]
ll += logpdf(Normal(0, 10), β[1]) # a (intercept)
ll += logpdf(Normal(0, 1), β[2]) # bp (slope)
ll += sum(logpdf(Poisson(exp(α[s[i]] + dot(x[i, :], β))), y[i]) for i in 1:N)
ll
end
println()
θ = (β = [1.0, 0.25], α = rand(Normal(0, 1), N_societies), σ = 0.2)
model(θ) |> display
println()
# Wrap the problem with a transformation, then use Flux for the gradient.
P = TransformedLogDensity(make_transformation(model), model)
∇P = ADgradient(:Flux, P);
# Can we initialize? E.g.:
results = mcmc_with_warmup(Random.GLOBAL_RNG, ∇P, 1000;
# initialization = (ϵ = 0.001, ),
# warmup_stages = fixed_stepsize_warmup_stages()
)
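Regarding the title question, and if I read the v2.0 API right (please check the mcmc_with_warmup docstring): `initialization` expects `q` as a point in the *unconstrained* space, not a NamedTuple, so a NamedTuple of initial values would first be mapped through `TransformVariables.inverse`. A sketch, continuing the script above:

```julia
# Hypothetical sketch: convert a NamedTuple of initial values to the
# unconstrained vector q that `initialization` expects.
θ₀ = (β = [1.0, 0.25], α = zeros(N_societies), σ = 0.2)
q₀ = inverse(make_transformation(model), θ₀)
results = mcmc_with_warmup(Random.GLOBAL_RNG, ∇P, 1000;
                           initialization = (q = q₀, ϵ = 0.1))
```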
posterior = P.transformation.(results.chain)
println()
DynamicHMC.Diagnostics.EBFMI(results.tree_statistics) |> display
println()
DynamicHMC.Diagnostics.summarize_tree_statistics(results.tree_statistics) |> display
println()
posterior_β = mean(posterior[i].β for i in 1:length(posterior))
posterior_α = mean(posterior[i].α for i in 1:length(posterior))
posterior_σ = mean(posterior[i].σ for i in 1:length(posterior))
println()
posterior_β |> display
println()
posterior_α |> display
println()
posterior_σ |> display
println()
Output, expected outcome, comparison to other samplers
stan_result = "
Iterations = 1:1000
Thinning interval = 1
Chains = 1,2,3,4
Samples per chain = 1000
Empirical Posterior Estimates:
Mean SD Naive SE MCSE ESS
a 1.076167468 0.7704872560 0.01218247319 0.0210530022 1000.000000
bp 0.263056273 0.0823415805 0.00130193470 0.0022645077 1000.000000
a_society.1 -0.191723568 0.2421382537 0.00382854195 0.0060563054 1000.000000
a_society.2 0.054569029 0.2278506876 0.00360263570 0.0051693148 1000.000000
a_society.3 -0.035935050 0.1926364647 0.00304584994 0.0039948433 1000.000000
a_society.4 0.334355037 0.1929971201 0.00305155241 0.0063871707 913.029080
a_society.5 0.049747513 0.1801287716 0.00284808595 0.0043631095 1000.000000
a_society.6 -0.311903245 0.2096126337 0.00331426674 0.0053000536 1000.000000
a_society.7 0.148637507 0.1744680594 0.00275858223 0.0047660246 1000.000000
a_society.8 -0.164567976 0.1821341074 0.00287979309 0.0034297298 1000.000000
a_society.9 0.277066965 0.1758237250 0.00278001719 0.0055844175 991.286501
a_society.10 -0.094149204 0.2846206232 0.00450024719 0.0080735022 1000.000000
sigma_society 0.310352849 0.1374834682 0.00217380450 0.0057325226 575.187461
";
Contributing code for tests
Kline.txt is data from StatisticalRethinking and, as such, is not MIT licensed.
Thanks for the report. I can replicate the problem, and will investigate it soon (today or tomorrow). There seems to be something going on with the acceptance rate calculation, which drives the epsilons to a very small value.
@goedman, thank you for your patience. I investigated the issue, and the problem was that the initial optimization was going down the "funnel" common in multilevel models (zero variance, zero country-level effects). This has infinitely large density, but is hardly relevant (on a related note: I believe a reparametrization could help in general, but did not investigate this). Adapting to it gives you a tiny stepsize (for the narrow part of the funnel).
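(Added for reference, not from the original thread: the reparametrization hinted at above is typically the non-centered one, sampling standardized society effects and scaling by σ. A hypothetical sketch of the changed log density; the transformation would correspondingly use α_raw in place of α:)

```julia
# Non-centered sketch: α = σ .* α_raw with α_raw ~ Normal(0, 1),
# which removes the σ-dependent geometry (the "funnel") from α's prior.
function (model::KlineModel)(θ)
    @unpack y, x, s, N, N_societies = model
    @unpack β, α_raw, σ = θ               # α_raw replaces α in the transformation
    α = σ .* α_raw                        # implied society effects
    ll = logpdf(Cauchy(0, 1), σ)
    ll += sum(logpdf.(Normal(0, 1), α_raw))
    ll += logpdf(Normal(0, 10), β[1]) + logpdf(Normal(0, 1), β[2])
    ll += sum(logpdf(Poisson(exp(α[s[i]] + dot(x[i, :], β))), y[i]) for i in 1:N)
    ll
end
```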
I included a heuristic to avoid it and added more examples of how to customize the initial warmup steps (see #82), but it now works without any of those, just
results = mcmc_with_warmup(Random.GLOBAL_RNG, ∇P, 1000)
The docstring of mcmc_with_warmup now has an example of disabling optimization.
Incidentally, I find ForwardDiff to be faster for this model (and models of a similar size) than Flux. You may want to try and benchmark both options in StatisticalRethinking for small models. Flux and reverse mode AD in general have a comparative advantage in higher dimensions.
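(A sketch, not from the original thread: switching backends only changes the ADgradient call, so the two are easy to benchmark against each other on the same TransformedLogDensity `P`.)

```julia
# ForwardDiff (forward mode): often faster for low-dimensional models.
import ForwardDiff
∇P = ADgradient(:ForwardDiff, P)

# Flux (reverse mode): comparative advantage in higher dimensions.
# ∇P = ADgradient(:Flux, P)
```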
Finally, I would appreciate if you could rerun the StatisticalRethinking models after I merge #82 to see if the heuristic is doing anything bad to any of the models (generally, it should not).