dfm / emcee

The Python ensemble sampling toolkit for affine-invariant MCMC

Home page: https://emcee.readthedocs.io

Oscillatory behavior in the number of iterations per second

jpmvferreira opened this issue · comments

General information:

  • emcee version: 3.0.2
  • platform: Linux (Manjaro)
  • installation method (pip/conda/source/other?): pip under a conda environment

Problem description:

Expected behavior:

The number of iterations per second remains roughly constant throughout the run, or changes smoothly.

Actual behavior:

The number of iterations per second oscillates, and the oscillation gets worse as the number of steps increases.

What have you tried so far?:

Nothing, unsure how to proceed.
The linear regression example in the documentation does not show this behavior, but it's not parallelized.

Minimal example:

I haven't been able to debug this further. The full source code is here; it's mostly based on the tutorials and uses a package that I developed.
Here's a video I recorded of the oscillatory behavior (skip to near the end):
https://youtu.be/tJ1ewHnh1Ns

Please put together a simpler example with no dependencies because I can't help debug issues with external packages.

This is not very minimal but it does show the same behavior:

# set the thread count before importing numpy so the BLAS/OpenMP backend picks it up
import os
os.environ["OMP_NUM_THREADS"] = "1"

# imports
from multiprocessing import Pool
from scipy.integrate import quad
from math import log, pi
import numpy as np
import emcee

# data obtained for "observations" (mock catalog)
redshifts = [4.650597618519434, 1.7187615134799283, 1.96422316313268, 2.279534979583823, 2.4255221814653085, 5.352683861215292, 3.3009039750539038, 1.147489638942757, 3.095528002202573, 7.181594409133277, 5.489403487139625, 1.5667645827080101, 4.336343558668992, 4.411170021297732, 3.0302171724544715]
distances = [38.63665753463308, 13.433364627229862, 14.789874051362968, 18.35210939265398, 17.0556256529821, 46.4392305092797, 27.987621881949003, 7.7684781039518525, 25.560383436360233, 57.984090978826416, 57.96010746060587, 11.348110616357697, 40.23195030252779, 37.45595825024949, 25.975178600400568]
errors = [4.810592486986055, 0.5363050547332151, 0.7228902526237483, 1.3869590285255247, 1.5295952653405556, 6.257341530569792, 2.576815651543119, 0.21133168296086813, 2.3019708974482636, 10.918185604129675, 6.561303625821857, 0.43475055916828353, 4.225452271355636, 4.361248075989948, 2.2182722422373207]

# define the cosmological model we're going to test against observations through the luminosity distance
def E(z, Ωm):
    return (Ωm*(1+z)**3 + (1-Ωm))**0.5

def dL(z, h, Ωm, M, E):
    # "eletromagnetic" light distance
    dLem = (1+z) * (2.9979/h) * quad(lambda Z: 1/E(Z, Ωm), 0, z)[0]  # c/H0 = 2.9979/h Gpc

    # correction to compute the "gravitacional wave" light distance
    factor = 2*6**0.5
    correction = ( (factor + M) / (factor + M/E(z, Ωm)))**0.5  # M in units of H0

    return correction * dLem

# define the likelihood
def ln_likelihood(θ, redshifts, distances, errors, dL, E):
    h, Ωm, M = θ
    N = len(redshifts)

    total = 0
    for i in range(0, N):
        total += -log(errors[i]) - (distances[i] - dL(redshifts[i], h, Ωm, M, E))**2 / (2*errors[i]**2)

    return -N*log(2*pi)/2 + total

# add flat priors to both parameters
def ln_prior(θ):
    h, Ωm, M = θ
    if 0.4 < h < 1 and 0 < Ωm < 1 and -4 < M < 10:
        return 0.0
    return -np.inf

# combine the likelihood and the prior into one expression
def ln_probability(θ, redshifts, distances, errors, dL, E):
    prior = ln_prior(θ)
    if not np.isfinite(prior):
        return -np.inf
    return prior + ln_likelihood(θ, redshifts, distances, errors, dL, E)

# initial position in a gaussian ball around the true values for h and Ωm, and the expected value for M
# initialize the walkers and maximum number of steps
nwalkers = 32
ndim = 3
init = [0.7, 0.3, 2] + (1e-1, 1e-1, 2) * np.random.randn(nwalkers, ndim)
nsteps = 25000

# track how the average autocorrelation time estimate changes
index = 0
autocorr = np.empty(nsteps)

# this will be useful for testing convergence
old_tau = np.inf

# run emcee
print("Running MCMC:")
with Pool() as pool:
    sampler = emcee.EnsembleSampler(nwalkers, ndim, ln_probability, args=(redshifts, distances, errors, dL, E), pool=pool)
    # sample for up to nsteps steps
    for sample in sampler.sample(init, iterations=nsteps, progress=True):
        # check convergence every 100 steps
        if sampler.iteration % 100:
            continue

        # compute the autocorrelation time so far, averaged across all dimensions
        # (tol=0 means we always get an estimate, even if it isn't trustworthy)
        tau = sampler.get_autocorr_time(tol=0)
        autocorr[index] = np.mean(tau)
        index += 1

        # check convergence
        converged = np.all(tau * 500 < sampler.iteration)
        converged &= np.all(np.abs(old_tau - tau) / tau < 0.005)
        if converged:
            break
        old_tau = tau

This is mostly code from the tutorials.
At the beginning I define the observations, then the cosmological model I'm testing, the likelihood, and the prior, and I combine the last two into the probability. After that I run emcee in parallel with a convergence test, effectively combining code from the sections Saving & monitoring progress and Parallelization.
As I said before, I didn't have this issue when running the linear regression example from the section Fitting a model to data, which doesn't have the parallelization + correlation analysis.

Good, thanks! Now, is the period of this "oscillatory behavior" exactly 100 steps? And what happens if you change the hard-coded magic number 100 to something else 😀?

Oh, that was rather obvious in hindsight: it was the correlation analysis all along.
In my mind the autocorrelation time was measured over the last 100 steps, not the entire chain, but that didn't make a lot of sense either.

This adds a new level of complexity to consider if I end up leaving the chain running for a few hours, but I suppose there's no way around it; that's the price to pay if I want to check for convergence.

If your runtime is dominated by the autocorr calculation, I'd recommend doing it much less frequently.
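
For example, something like this (a minimal variation of the loop above, run inside the same with Pool() block; the name check_every is just a placeholder for this sketch):

# check convergence much less frequently, e.g. every 1000 steps
check_every = 1000  # placeholder cadence; tune to taste
for sample in sampler.sample(init, iterations=nsteps, progress=True):
    if sampler.iteration % check_every:
        continue

    # same convergence logic as before, just triggered less often
    tau = sampler.get_autocorr_time(tol=0)
    autocorr[index] = np.mean(tau)
    index += 1

    converged = np.all(tau * 500 < sampler.iteration)
    converged &= np.all(np.abs(old_tau - tau) / tau < 0.005)
    if converged:
        break
    old_tau = tau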

It's not dominated by it, but the cost is definitely there.
As I'm still quite new to this I end up doing a lot of runs; they seem to be quite fast, though. My professor thought my results had taken hours to get, when they actually took minutes (3 parameters, each with a fairly small flat prior spanning a window of about 10), and the convergence criterion is met (sometimes).
Is doing it less frequently more prone to errors?

It won't be prone to errors. It just means that you'll overshoot your definition of "convergence" by a little more, and that's never a bad thing. It's only a trade-off in computational cost, so if you're seeing the effects of the autocorrelation computation in your average runtime, it's probably worth running it less frequently, since your model is fast enough.
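
If you want to see how that cost scales, here's a rough, purely illustrative timing sketch (the fake chains, their shapes, and the step counts below are arbitrary assumptions, not anything from your run):

# rough timing sketch: cost of the autocorrelation estimate vs. chain length
import time
import numpy as np
from emcee.autocorr import integrated_time

for n in (1_000, 10_000, 100_000):
    fake_chain = np.random.randn(n, 32, 3)  # (steps, walkers, ndim), arbitrary
    t0 = time.perf_counter()
    integrated_time(fake_chain, tol=0)  # tol=0: always return an estimate
    print(f"{n:>7d} steps: {time.perf_counter() - t0:.4f} s")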

Cool, I suppose that does it; I'll close this issue as solved. Thanks.

Actually, I have something to ask (hopefully I'm thinking about this correctly).

After going through the source code I saw that get_autocorr_time simply calls autocorr.integrated_time, which means that what we're computing is the integrated autocorrelation time, as mentioned in the documentation.
If we're doing an integral (or rather a sum, since this is discrete), then we shouldn't need the entire chain: if we want the autocorrelation time from iteration 0 to iteration N₂, and we have already computed it from 0 to N₁ (where N₁ < N₂), then all we would have to do is compute the contribution from N₁ to N₂ and add it to the value obtained from 0 to N₁, giving us the autocorrelation time from 0 to N₂ without touching the entire chain.

I think my reasoning is correct because, as we saw in the section Saving & monitoring progress of the documentation, we stop the chain when the estimate changes by less than 1% (and a large enough number of iterations has been done). At that point the mean τ plot shown in that same section flattens into a straight line, meaning the walkers have reached a point where they are walking randomly in the region of (hopefully) maximum likelihood, and as such have no correlation between them.

Is this reasoning correct?
Could something like this be implemented in emcee? For example, instead of tau = sampler.get_autocorr_time(tol=0) we could have something like tau += sampler.get_autocorr_time(tol=0, initial=N₁, end=N₂), where in the previous iteration we did tau = sampler.get_autocorr_time(tol=0, initial=0, end=N₁).

Take a closer look at the sum required to approximate the autocorrelation time... it touches every pair of points in the chain, which is why it's computed using an FFT. There has been some work on online estimation of the autocorrelation time (see, for example, the Goodman & Weare paper), but none of those methods are numerically stable in my experience. If your runtime is noticeably hit by the cost of computing the autocorrelation time, you're doing it far too frequently, like I said above!
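
To make that concrete, here's a rough numpy-only sketch (not emcee's actual code) of this kind of FFT-based estimator; the autocorrelation at every lag is built from the full chain, so there's no running total that could simply be extended with new samples:

import numpy as np

def autocorr_func_1d(x):
    # normalized autocorrelation function of a 1D chain via FFT;
    # every lag mixes together samples from the whole series
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    f = np.fft.fft(x, n=2 * n)                      # zero-padded FFT of the full chain
    acf = np.fft.ifft(f * np.conjugate(f))[:n].real
    return acf / acf[0]

def integrated_time_estimate(x, c=5):
    # integrated autocorrelation time with an automatic window (Sokal-style):
    # tau(M) = 1 + 2 * sum_{t=1..M} rho(t), truncated where M >= c * tau(M)
    rho = autocorr_func_1d(x)
    taus = 2.0 * np.cumsum(rho) - 1.0
    window = np.arange(len(taus)) < c * taus
    m = np.argmin(window) if not window.all() else len(taus) - 1
    return taus[m]

# quick usage example on a correlated toy series (an AR(1) process);
# the true integrated autocorrelation time is (1 + 0.9) / (1 - 0.9) = 19
rng = np.random.default_rng(0)
x = np.zeros(20_000)
for i in range(1, len(x)):
    x[i] = 0.9 * x[i - 1] + rng.normal()
print(integrated_time_estimate(x))  # should come out close to 19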

Ah, I didn't quite understand the equations, and when I saw that last sum it seemed sequential to me.

Thanks again!