pymc-devs / mcbackend

A backend for storing MCMC draws.

Storing and reading inference data with arviz not working

shakasaki opened this issue · comments

Hi there, I've been running mcbackend with ClickHouse and noticed that it is only possible to store the inference data with cloudpickle, but not directly with ArviZ (as netCDF, for example). Here is a minimal reproducible example that uses PyMC. I load the trace and convert it to inference data, but when I try to save it directly (either with the .to_netcdf() call or with ArviZ) I get the following error:
ValueError: unsupported dtype for netCDF4 variable: bool

Indeed, one of the variables in the inference xarray is boolean; see the info below the code.

import clickhouse_driver
import mcbackend
import pymc as pm
import arviz as az
import numpy as np
import cloudpickle

# Initialize random number generator
RANDOM_SEED = 8927
np.random.seed(RANDOM_SEED)
az.style.use("arviz-darkgrid")

# True parameter values
alpha, sigma = 1, 1
beta = [1, 2.5]

# Size of dataset
size = 100

# Predictor variable
X1 = np.random.randn(size)
X2 = np.random.randn(size) * 0.2

# Simulate outcome variable
Y = alpha + beta[0] * X1 + beta[1] * X2 + np.random.randn(size) * sigma

ch_client = clickhouse_driver.Client("localhost")
backend = mcbackend.ClickHouseBackend(ch_client)

with pm.Model():
    alpha = pm.Normal("alpha", mu=0, sigma=10)
    beta = pm.Normal("beta", mu=0, sigma=10, shape=2)
    sigma = pm.HalfNormal("sigma", sigma=1)
    mu = alpha + beta[0] * X1 + beta[1] * X2
    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=Y)
    trace = mcbackend.pymc.TraceBackend(backend)
    pm.sample(trace=trace)

ch_client = clickhouse_driver.Client("localhost")
backend = mcbackend.ClickHouseBackend(ch_client)
run = backend.get_run(trace.run_id)
idata = run.to_inferencedata()

# save with cloudpickle
with open('clickhouse_backend_idata_as_pkl.pkl', mode='wb') as file:
    cloudpickle.dump(idata, file)

with open('clickhouse_backend_idata_as_pkl.pkl', mode='rb') as file:
    instance = cloudpickle.load(file)

print(instance)

# test saving directly
idata.to_netcdf('clickhouse_backend_idata_as_netcdf')
# test saving with arviz
az.to_netcdf(idata, 'clickhouse_backend_idata_as_netcdf_w_az')

# The last two approaches give: ValueError: unsupported dtype for netCDF4 variable: bool

Output of idata.sample_stats

<xarray.Dataset>
Dimensions:                         (chain: 4, draw: 1000)
Coordinates:
  * chain                           (chain) int64 0 1 2 3
  * draw                            (draw) int64 0 1 2 3 4 ... 996 997 998 999
Data variables: (12/18)
    tune                            (chain, draw) bool False False ... False
    sampler_0__depth                (chain, draw) object 2 2 2 2 2 ... 2 2 2 2 2
    sampler_0__step_size            (chain, draw) object 0.9996654167024928 ....
    sampler_0__tune                 (chain, draw) object False False ... False
    sampler_0__mean_tree_accept     (chain, draw) object 0.8445087419794938 ....
    sampler_0__step_size_bar        (chain, draw) object 0.9975006666986591 ....
    ...                              ...
    sampler_0__process_time_diff    (chain, draw) object 0.000682679999999935...
    sampler_0__perf_counter_diff    (chain, draw) object 0.000682451999978184...
    sampler_0__perf_counter_start   (chain, draw) object 888.214387209 ... 88...
    sampler_0__largest_eigval       (chain, draw) object nan nan nan ... nan nan
    sampler_0__smallest_eigval      (chain, draw) object nan nan nan ... nan nan
    sampler_0__index_in_trajectory  (chain, draw) object 2 -1 2 3 ... -1 1 -2 2
Attributes:
    created_at:     2022-08-09T16:06:07.465421
    arviz_version:  0.12.1
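
A possible client-side workaround (just a sketch, not part of mcbackend, and assuming each object array holds homogeneous scalars) would be to coerce the offending variables to concrete dtypes before writing:

import numpy as np

def coerce_dtypes(ds):
    # Hypothetical helper: cast object-dtyped variables to the dtype of
    # their first element, and booleans to int8 (netCDF4 has no bool type).
    for name in list(ds.data_vars):
        var = ds[name]
        if var.dtype == object:
            target = np.asarray(var.values.ravel()[0]).dtype
            if target == bool:
                target = np.dtype("int8")
            ds[name] = var.astype(target)
        elif var.dtype == bool:
            ds[name] = var.astype("int8")
    return ds

coerce_dtypes(idata.sample_stats)
idata.to_netcdf("clickhouse_backend_idata.nc")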

Hi @shakasaki, I had the same problem a few weeks ago and (thought that I had) fixed it in ec167a7.
This was released with version 0.1.2.

When I run your example above, I get a different output for idata.sample_stats. Note that in my case the stats are properly dtyped, while your printout shows dtype=object arrays.
Here I can also do idata.to_netcdf("trace.nc") without problems.

<xarray.Dataset>
Dimensions:                         (chain: 4, draw: 1000)
Coordinates:
  * chain                           (chain) int32 0 1 2 3
  * draw                            (draw) int32 0 1 2 3 4 ... 996 997 998 999
Data variables: (12/18)
    tune                            (chain, draw) bool False False ... False
    sampler_0__depth                (chain, draw) int64 2 2 2 2 2 ... 1 2 2 2 2
    sampler_0__step_size            (chain, draw) float64 0.8567 ... 1.111
    sampler_0__tune                 (chain, draw) bool False False ... False
    sampler_0__mean_tree_accept     (chain, draw) float64 0.7567 ... 0.6179
    sampler_0__step_size_bar        (chain, draw) float64 0.9874 ... 1.051
    ...                              ...
    sampler_0__process_time_diff    (chain, draw) float64 0.0 0.0 ... 0.0
    sampler_0__perf_counter_diff    (chain, draw) float64 0.0008707 ... 0.000...
    sampler_0__perf_counter_start   (chain, draw) float64 20.52 20.52 ... 10.75
    sampler_0__largest_eigval       (chain, draw) float64 nan nan ... nan nan
    sampler_0__smallest_eigval      (chain, draw) float64 nan nan ... nan nan
    sampler_0__index_in_trajectory  (chain, draw) int64 -1 -2 0 3 ... 2 -1 1 -2
Attributes:
    created_at:     2022-08-09T21:41:38.654444
    arviz_version:  0.12.1

This is purely a client-side matter, so you can keep your model running.
Assuming you're running something <0.1.2, an update should fix it :)

Hi @michaelosthege, and thanks for the feedback. Indeed, I had v0.1.1 and have now pulled and installed the latest version. But with the new version I am getting an error that I did not have before. Previously I could convert the data from the mcbackend trace to inference data without complaints (and I was actually saving it successfully as a dataframe), but now I get the following:

import clickhouse_driver
import mcbackend

ch_client = clickhouse_driver.Client("localhost")
backend = mcbackend.ClickHouseBackend(ch_client)
run = backend.get_run('N747HN')
run.to_inferencedata()



Chains vary in length. Lenghts are: {'N747HN_chain_0': 89, 'N747HN_chain_1': 89, 'N747HN_chain_2': 90, 'N747HN_chain_3': 89}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/alexisshakas/git/mcbackend/mcbackend/core.py", line 217, in to_inferencedata
    idata = from_dict(
  File "/home/alexisshakas/.conda/envs/bedretto/lib/python3.8/site-packages/arviz/data/io_dict.py", line 435, in from_dict
    return DictConverter(
  File "/home/alexisshakas/.conda/envs/bedretto/lib/python3.8/site-packages/arviz/data/io_dict.py", line 335, in to_inference_data
    "posterior": self.posterior_to_xarray(),
  File "/home/alexisshakas/.conda/envs/bedretto/lib/python3.8/site-packages/arviz/data/base.py", line 65, in wrapped
    return func(cls)
  File "/home/alexisshakas/.conda/envs/bedretto/lib/python3.8/site-packages/arviz/data/io_dict.py", line 106, in posterior_to_xarray
    dict_to_dataset(
  File "/home/alexisshakas/.conda/envs/bedretto/lib/python3.8/site-packages/arviz/data/base.py", line 307, in dict_to_dataset
    data_vars[key] = numpy_to_data_array(
  File "/home/alexisshakas/.conda/envs/bedretto/lib/python3.8/site-packages/arviz/data/base.py", line 254, in numpy_to_data_array
    return xr.DataArray(ary, coords=coords, dims=dims)
  File "/home/alexisshakas/.conda/envs/bedretto/lib/python3.8/site-packages/xarray/core/dataarray.py", line 402, in __init__
    coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
  File "/home/alexisshakas/.conda/envs/bedretto/lib/python3.8/site-packages/xarray/core/dataarray.py", line 121, in _infer_coords_and_dims
    raise ValueError(
ValueError: different number of dimensions on data and dims: 2 vs 3

I should mention here that I use a custom likelihood function, so I am actually not registering any data arrays in PyMC in the "default" way; instead I compute the likelihood from data stored in a dataframe and call an external function.

ValueError: different number of dimensions on data and dims: 2 vs 3

This actually sounds like a problem with your model! There seems to be a model variable with named dims whose shape differs from the corresponding dim lengths.
Unfortunately, xarray doesn't tell us which variable.

I would recommend doing the following:

with your_model:
    idata = pm.sample(
        tune=2, draws=3,
        step=pm.Metropolis(),
        compute_convergence_checks=False,
    )

This should take just a moment and then run into the same error without even using McBackend.

If you can confirm this, start looking at the variable shapes (see the sketch after this list):

  • your_model.RV_dims holds the dims tuple for each variable.
  • your_model.dim_lengths holds symbolic dimension lengths. You can .eval() them and compare them with the variable shapes.
  • Adding assert some_var.eval().shape == (1, 2, 3) after each newly created variable can also help to identify the one that's incorrect.
  • Similarly, assert your_model.dim_lengths["some_dim"].eval() == 123 confirms the dim lengths.
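
A hypothetical loop that combines these checks (your_model is a placeholder; this is not an official PyMC utility):

# Print each variable's declared dims next to the dim lengths the
# model expects, to spot the mismatching one.
for rv_name, dims in your_model.RV_dims.items():
    expected = tuple(
        None if d is None else int(your_model.dim_lengths[d].eval())
        for d in dims
    )
    print(f"{rv_name}: dims={dims} -> expected lengths {expected}")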

A long-term solution would be to add shape checks to the .to_inferencedata() code in McBackend.
There we could raise more informative errors that name the variable, the dims, and the shapes that don't match.
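
Such a check could look roughly like the following (an illustrative sketch, not actual McBackend code):

def assert_dims_match(var_name, draws, dims, dim_lengths):
    # draws is assumed to be a (chain, draw, *shape) array and
    # dim_lengths a mapping from dim name to integer length.
    expected = tuple(dim_lengths[d] for d in dims)
    actual = draws.shape[2:]
    if actual != expected:
        raise ValueError(
            f"Variable '{var_name}' has shape {actual}, "
            f"but its dims {dims} imply {expected}."
        )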

Thanks for the quick feedback! I tried your suggestion and actually I do not run into an error. I checked the dimensions and it all seems to make sense. I am surprised, because this was not happening before I upgraded to v0.1.2; back then I could convert the trace to inference data without problems.

The other thing I notice is that the chains have different lengths at the time when I fetch the data. Could this be linked to my problem?

>>> run.to_inferencedata( ... )
Chains vary in length. Lenghts are: {'N747HN_chain_0': 110, 'N747HN_chain_1': 111, 'N747HN_chain_2': 111, 'N747HN_chain_3': 111}

I don't see how the different chain lengths would result in a "different number of dimensions on data and dims: 2 vs 3" as the error message claims, but on the other hand there was a change in this part between versions 0.1.1 and 0.1.2:

In c26a55e I started truncating retrieved variables/stats to the length of the respective chain, because I was experiencing database inserts while downloading.
But this only truncates within each chain.

Could you locally test whether truncating to the shortest chain makes a difference?
That would mean moving this line out of the for iteration. You could just copy the .to_inferencedata method into a Jupyter notebook for testing.

If this indeed fixes the problem, we should truncate to the shortest chain, but my prior here is that ArviZ tolerates uneven chain lengths.

Also, just as a positive side note: things are about to improve with McBackend, because we're just beginning to ramp up using it internally :)

So even if things are not perfectly smooth yet, you can expect this to get better over the coming days.

You mean to move
clen = chain_lengths[chain.cid]

outside the loop? That won't work because chain.cid is defined inside the loop (I can't get it to work).

Instead, I tried forcing the chain length to a value smaller than the minimum across all chains, and then I can get run.to_inferencedata() to work! In this case I set
clen = 100 (my smallest chain is 111 and the others are 112).

I guess this happens because my forward runs take so long to evaluate that the chains end up with unequal lengths.

You mean to move clen = chain_lengths[chain.cid] outside the loop? That won't work because chain.cid is defined inside the loop (I can't get it to work).

We'll just need a clen = min(chain_lengths.values()) outside the loop.
Do you want to make a PR?

In the meantime I'll ask some ArviZ people to confirm that ArviZ doesn't work with uneven chain lengths.
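
In code, the proposed change would amount to something like this simplified sketch of the .to_inferencedata() loop (chains, var_names, and posterior are placeholders following this discussion, not necessarily the actual source):

# Truncate every chain to the shortest observed length so that all
# (chain, draw, *shape) arrays stack cleanly afterwards.
chain_lengths = {chain.cid: len(chain) for chain in chains}
clen = min(chain_lengths.values())  # moved out of the per-chain loop
posterior = {}
for chain in chains:
    for var_name in var_names:
        posterior.setdefault(var_name, []).append(
            chain.get_draws(var_name)[:clen]
        )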

Yes, this works for me, even though I still get the warning:

WARNING (aesara.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
Chains vary in length. Lenghts are: {'N747HN_chain_0': 114, 'N747HN_chain_1': 115, 'N747HN_chain_2': 115, 'N747HN_chain_3': 115}
/home/alexisshakas/.conda/envs/bedretto/lib/python3.8/site-packages/arviz/data/base.py:220: UserWarning: More chains (4) than draws (0). Passed array should have shape (chains, draws, *shape)
  warnings.warn(

I'll create the pull request, thank you for the help!

UserWarning: More chains (4) than draws (0).

0 draws? That's strange.

Allegedly InferenceData supports uneven chain lengths by filling up with nan, but it might still be a good idea to truncate, because the algorithms & visualizations probably aren't very robust in this regard.
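
If one does end up with a nan-padded InferenceData, it can also be truncated after the fact; in this sketch, clen is a hypothetical, manually chosen cutoff:

# Keep only the first clen draws in every group that has a draw dimension.
clen = 110
idata_trunc = idata.isel(draw=slice(0, clen))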

Yes, that's strange indeed. I think it's because my simulation is still in the warm-up period and everything is stored in idata.warmup_**. I can now create the InferenceData object with the added truncation of the chains and save it from the ClickHouse backend to memory :) You can see in the warmup stats and posterior below that there are draws, but idata.posterior and idata.sample_stats have no draws yet. As I said, my forward solver is slow and the inference takes very long to run (I started a run a couple of days ago and I am at 6.41% [615/9600 39:10:05<572:14:14 Sampling 4 chains, 0 divergences]).


>>> idata.warmup_sample_stats
<xarray.Dataset>
Dimensions:              (chain: 4, draw: 152)
Coordinates:
  * chain                (chain) int64 0 1 2 3
  * draw                 (draw) int64 0 1 2 3 4 5 6 ... 146 147 148 149 150 151
Data variables:
    tune                 (chain, draw) bool True True True ... True True True
    sampler_0__accept    (chain, draw) float64 inf 5.396e+161 ... inf 5.515e+22
    sampler_0__accepted  (chain, draw) float64 0.9683 0.9206 ... 0.7143 0.5079
    sampler_0__tune      (chain, draw) bool True True True ... True True True
    sampler_0__scaling   (chain, draw) float64 1.0 1.0 1.0 ... 2.594 2.594 2.594
    sampler_1__accept    (chain, draw) float64 inf 2.691e+167 ... 1.273e+141
    sampler_1__accepted  (chain, draw) float64 0.9683 0.9048 ... 0.5397 0.6825
    sampler_1__tune      (chain, draw) bool True True True ... True True True
    sampler_1__scaling   (chain, draw) float64 1.0 1.0 1.0 ... 2.487 2.487 2.487
Attributes:
    created_at:     2022-08-10T23:27:33.399799
    arviz_version:  0.12.1

>>> idata.warmup_posterior
<xarray.Dataset>
Dimensions:   (chain: 4, draw: 152, dip: 63, azimuth: 63)
Coordinates:
  * chain     (chain) int64 0 1 2 3
  * draw      (draw) int64 0 1 2 3 4 5 6 7 8 ... 144 145 146 147 148 149 150 151
  * dip       (dip) <U13 'dipMB1u144p43' 'dipMB1u160p18' ... 'dipST2u194p69'
  * azimuth   (azimuth) <U13 'aziMB1u144p43' 'aziMB1u160p18' ... 'aziST2u194p69'
Data variables:
    dips      (chain, draw, dip) float64 0.0 -1.612 0.5782 ... -7.069 17.83
    azimuths  (chain, draw, azimuth) float64 0.4381 0.6715 ... -4.774 12.31
Attributes:
    created_at:     2022-08-10T23:27:33.392228
    arviz_version:  0.12.1 

Here's what I get for idata.posterior and idata.sample_stats:

>>> idata.posterior
<xarray.Dataset>
Dimensions:   (chain: 4, draw: 0, dip: 63, azimuth: 63)
Coordinates:
  * chain     (chain) int64 0 1 2 3
  * draw      (draw) int64 
  * dip       (dip) <U13 'dipMB1u144p43' 'dipMB1u160p18' ... 'dipST2u194p69'
  * azimuth   (azimuth) <U13 'aziMB1u144p43' 'aziMB1u160p18' ... 'aziST2u194p69'
Data variables:
    dips      (chain, draw, dip) float64 
    azimuths  (chain, draw, azimuth) float64 
Attributes:
    created_at:     2022-08-10T23:36:16.087904
    arviz_version:  0.12.1
>>> idata.sample_stats
<xarray.Dataset>
Dimensions:              (chain: 4, draw: 0)
Coordinates:
  * chain                (chain) int64 0 1 2 3
  * draw                 (draw) int64 
Data variables:
    tune                 (chain, draw) bool 
    sampler_0__accept    (chain, draw) float64 
    sampler_0__accepted  (chain, draw) float64 
    sampler_0__tune      (chain, draw) bool 
    sampler_0__scaling   (chain, draw) float64 
    sampler_1__accept    (chain, draw) float64 
    sampler_1__accepted  (chain, draw) float64 
    sampler_1__tune      (chain, draw) bool 
    sampler_1__scaling   (chain, draw) float64 
Attributes:
    created_at:     2022-08-10T23:36:16.092556
    arviz_version:  0.12.1