tensorflow / probability

Probabilistic reasoning and statistical analysis in TensorFlow

Home Page: https://www.tensorflow.org/probability/


Distributions are not independent for Empirical

sfo opened this issue · comments

While the documentation states that

The first k dimensions index into a batch of independent distributions

this is not the case in the implementation.
As the following MWE shows, the samples drawn for every distribution in the batch are always identical:

import numpy as np
import tensorflow_probability as tfp

a = np.random.randint(0, 100, size=1000)
# Batch of two distributions built from the same underlying samples.
dist = tfp.distributions.Empirical([a, a])
n = 3
dist.sample(n)

# output
# <tf.Tensor: shape=(3, 2), dtype=int32, numpy=
# array([[66, 66],
#        [33, 33],
#        [98, 98]], dtype=int32)>

The root cause seems to be at this location, where indices are sampled only once but re-used for every distribution:

indices = samplers.uniform([n], maxval=self._compute_num_samples(samples),
                           dtype=tf.int32, seed=seed)
draws = tf.gather(samples, indices, axis=self._samples_axis)

I now fear that there may be more issues in the implementation of Empirical, rendering it unusable in contexts where I require independence between distributions.

The output shape also seems strange: it is (n, k), whereas I would expect (k, n).

Maybe I also completely misunderstood the documentation.

Wow, yeah, that's a pretty egregious bug. Thank you for flagging. There's pretty good test coverage of this distribution, including, eg, statistical tests of mean and variance, even in the presence of batch shapes. But nothing checks independence of the samples across batches! Should be a straightforward-ish fix (will need to sample indices for each batch dimension and change the gather to a gather_nd, which is always fun :)).
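Roughly what that fix could look like, sketched in NumPy for illustration (hypothetical names; the real patch would use samplers.uniform and tf.gather_nd over the actual batch/samples axes): draw an independent index per (sample, batch) pair, then gather per batch member.

```python
import numpy as np

rng = np.random.default_rng(42)
# Batch of two empirical sample sets, shape (batch=2, num_samples=1000).
samples = np.stack([np.arange(1000), np.arange(1000)])
n = 3

batch_size, num_samples = samples.shape
# Independent index draw per (sample, batch) pair, not one shared vector.
indices = rng.integers(0, num_samples, size=(n, batch_size))     # (3, 2)

# NumPy analogue of a gather_nd over (batch, index) pairs:
batch_idx = np.broadcast_to(np.arange(batch_size), (n, batch_size))
draws = samples[batch_idx, indices]                              # (3, 2)
```

With independent per-batch indices, the two columns of draws are no longer forced to coincide.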

Are you interested in trying to submit a patch? No pressure, just don't want to duplicate work.