danijar / dreamerv2

Mastering Atari with Discrete World Models

Home Page: https://danijar.com/dreamerv2


Difference in the KL loss terms in the paper and the code

shivakanthsujit opened this issue · comments

The KL balancing algorithm in the paper gives the posterior and prior terms as kl_loss = compute_kl(stop_grad(posterior), prior), so I had assumed the code would compute the loss as value = kld(dist(sg(post)), dist(prior)).

But the code has the terms reversed, formulating the KL loss (in networks.py, line 168) as value = kld(dist(prior), dist(sg(post))).

Does that have something to do with how the KL divergence function is implemented in tensorflow_probability?
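
For concreteness, the two orderings being compared could be sketched like this, assuming post_logits and prior_logits are the categorical latent logits; the names and shapes here are illustrative, not the repository's exact API:

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Illustrative shapes: batch of 16, 32 categorical latents with 32 classes each.
post_logits = tf.random.normal([16, 32, 32])
prior_logits = tf.random.normal([16, 32, 32])

dist = lambda logits: tfd.Independent(tfd.OneHotCategorical(logits=logits), 1)
sg = tf.stop_gradient

# Ordering from the paper's pseudocode: KL(stop_grad(posterior) || prior).
kl_paper = tfd.kl_divergence(dist(sg(post_logits)), dist(prior_logits))

# Ordering quoted from networks.py: KL(prior || stop_grad(posterior)).
kl_code = tfd.kl_divergence(dist(prior_logits), dist(sg(post_logits)))
```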

KL balancing is implemented as a weighted average of two terms: the KL with a stop-gradient on the prior and the KL with a stop-gradient on the posterior.

The value you found in the code is only used for logging; it is not the quantity the gradient is computed from.
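
A minimal sketch of that weighted-average loss, reusing the illustrative post_logits and prior_logits from the sketch above and a balancing weight alpha (0.8 in the paper); the helper name balanced_kl is hypothetical, not the repository's API:

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
dist = lambda logits: tfd.Independent(tfd.OneHotCategorical(logits=logits), 1)
sg = tf.stop_gradient

def balanced_kl(post_logits, prior_logits, alpha=0.8):
  # Term with the posterior stopped: gradients only train the prior
  # toward the posterior.
  kl_prior = tfd.kl_divergence(dist(sg(post_logits)), dist(prior_logits))
  # Term with the prior stopped: gradients only regularize the posterior
  # toward the prior.
  kl_post = tfd.kl_divergence(dist(post_logits), dist(sg(prior_logits)))
  # Weighted average of the two terms; this is the quantity the gradient
  # is taken of. Numerically both terms are the same KL; the stop-gradients
  # only change which side receives gradients in the backward pass.
  return alpha * kl_prior + (1 - alpha) * kl_post
```

With alpha = 0.8, the prior is pulled toward the posterior more strongly than the posterior is regularized toward the prior, which is the stated purpose of KL balancing in the paper.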