danijar / dreamerv2

Mastering Atari with Discrete World Models

Home Page: https://danijar.com/dreamerv2

Does the actor-critic train using only the stochastic state?

lewisboyd opened this issue · comments

Hi,

I'm very interested in your work, but I'm unclear on whether the actor-critic is trained using only the stochastic state as its observation, or whether it also uses the recurrent state. What's the reasoning behind this choice?

Thanks for all your work and for putting it on Github!

Hey, it gets both as input. In a POMDP, it definitely needs the GRU state, because that summarizes the history of observations. Empirically, it does not seem to matter much whether it also receives the stochastic sample or not.
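
To make that concrete, here is a minimal sketch of feeding the actor the concatenation of the deterministic GRU state $h_t$ and the flattened stochastic sample $z_t$. This is not the repository's actual code; the layer sizes, names, and architecture below are assumptions for illustration.

```python
import tensorflow as tf

# Hypothetical sizes, for illustration only.
deter_size, stoch_size, action_size = 600, 32 * 32, 6

def actor_features(deter, stoch):
  # The actor sees both the deterministic GRU state h_t and the
  # (flattened) stochastic sample z_t, concatenated into one vector.
  return tf.concat([deter, stoch], axis=-1)

actor = tf.keras.Sequential([
  tf.keras.layers.Dense(400, activation='elu'),
  tf.keras.layers.Dense(400, activation='elu'),
  tf.keras.layers.Dense(action_size),  # logits of the action distribution
])

h_t = tf.zeros([1, deter_size])  # deterministic (GRU) state
z_t = tf.zeros([1, stoch_size])  # stochastic sample, flattened
logits = actor(actor_features(h_t, z_t))
```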

Hi @danijar, after reading the code and the paper, I am confused. In the paper, Fig. 2 shows that the learned prior $\hat{z}_t$ is used for imagination, and in Equation (3) the actor takes $\hat{z}_t$ as input. However, in the code, I found that the actor uses the posterior $z_t$ as input together with $h_t$.

It seems they are different. Could you please help me to understand it?

During imagination training, the actor takes both the GRU state and a sample from the prior as input. During environment interaction, the actor takes both the GRU state and a sample from the posterior as input.

We use the prior during imagination because we don't know the corresponding observations. We use the posterior during environment interaction because we know the current observation.
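
As a rough sketch of that distinction (the component names, sizes, and Gaussian latents below are assumptions for illustration; DreamerV2 itself uses categorical latents), the only difference between the two cases is which distribution the sample $z$ comes from:

```python
import tensorflow as tf

# Toy stand-ins for the RSSM pieces; names and shapes are assumptions.
deter_size, stoch_size = 200, 30
gru = tf.keras.layers.GRUCell(deter_size)
prior_net = tf.keras.layers.Dense(2 * stoch_size)      # conditioned on h_t only
posterior_net = tf.keras.layers.Dense(2 * stoch_size)  # conditioned on h_t and o_t

def sample(stats):
  mean, std = tf.split(stats, 2, axis=-1)
  return mean + tf.nn.softplus(std) * tf.random.normal(tf.shape(mean))

def imagine_step(h, z, action):
  # Imagination: no observation is available, so z comes from the prior.
  h, _ = gru(tf.concat([z, action], axis=-1), [h])
  return h, sample(prior_net(h))

def observe_step(h, z, action, obs_embed):
  # Environment interaction: the observation embedding is available,
  # so z comes from the posterior instead.
  h, _ = gru(tf.concat([z, action], axis=-1), [h])
  return h, sample(posterior_net(tf.concat([h, obs_embed], axis=-1)))
```

In both cases the actor would then receive the concatenation of `h` and the returned sample `z`.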

The prior and posterior are trained to be close to each other using the KL loss.
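
For completeness, a minimal sketch of such a KL term with the paper's KL balancing (Gaussian latents and the 0.8 mixing weight are simplifying assumptions here, matching the paper's default; the actual model uses categorical latents):

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

def kl_loss(post_mean, post_std, prior_mean, prior_std, balance=0.8):
  post = tfd.Normal(post_mean, post_std)
  prior = tfd.Normal(prior_mean, prior_std)
  post_sg = tfd.Normal(tf.stop_gradient(post_mean), tf.stop_gradient(post_std))
  prior_sg = tfd.Normal(tf.stop_gradient(prior_mean), tf.stop_gradient(prior_std))
  # Pull the prior toward the (gradient-stopped) posterior with most of the
  # weight, and lightly regularize the posterior toward the stopped prior.
  value = (balance * tfd.kl_divergence(post_sg, prior) +
           (1 - balance) * tfd.kl_divergence(post, prior_sg))
  return tf.reduce_mean(value)
```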

@danijar I see. Thanks for the clarification.