danijar / dreamerv2

Mastering Atari with Discrete World Models

Home Page: https://danijar.com/dreamerv2

Does the actor-critic train using only the stochastic state?

lewisboyd opened this issue · comments

Hi,

I'm very interested in your work, but I'm unclear on whether the actor-critic is trained using only the stochastic state as its observation, or whether it also uses the recurrent state. What's the reasoning behind this choice?

Thanks for all your work and for putting it on Github!

Hey, it gets both as input. In a POMDP, it definitely needs the GRU state, because that summarizes the history of observations. Empirically, it does not seem to matter much whether it also receives the stochastic sample or not.
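
To make that concrete, here is a minimal sketch of feeding the actor the concatenation of the deterministic GRU state $h_t$ and the flattened stochastic sample $z_t$. This is not the repository's actual code; the layer sizes, names, and architecture below are assumptions for illustration.

```python
import tensorflow as tf

# Hypothetical sizes, for illustration only.
deter_size, stoch_size, action_size = 600, 32 * 32, 6

def actor_features(deter, stoch):
  # The actor sees both the deterministic GRU state h_t and the
  # (flattened) stochastic sample z_t, concatenated into one vector.
  return tf.concat([deter, stoch], axis=-1)

actor = tf.keras.Sequential([
  tf.keras.layers.Dense(400, activation='elu'),
  tf.keras.layers.Dense(400, activation='elu'),
  tf.keras.layers.Dense(action_size),  # logits of the action distribution
])

h_t = tf.zeros([1, deter_size])  # deterministic (GRU) state
z_t = tf.zeros([1, stoch_size])  # stochastic sample, flattened
logits = actor(actor_features(h_t, z_t))
```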

Hi @danijar, after reading the code and the paper, I am confused. In the paper, Fig. 2 shows that the learned prior $\hat{z}_t$ is used for imagination, and in Equation (3) the actor takes $\hat{z}_t$ as input. However, in the code, I found that the actor uses the posterior $z_t$ as input together with $h_t$.

It seems they are different. Could you please help me to understand it?

During imagination training, the actor takes both the GRU state and a sample from the prior as input. During environment interaction, the actor takes both the GRU state and a sample from the posterior as input.

We use the prior during imagination because we don't know the corresponding observations. We use the posterior during environment interaction because we know the current observation.
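
As a rough sketch of that distinction (the component names, sizes, and Gaussian latents below are assumptions for illustration; DreamerV2 itself uses categorical latents), the only difference between the two cases is which distribution the sample $z$ comes from:

```python
import tensorflow as tf

# Toy stand-ins for the RSSM pieces; names and shapes are assumptions.
deter_size, stoch_size = 200, 30
gru = tf.keras.layers.GRUCell(deter_size)
prior_net = tf.keras.layers.Dense(2 * stoch_size)      # conditioned on h_t only
posterior_net = tf.keras.layers.Dense(2 * stoch_size)  # conditioned on h_t and o_t

def sample(stats):
  mean, std = tf.split(stats, 2, axis=-1)
  return mean + tf.nn.softplus(std) * tf.random.normal(tf.shape(mean))

def imagine_step(h, z, action):
  # Imagination: no observation is available, so z comes from the prior.
  h, _ = gru(tf.concat([z, action], axis=-1), [h])
  return h, sample(prior_net(h))

def observe_step(h, z, action, obs_embed):
  # Environment interaction: the observation embedding is available,
  # so z comes from the posterior instead.
  h, _ = gru(tf.concat([z, action], axis=-1), [h])
  return h, sample(posterior_net(tf.concat([h, obs_embed], axis=-1)))
```

In both cases the actor would then receive the concatenation of `h` and the returned sample `z`.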

The prior and posterior are trained to be close to each other using the KL loss.
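
For completeness, a minimal sketch of such a KL term with the paper's KL balancing (Gaussian latents and the 0.8 mixing weight are simplifying assumptions here, matching the paper's default; the actual model uses categorical latents):

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

def kl_loss(post_mean, post_std, prior_mean, prior_std, balance=0.8):
  post = tfd.Normal(post_mean, post_std)
  prior = tfd.Normal(prior_mean, prior_std)
  post_sg = tfd.Normal(tf.stop_gradient(post_mean), tf.stop_gradient(post_std))
  prior_sg = tfd.Normal(tf.stop_gradient(prior_mean), tf.stop_gradient(prior_std))
  # Pull the prior toward the (gradient-stopped) posterior with most of the
  # weight, and lightly regularize the posterior toward the stopped prior.
  value = (balance * tfd.kl_divergence(post_sg, prior) +
           (1 - balance) * tfd.kl_divergence(post, prior_sg))
  return tf.reduce_mean(value)
```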

@danijar I see. Thanks for the clarification.