danijar / dreamerv2

As I understand it, the posterior from the last timestep of one batch is used as the start state for the next batch.
Is this intended? If so, is the purpose just to avoid always initializing the start state to zeros, so that it instead models a random sample from the current latent distribution?

dreamerv2/dreamerv2/agent.py

Line 60 in 07d906e

state, outputs, mets = self.wm.train(data, state)

The carried-over state is only used when is_first is False at the beginning of a training batch. By default, is_first is always True there, so the world model resets its hidden state (in the RSSM class). However, this implementation can also support training with truncated backpropagation through time on sequences that are too long to fit into memory at once.
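
To make the reset behaviour concrete, here is a minimal NumPy sketch of the idea. It is not the actual RSSM code: `toy_step` and `toy_train` are hypothetical stand-ins, and the tanh recurrence is just a placeholder for the real model. The only point is that masking the previous state with is_first makes the carried-over state irrelevant whenever a batch starts a new sequence.

```python
import numpy as np

def toy_step(prev_state, obs, is_first):
    """One step of a toy recurrent model (hypothetical, not the real RSSM)."""
    # Zero the incoming state for every batch row that starts a new sequence.
    mask = 1.0 - is_first.astype(np.float32)[:, None]
    prev_state = prev_state * mask
    return np.tanh(prev_state + obs)

def toy_train(data, state=None):
    """Roll the toy model over a batch and return the final state."""
    obs, is_first = data["obs"], data["is_first"]  # shapes [B, T, D] and [B, T]
    batch, length, dim = obs.shape
    if state is None:
        state = np.zeros((batch, dim), np.float32)
    for t in range(length):
        state = toy_step(state, obs[:, t], is_first[:, t])
    return state

rng = np.random.default_rng(0)
obs = (0.1 * rng.normal(size=(2, 4, 3))).astype(np.float32)
carried = np.ones((2, 3), np.float32)  # pretend this came from the previous batch

# Default case: is_first is True at the first timestep of every batch row,
# so the carried state is masked out and has no effect.
is_first = np.zeros((2, 4), dtype=bool)
is_first[:, 0] = True
reset_with_carry = toy_train({"obs": obs, "is_first": is_first}, carried)
reset_from_zeros = toy_train({"obs": obs, "is_first": is_first}, None)

# Truncated-BPTT case: the batch continues the previous chunk, so is_first is
# False everywhere and the carried state does influence the result.
no_reset = np.zeros((2, 4), dtype=bool)
continued = toy_train({"obs": obs, "is_first": no_reset}, carried)
```

In this sketch, `reset_with_carry` and `reset_from_zeros` are identical, while `continued` differs because the state from the previous chunk is actually used.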

Why share states across random batches for training the world model?