facebookresearch / Pearl

A Production-ready Reinforcement Learning AI Agent Library brought by the Applied Reinforcement Learning team at Meta.


PolicyLearner batch_size versus episode_steps clarification

GreatArcStudios opened this issue

Hi all,

This library looks great, and it strikes a good balance between software abstractions and the underlying math. I would, however, like some clarification regarding batch_size and episode_steps for the policy learner.

I decided to step through the PPO integration test and noticed that the batch_size used in _actor_learn_batch is not the batch_size set in the test (64); instead, it is 14. You can see this in the screenshot:

[screenshot omitted]

That said, the mismatch in batch_size can be traced to the run_episode function, where episode_steps == 14. I'm wondering if this is intentional, and if so, how we should interpret it. Here is the screenshot for run_episode:

[screenshot omitted]

Thanks again for the library!

Great question. We will clarify this further in our next documentation release.

The short answer is that we consider PPO an on-policy algorithm, and hence we currently use an on-policy replay buffer that only contains data from the last episode. As a result, the number of episode steps overrides the configured batch size.
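As a rough illustration of the mechanics (a minimal sketch, not Pearl's actual replay buffer implementation; all names here are made up), an on-policy buffer of this kind simply hands the learner everything collected in the last episode, so the leading dimension of the learning batch equals episode_steps:

```python
import torch


class OnPolicyEpisodeBuffer:
    """Sketch: stores only the current episode; cleared after each learning step."""

    def __init__(self) -> None:
        self._transitions: list[dict] = []

    def push(self, obs, action, reward, done) -> None:
        self._transitions.append(
            {
                "obs": torch.as_tensor(obs, dtype=torch.float32),
                "action": torch.as_tensor(action),
                "reward": torch.as_tensor(reward, dtype=torch.float32),
                "done": torch.as_tensor(done, dtype=torch.bool),
            }
        )

    def sample_all(self) -> dict:
        # The "batch" is every step of the last episode stacked together, so its
        # leading dimension equals episode_steps (e.g. 14), regardless of the
        # nominal batch_size (e.g. 64) configured on the policy learner.
        batch = {
            key: torch.stack([t[key] for t in self._transitions])
            for key in ("obs", "action", "reward", "done")
        }
        self._transitions.clear()  # on-policy: data is used once, then discarded
        return batch
```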

Please let us know if you have any further questions.

Oh I see, that does make sense then. As a follow-up, there also appears to be a discrepancy between the shapes of the inputs passed into the PPO networks and the DQN network. For example, the PPO input is a vector, i.e., has shape (observation_dim), whereas for DQN the input is a matrix of shape (action_dim, observation_dim + action_dim). This behaviour seems intentional, as the DQN input matrix shape appears to be derived from the Q-value estimation update rules. However, because the PPO network input has no batch-size dimension, it is harder to build or use architectures like LSTM or MHA from libraries that require a batch dimension, e.g., PyTorch. It does seem like we can just unsqueeze the input as needed, though? Or will the library be updated to make both shapes consistent?
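For context, here is the kind of workaround I had in mind (an illustrative sketch using a plain PyTorch LSTM; the shapes and variable names are my own assumptions, not Pearl's API):

```python
import torch
import torch.nn as nn

observation_dim, hidden_dim = 8, 32
lstm = nn.LSTM(input_size=observation_dim, hidden_size=hidden_dim, batch_first=True)

obs = torch.randn(observation_dim)       # PPO-style input: shape (observation_dim,)
obs = obs.unsqueeze(0).unsqueeze(0)      # -> (batch=1, seq_len=1, observation_dim)
output, (h_n, c_n) = lstm(obs)           # output: (1, 1, hidden_dim)
features = output.squeeze(0).squeeze(0)  # back to (hidden_dim,) for the policy head
```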

Ah, you are touching on a core competency of our library! Our state representation is not tied to any particular algorithm, so it is universal even if you want LSTMs or transformers. We will share a slide deck after our NeurIPS presentation, as it will help explain a lot of these design choices and the magic we put in. Stay tuned.

To your question: we don't require any unsqueeze operations for sequence models; a single history summarization module covers it all! We will come back to this issue and send you the slides once we are done with NeurIPS. Thanks!
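As a conceptual sketch only (the class below is hypothetical and not Pearl's actual module), a history summarization layer of this kind can hide the batch and sequence dimensions from the rest of the agent, so callers keep passing flat per-step observations:

```python
import torch
import torch.nn as nn


class LSTMHistorySummarizer(nn.Module):
    """Hypothetical example: summarizes observation history with an internal LSTM."""

    def __init__(self, observation_dim: int, hidden_dim: int) -> None:
        super().__init__()
        self._lstm = nn.LSTM(observation_dim, hidden_dim, batch_first=True)
        self._hidden = None  # (h, c) carried across steps of an episode

    def reset(self) -> None:
        self._hidden = None  # call at the start of each new episode

    def forward(self, observation: torch.Tensor) -> torch.Tensor:
        # Accepts a flat (observation_dim,) tensor; the batch/sequence dims
        # are added and removed internally, so the caller never sees them.
        x = observation.view(1, 1, -1)
        out, self._hidden = self._lstm(x, self._hidden)
        return out.view(-1)  # summarized state of shape (hidden_dim,)
```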

Gotcha, thanks! I'll stay tuned 😀