facebookresearch / Pearl

A Production-ready Reinforcement Learning AI Agent Library brought by the Applied Reinforcement Learning team at Meta.


PolicyLearner batch_size versus episode_steps clarification

GreatArcStudios opened this issue

Hi all,

This library looks great, and it strikes a good balance between software abstractions and the underlying math. I would, however, like some clarification regarding batch_size and episode_steps for the policy learner.

I decided to step through the PPO integration test and noticed that the batch_size used in _actor_learn_batch is not the batch_size set in the test (64); instead, it is 14. You can see this in the screenshot:

[screenshot omitted]

That said, the mismatch in batch_size can be traced to the run_episode function, where episode_steps == 14. I'm wondering if this is intentional, and if so, how we should interpret it. Here is the screenshot for run_episode:

[screenshot omitted]

Thanks again for the library!

Great question. We will clarify this further in our next documentation release.

The short answer is that we consider PPO an on-policy algorithm, and hence we currently use an on-policy replay buffer that only contains data from the last episode. As a result, the number of episode steps overrides the configured batch size.
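As a rough illustration of the mechanics (a minimal sketch, not Pearl's actual replay buffer implementation; all names here are made up), an on-policy buffer of this kind simply hands the learner everything collected in the last episode, so the leading dimension of the learning batch equals episode_steps:

```python
import torch


class OnPolicyEpisodeBuffer:
    """Sketch: stores only the current episode; cleared after each learning step."""

    def __init__(self) -> None:
        self._transitions: list[dict] = []

    def push(self, obs, action, reward, done) -> None:
        self._transitions.append(
            {
                "obs": torch.as_tensor(obs, dtype=torch.float32),
                "action": torch.as_tensor(action),
                "reward": torch.as_tensor(reward, dtype=torch.float32),
                "done": torch.as_tensor(done, dtype=torch.bool),
            }
        )

    def sample_all(self) -> dict:
        # The "batch" is every step of the last episode stacked together, so its
        # leading dimension equals episode_steps (e.g. 14), regardless of the
        # nominal batch_size (e.g. 64) configured on the policy learner.
        batch = {
            key: torch.stack([t[key] for t in self._transitions])
            for key in ("obs", "action", "reward", "done")
        }
        self._transitions.clear()  # on-policy: data is used once, then discarded
        return batch
```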

Please let us know if you have any further questions.

Oh I see, that does make sense then. As a follow-up, there also appears to be a discrepancy between the shapes of the inputs passed into the PPO networks and the DQN network. For example, the PPO input is a vector, i.e., has shape (observation_dim), whereas for DQN the input is a matrix of shape (action_dim, observation_dim + action_dim). This behaviour seems intentional, as the DQN input matrix shape appears to be derived from the Q-value estimation update rules. However, because the PPO network input has no batch-size dimension, it is harder to build or use architectures like LSTM or MHA from libraries that require a batch dimension, e.g., PyTorch. It does seem like we can just unsqueeze the input as needed, though? Or will the library be updated to make both shapes consistent?
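For context, here is the kind of workaround I had in mind (an illustrative sketch using a plain PyTorch LSTM; the shapes and variable names are my own assumptions, not Pearl's API):

```python
import torch
import torch.nn as nn

observation_dim, hidden_dim = 8, 32
lstm = nn.LSTM(input_size=observation_dim, hidden_size=hidden_dim, batch_first=True)

obs = torch.randn(observation_dim)       # PPO-style input: shape (observation_dim,)
obs = obs.unsqueeze(0).unsqueeze(0)      # -> (batch=1, seq_len=1, observation_dim)
output, (h_n, c_n) = lstm(obs)           # output: (1, 1, hidden_dim)
features = output.squeeze(0).squeeze(0)  # back to (hidden_dim,) for the policy head
```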

Ah, you are touching on a core competency of our library! Our state representation is not tied to any particular algorithm, so it is universal even if you want LSTMs or transformers. We will share a slide deck after our NeurIPS presentation, as it will help explain a lot of these design choices and the magic we put in. Stay tuned.

To your question: we don't require any unsqueeze operations for sequence models; a single history summarization module covers it all! We will come back to this issue and send you the slides once we are done with NeurIPS. Thanks!
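As a conceptual sketch only (the class below is hypothetical and not Pearl's actual module), a history summarization layer of this kind can hide the batch and sequence dimensions from the rest of the agent, so callers keep passing flat per-step observations:

```python
import torch
import torch.nn as nn


class LSTMHistorySummarizer(nn.Module):
    """Hypothetical example: summarizes observation history with an internal LSTM."""

    def __init__(self, observation_dim: int, hidden_dim: int) -> None:
        super().__init__()
        self._lstm = nn.LSTM(observation_dim, hidden_dim, batch_first=True)
        self._hidden = None  # (h, c) carried across steps of an episode

    def reset(self) -> None:
        self._hidden = None  # call at the start of each new episode

    def forward(self, observation: torch.Tensor) -> torch.Tensor:
        # Accepts a flat (observation_dim,) tensor; the batch/sequence dims
        # are added and removed internally, so the caller never sees them.
        x = observation.view(1, 1, -1)
        out, self._hidden = self._lstm(x, self._hidden)
        return out.view(-1)  # summarized state of shape (hidden_dim,)
```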

Gotcha, thanks! I'll stay tuned 😀