nslyubaykin / rnns_for_pomdp


Recurrent Policies for Handling Partially Observable Environments with ReLAx

This repository contains an implementation of the PPO-GAE algorithm with a lagged LSTM policy (and critic), and compares it with a 0-lag MLP PPO-GAE baseline.
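A minimal conceptual sketch of such a recurrent actor is given below. It does not reproduce the repository's actual ReLAx implementation; the class name `LaggedLSTMPolicy`, the lag-window input format, and the hidden size are assumptions made for illustration only.

```python
import torch
import torch.nn as nn


class LaggedLSTMPolicy(nn.Module):
    """Illustrative recurrent actor that consumes a short lag (window)
    of past observations. Hypothetical sketch, not the ReLAx API."""

    def __init__(self, obs_dim, act_dim, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_size, batch_first=True)
        self.mean_head = nn.Linear(hidden_size, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs_lag):
        # obs_lag: (batch, lag, obs_dim) - the most recent observations
        out, _ = self.lstm(obs_lag)
        mean = self.mean_head(out[:, -1])        # use the last hidden state
        std = self.log_std.exp().expand_as(mean)
        return torch.distributions.Normal(mean, std)
```

Feeding the policy a lag of observations (rather than only the current one) is what lets the LSTM infer state information that a single masked observation no longer carries.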

To simulate partial observability in a controlled manner, a gym.Wrapper was created that masks elements of the observation array with zeros with probability eps. In our experiments, the degree of partial observability was controlled by varying the value of eps; a sketch of such a wrapper is shown below.
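A minimal sketch of the masking wrapper, assuming a gym.ObservationWrapper subclass (the class name `MaskObservationWrapper` and the default eps value are placeholders, not the repository's actual code):

```python
import gym
import numpy as np


class MaskObservationWrapper(gym.ObservationWrapper):
    """Zero out each observation element independently with probability eps."""

    def __init__(self, env, eps=0.25):
        super().__init__(env)
        self.eps = eps

    def observation(self, obs):
        obs = np.asarray(obs, dtype=np.float32)
        # Bernoulli(eps) mask: True -> element is hidden (set to zero)
        mask = np.random.rand(*obs.shape) < self.eps
        return np.where(mask, 0.0, obs)
```

It could then be applied as, e.g., `env = MaskObservationWrapper(gym.make("Pendulum-v1"), eps=0.5)`, where the environment name here is only an example.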

The experiment results are shown below:

(Figure: pomdp_comparison — learning curves for the MLP and LSTM policies at different eps values)

As we can see, for the fully observable case (eps=0) the MLP and LSTM policies show roughly the same performance. For a moderate degree of partial observability (eps=0.25), the LSTM policy learns slightly faster in the early stages. For a considerable degree of partial observability (eps=0.5), the LSTM policy performs significantly better than the MLP policy; however, both actors struggled to converge to the asymptotic performance of the fully observable case. For a severe degree of partial observability (eps=0.75), both policies failed to learn.