bsuite_tutorial problem when build PPO OpenAI baseline agent

Question

lingjunz opened this issue 5 years ago

I ran into a small problem when building the PPO OpenAI baselines agent in the bsuite_tutorial.

I set up the environment to log results to a CSV file using the following code:

import bsuite
from baselines.common.vec_env import dummy_vec_env
from baselines.ppo2 import ppo2
from bsuite.utils import gym_wrapper
import tensorflow as tf

SAVE_PATH_PPO = './demo_results/bsuite/ppo'

def _load_env():
    # Load the bsuite environment and log results to CSV under SAVE_PATH_PPO.
    raw_env = bsuite.load_and_record(
        bsuite_id='bandit_noise/0',
        save_path=SAVE_PATH_PPO, logging_mode='csv', overwrite=True)
    # Wrap the dm_env environment in a Gym interface for baselines.
    return gym_wrapper.GymFromDMEnv(raw_env)

env = dummy_vec_env.DummyVecEnv([_load_env])

This produced a bsuite_id_-_bandit_noise-0.csv file like this:

steps,episode,total_return,episode_len,episode_return,total_regret
1,1,[49.09808016],1,[0.67640523],[51.5]
2,2,[49.09808016],1,[0.74001572],[51.5]
3,3,[49.09808016],1,[0.7978738],[51.5]
4,4,[49.09808016],1,[0.62408932],[51.5]

When I ran the next cell, I got an assertion error.

ppo2.learn(
    env=env, network='mlp', lr=1e-3, gamma=.99,
    total_timesteps=10000, nsteps=100)

**output**
input shape is (1, 1)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-2-d47907e196cf> in <module>
      1 ppo2.learn(
      2     env=env, network='mlp', lr=1e-3, gamma=.99,
----> 3     total_timesteps=10000, nsteps=100)

~/anaconda3/envs/drl/lib/python3.6/site-packages/baselines/ppo2/ppo2.py in learn(network, env, total_timesteps, eval_env, seed, nsteps, ent_coef, lr, vf_coef, max_grad_norm, gamma, lam, log_interval, nminibatches, noptepochs, cliprange, save_interval, load_path, model_fn, **network_kwargs)
    177             # or if it's just worse than predicting nothing (ev =< 0)
    178 #             print( returns.shape,values.shape)
--> 179             ev = explained_variance(values, returns)
    180             logger.logkv("misc/serial_timesteps", update*nsteps)
    181             logger.logkv("misc/nupdates", update)

~/anaconda3/envs/drl/lib/python3.6/site-packages/baselines/common/math_util.py in explained_variance(ypred, y)
     34 
     35     """
---> 36     assert y.ndim == 1 and ypred.ndim == 1
     37     vary = np.var(y)
     38     return np.nan if vary==0 else 1 - np.var(y-ypred)/vary

AssertionError:

I found that this is due to mismatched shapes of values (100, 1) and returns (10000, 1) just before the call to explained_variance(values, returns), which asserts that both arguments are one-dimensional.
After I added one line in 'baselines/ppo2/runner.py', it seems to run correctly.

...
        # batch of steps to batch of rollouts
        mb_obs = np.asarray(mb_obs, dtype=self.obs.dtype)
        mb_rewards = np.asarray(mb_rewards, dtype=np.float32)
        mb_actions = np.asarray(mb_actions)
        mb_values = np.asarray(mb_values, dtype=np.float32)
        mb_values = mb_values.reshape(mb_rewards.shape)  # <<< add this line

        mb_neglogpacs = np.asarray(mb_neglogpacs, dtype=np.float32)
        mb_dones = np.asarray(mb_dones, dtype=np.bool)
        last_values = self.model.value(tf.constant(self.obs))._numpy()
...
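
The effect of that one-line reshape can be reproduced outside baselines with plain NumPy. The sketch below is only an illustration: `explained_variance` copies the three lines of `math_util.py` visible in the traceback, while the synthetic `mb_rewards` and `mb_values` arrays, and in particular the extra trailing axis on `mb_values`, are assumptions and not the arrays the runner actually produces.

```python
import numpy as np

def explained_variance(ypred, y):
    # Same assertion and formula as the math_util.py lines shown in the traceback.
    assert y.ndim == 1 and ypred.ndim == 1
    vary = np.var(y)
    return np.nan if vary == 0 else 1 - np.var(y - ypred) / vary

nsteps = 100  # matches nsteps in the ppo2.learn call above
rng = np.random.default_rng(0)

# Synthetic stand-ins; the extra trailing axis on mb_values is an assumption.
mb_rewards = rng.normal(size=(nsteps,)).astype(np.float32)
mb_values = rng.normal(size=(nsteps, 1)).astype(np.float32)

# With a 2-D array the ndim check fails, as in the reported traceback.
try:
    explained_variance(mb_values, mb_rewards)
    assertion_raised = False
except AssertionError:
    assertion_raised = True

# The one-line fix: give the values the same shape as the rewards.
mb_values_fixed = mb_values.reshape(mb_rewards.shape)
ev = explained_variance(mb_values_fixed, mb_rewards)

print("assertion raised before reshape:", assertion_raised)
print("shape after reshape:", mb_values_fixed.shape)
print("explained variance:", ev)
```

Reshaping to the shape of a reference array, rather than calling `flatten()`, keeps the fix valid if the reference array has more than one axis, which is presumably why the patch uses `mb_rewards.shape`.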

Final result:

Stepping environment...
--------------------------------------------
| eplenmean               | nan            |
| eprewmean               | nan            |
| fps                     | 271            |
| loss/approxkl           | 2.5486004e-08  |
| loss/clipfrac           | 0.0            |
| loss/policy_entropy     | 2.3978922      |
| loss/policy_loss        | -2.7894964e-09 |
| loss/value_loss         | 0.061606925    |
| misc/explained_variance | 0              |
| misc/nupdates           | 100            |
| misc/serial_timesteps   | 10000          |
| misc/time_elapsed       | 37.5           |
| misc/total_timesteps    | 10000          |
--------------------------------------------

P.S. I am using TensorFlow 2.1.0, and I checked out the tf2 branch after cloning baselines.

John Aslanides · Answer 1 · Fri Mar 13 2020 07:36:49 GMT+0800 (China Standard Time)

Hi there! Thanks for the detailed bug report. It seems like this is potentially an issue with the PPO baseline, which is outside the scope of bsuite.

I do notice you mention that you're using TF2, but as far as I can tell, the OpenAI baselines require TF 1.x to run -- could this be part of the issue?
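
One quick way to confirm which TensorFlow major version is active in the notebook environment is sketched below. The helper names `major_version` and `installed_tf_version` are hypothetical, introduced only for this illustration; the check reads `tf.__version__` and does not by itself establish which versions a given baselines branch supports.

```python
def major_version(version_string):
    # "2.1.0" -> 2; assumes a dotted version string with a numeric first field.
    return int(version_string.split(".")[0])

def installed_tf_version():
    # Return TensorFlow's version string, or None if it cannot be imported.
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.__version__

version = installed_tf_version()
if version is None:
    print("TensorFlow is not installed in this environment")
else:
    print("TensorFlow", version, "(major version", str(major_version(version)) + ")")
```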