bsuite_tutorial problem when building the PPO OpenAI Baselines agent
lingjunz opened this issue · comments
I ran into a small problem when building the PPO OpenAI Baselines agent in the bsuite_tutorial.
- I logged results to a CSV file using the following code:
import bsuite
from baselines.common.vec_env import dummy_vec_env
from baselines.ppo2 import ppo2
from bsuite.utils import gym_wrapper
import tensorflow as tf

SAVE_PATH_PPO = './demo_results/bsuite/ppo'

def _load_env():
    # Load the bsuite environment with CSV logging, then wrap it as a Gym env.
    raw_env = bsuite.load_and_record(
        bsuite_id='bandit_noise/0',
        save_path=SAVE_PATH_PPO, logging_mode='csv', overwrite=True)
    return gym_wrapper.GymFromDMEnv(raw_env)

env = dummy_vec_env.DummyVecEnv([_load_env])
- I got a bsuite_id_-_bandit_noise-0.csv file like this:
steps,episode,total_return,episode_len,episode_return,total_regret
1,1,[49.09808016],1,[0.67640523],[51.5]
2,2,[49.09808016],1,[0.74001572],[51.5]
3,3,[49.09808016],1,[0.7978738],[51.5]
4,4,[49.09808016],1,[0.62408932],[51.5]
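The bracketed cells in the file above look like one-element arrays written out as text, which suggests the logged returns were not plain scalars. A minimal sketch for reading such cells back as floats, using only the standard library. `to_number` is a hypothetical helper introduced here for illustration, not part of bsuite, and the sample is copied from the rows above:

```python
import csv
import io

# Two sample rows copied from the CSV excerpt above.
SAMPLE = """steps,episode,total_return,episode_len,episode_return,total_regret
1,1,[49.09808016],1,[0.67640523],[51.5]
2,2,[49.09808016],1,[0.74001572],[51.5]
"""

def to_number(cell):
    # Hypothetical helper: drop surrounding brackets, then parse as float.
    return float(cell.strip().strip("[]"))

reader = csv.DictReader(io.StringIO(SAMPLE))
rows = [{key: to_number(value) for key, value in row.items()} for row in reader]

print(rows[0]["episode_return"])  # 0.67640523
```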
- When I ran the next cell, there was an assertion error.
ppo2.learn(
    env=env, network='mlp', lr=1e-3, gamma=.99,
    total_timesteps=10000, nsteps=100)
**output**
input shape is (1, 1)
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-2-d47907e196cf> in <module>
1 ppo2.learn(
2 env=env, network='mlp', lr=1e-3, gamma=.99,
----> 3 total_timesteps=10000, nsteps=100)
~/anaconda3/envs/drl/lib/python3.6/site-packages/baselines/ppo2/ppo2.py in learn(network, env, total_timesteps, eval_env, seed, nsteps, ent_coef, lr, vf_coef, max_grad_norm, gamma, lam, log_interval, nminibatches, noptepochs, cliprange, save_interval, load_path, model_fn, **network_kwargs)
177 # or if it's just worse than predicting nothing (ev =< 0)
178 # print( returns.shape,values.shape)
--> 179 ev = explained_variance(values, returns)
180 logger.logkv("misc/serial_timesteps", update*nsteps)
181 logger.logkv("misc/nupdates", update)
~/anaconda3/envs/drl/lib/python3.6/site-packages/baselines/common/math_util.py in explained_variance(ypred, y)
34
35 """
---> 36 assert y.ndim == 1 and ypred.ndim == 1
37 vary = np.var(y)
38 return np.nan if vary==0 else 1 - np.var(y-ypred)/vary
AssertionError:
- I found this is due to mismatched shapes of values (100, 1) and returns (10000, 1) before the call to explained_variance(values, returns).
- When I add one line in 'baselines/ppo2/runner.py', it seems to run correctly.
...
#batch of steps to batch of rollouts
mb_obs = np.asarray(mb_obs, dtype=self.obs.dtype)
mb_rewards = np.asarray(mb_rewards, dtype=np.float32)
mb_actions = np.asarray(mb_actions)
mb_values = np.asarray(mb_values, dtype=np.float32)
mb_values = mb_values.reshape(mb_rewards.shape)  # <<< add this line
mb_neglogpacs = np.asarray(mb_neglogpacs, dtype=np.float32)
mb_dones = np.asarray(mb_dones, dtype=np.bool)
last_values = self.model.value(tf.constant(self.obs))._numpy()
...
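The effect of that reshape can be reproduced with NumPy alone. The sketch below is not the baselines code: `flatten_rollout` is a hypothetical stand-in for a step that merges the step and environment axes, and the extra trailing axis on `mb_values` is an assumption about one way a (100, 1) shape could arise. `explained_variance` copies only the check and formula visible in the traceback above.

```python
import numpy as np

def explained_variance(ypred, y):
    # Same check and formula as the lines shown in the traceback.
    assert y.ndim == 1 and ypred.ndim == 1
    vary = np.var(y)
    return np.nan if vary == 0 else 1 - np.var(y - ypred) / vary

def flatten_rollout(arr):
    # Hypothetical helper: merge the (nsteps, nenvs) axes into one.
    s = arr.shape
    return arr.swapaxes(0, 1).reshape(s[0] * s[1], *s[2:])

nsteps, nenvs = 100, 1  # nsteps from the ppo2.learn call; one env in DummyVecEnv

# Stand-in buffers; real rollouts would hold rewards and value estimates.
mb_rewards = np.arange(nsteps * nenvs, dtype=np.float32).reshape(nsteps, nenvs)
mb_values = mb_rewards[..., None].copy()  # assumed extra trailing axis: (100, 1, 1)

flat_rewards = flatten_rollout(mb_rewards)
flat_bad = flatten_rollout(mb_values)
print(flat_bad.shape)    # (100, 1): two-dimensional, so the assertion fails

# The one-line fix: give the values the same shape as the rewards first.
fixed_values = mb_values.reshape(mb_rewards.shape)
flat_fixed = flatten_rollout(fixed_values)
print(flat_fixed.shape)  # (100,)

ev = explained_variance(flat_fixed, flat_rewards)
```

With the shapes aligned, both arrays are one-dimensional after flattening and the `ndim == 1` check passes.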
- final result
Stepping environment...
--------------------------------------------
| eplenmean | nan |
| eprewmean | nan |
| fps | 271 |
| loss/approxkl | 2.5486004e-08 |
| loss/clipfrac | 0.0 |
| loss/policy_entropy | 2.3978922 |
| loss/policy_loss | -2.7894964e-09 |
| loss/value_loss | 0.061606925 |
| misc/explained_variance | 0 |
| misc/nupdates | 100 |
| misc/serial_timesteps | 10000 |
| misc/time_elapsed | 37.5 |
| misc/total_timesteps | 10000 |
--------------------------------------------
- P.S. I am using TF 2.1.0 and checked out the tf2 branch after cloning baselines.
Hi there! Thanks for the detailed bug report. It seems like this is potentially an issue with the ppo baseline, which is outside the scope of bsuite.
I do notice you mention that you're using TF2, but as far as I can tell, the OpenAI baselines require TF 1.x to run -- could this be part of the issue?