Goal: Implement PPO for Car Racing environment
- Policy Gradient
- Natural Policy Gradient
- Trust Region Policy Optimization [TRPO]
- Proximal Policy Optimization [PPO]
-
Environments are basically categorized into two types
- Episodic
  - A list of states s, actions u, rewards r, and of course new states s'
- Continuous
  - No terminal state
-
Two ways of learning
-
Monte Carlo Approach
- Collect rewards at the end of the episode
- Calculate the maximum expected future reward
- Gets better with each iteration
  V(s_t) <- V(s_t) + α(R_t - V(s_t))
- Maximum expected future reward starting from this state <- former estimate + learning rate * (return - former estimate)
- Problem with this approach: because rewards are only calculated at the end of every episode, we average over all actions; even if some bad actions were taken, they get averaged in as good actions whenever the episode's end result was good.
- Every problem has a solution:
- Actor Critic: hybrid between policy based and value based methods
- Proximal Policy Optimization: ensures that the deviation from the previous policy stays relatively small
-
Temporal Difference
- Estimate the value at each step; the estimate gets better with each step
  V(s_t) <- V(s_t) + α(R_{t+1} + γV(s_{t+1}) - V(s_t))
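A minimal tabular sketch of the two update rules in Python (illustrative only; the state encoding, episode format, and variable names are assumptions, not part of the project code):

import numpy as np

num_states = 10
alpha, gamma = 0.1, 0.99
V = np.zeros(num_states)  # one value estimate per state

def monte_carlo_update(episode):
    """Monte Carlo: update V only after the episode ends.
    episode: list of (state, reward) pairs in the order they were visited."""
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G                 # discounted return from this state
        V[state] += alpha * (G - V[state])     # V(s_t) <- V(s_t) + α(R_t - V(s_t))

def td_update(state, reward, next_state):
    """Temporal Difference: update V at every step using the bootstrapped target."""
    target = reward + gamma * V[next_state]    # R_{t+1} + γV(s_{t+1})
    V[state] += alpha * (target - V[state])    # V(s_t) <- V(s_t) + α(target - V(s_t))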
-
However, we always need to balance the tradeoffs between exploration & exploitation.
-
Basically we have 3 approaches to RL
- Value based (e.g. Q-Learning & DQN, Value Iteration)
  - Goal: optimize the value function V(s)
  - V(s) tells us the maximum expected future reward the agent will get at each state
- Policy based
  - Goal: directly optimize the policy function
  - action = π(state)
  - The policy π might be deterministic or stochastic; stochastic is better as it smooths the probability distribution over actions
- Model based
  - Model the environment and its behavior
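To make the contrast concrete, here is a toy sketch of how action selection differs between the first two families (Q and policy are hypothetical lookup tables, not anything from this project):

import numpy as np

num_actions = 4
Q = np.random.rand(100, num_actions)        # hypothetical state-action values for 100 states
policy = np.full((100, num_actions), 0.25)  # hypothetical stochastic policy (uniform here)
state = 0

# Value based: act greedily with respect to the learned values
value_based_action = np.argmax(Q[state])

# Policy based: sample directly from the learned action distribution
policy_based_action = np.random.choice(num_actions, p=policy[state])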
-
In this project, I am implementing the policy based approach
-
My motivation
- I have already explored value based approaches (Q-Learning)
- I know nothing about the model based approach
- The policy based approach works very well for continuous action spaces and is more effective in high-dimensional observation spaces
- Convergence: value based methods tend to oscillate while training
- Policy based methods follow the policy gradient, trying to find the best parameters and smoothing the update at each step
- Finally, they learn stochastic policies, so there is no need to implement an explore/exploit strategy
-
There is no free lunch
- It might converge to a local maximum
- It takes more time to train compared to value based methods
-
Policy based RL approach:
- Instead of learning a value function that tells us the expected sum of rewards given a state and action, we learn the policy function directly, mapping states to actions and selecting actions without using the value function.
- The value function V(s) can help us optimize the policy, but it does not select an action for us.
- The policy π has parameters θ; policy based methods are essentially optimization problems in which we search for the best parameters to maximize a score function.
- How?
  - Measure the quality of the policy with a policy score function (an objective function that computes the expected reward of the policy)
  - Use policy gradient ascent (gradient descent on the negative objective) to find the best policy parameters and improve our policy, as sketched below
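In symbols, using the notation above (this is the standard policy gradient formulation, not anything specific to this repository):

J(θ) = E_{π_θ}[ Σ_t γ^t r_t ]                        (score function: expected discounted reward)
∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(a_t | s_t) · A_t ]   (A_t is the advantage; plain REINFORCE uses the return R_t)
θ <- θ + α ∇_θ J(θ)                                  (gradient ascent step with learning rate α)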
-
For a random agent, run in the command prompt:
python -u CarRacing-run_random_agent.py
The action space is Box(3,), which is [steering, gas, brake]
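For reference, a minimal random-agent loop looks roughly like this (a sketch assuming the classic Gym API; the actual CarRacing-run_random_agent.py may differ):

import gym

env = gym.make('CarRacing-v0')
state = env.reset()
done = False
while not done:
    env.render()
    action = env.action_space.sample()            # random [steering, gas, brake] from Box(3,)
    state, reward, done, info = env.step(action)
env.close()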
-
For the implementation of PPO [based on A3C]
-
Create command line interface
import argparse

def parse_arg():
    parser = argparse.ArgumentParser(
        description="Trains an agent in the CarRacing-v0 environment with proximal policy optimization")
    # Hyper-parameters
    parser.add_argument("--initial_lr", type=float, default=3e-4)
    parser.add_argument("--discount_factor", type=float, default=0.99)
    parser.add_argument("--gae_lambda", type=float, default=0.95)
    parser.add_argument("--ppo_epsilon", type=float, default=0.2)
    parser.add_argument("--value_scale", type=float, default=0.5)
    parser.add_argument("--entropy_scale", type=float, default=0.01)
    parser.add_argument("--horizon", type=int, default=128)
    parser.add_argument("--num_epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=128)
    parser.add_argument("--num_envs", type=int, default=16)
    # Training vars
    parser.add_argument("--model_name", type=str, default='CarRacing-v0')
    parser.add_argument("--save_interval", type=int, default=1000)
    parser.add_argument("--eval_interval", type=int, default=200)
    parser.add_argument("--record_episodes", type=bool, default=True)
    parser.add_argument("-restart", action="store_true")
    params = vars(parser.parse_args())
    return params
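Any of these hyper-parameters can then be overridden from the shell, for example (the script name train.py is an assumption; the flags are the ones defined above):

python train.py --num_envs 8 --batch_size 64 -restart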
-
The input states, as mentioned before, are of shape Box(96, 96, 3). We crop away uninformative parts of the frame (down to 84x84), convert to grayscale so that the color channels do not add extra parameters to the computation, and then normalize and rescale the pixel values to [-1, 1]:

import numpy as np

def crop(frame):
    # Crop the 96x96 frame to 84x84 (drop the bottom status bar and the side borders)
    return frame[:-12, 6:-6]

def rgb2grayscale(frame):
    # Convert RGB to grayscale using the standard luminance weights
    return np.dot(frame[..., 0:3], [0.299, 0.587, 0.114])

def normalize(frame):
    return frame / 255.0

def preprocess_frame(frame):
    frame = crop(frame)
    frame = rgb2grayscale(frame)
    frame = normalize(frame)
    frame = frame * 2 - 1   # rescale from [0, 1] to [-1, 1]
    return frame
-
Start the training part
-
Create env
env = gym.make('CarRacing-v0')
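Since the rollouts use num_envs parallel agents (the --num_envs argument above), the training script needs several environment instances. A minimal in-process sketch (the actual code may use a subprocess-based vectorized environment wrapper instead):

import gym

num_envs = 16  # matches the --num_envs default
envs = [gym.make('CarRacing-v0') for _ in range(num_envs)]
states = [e.reset() for e in envs]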
-
Set the training parameters (initial_lr, discount_factor, gae_lambda, ppo_epsilon, value_scale, entropy_scale, horizon, num_epochs, batch_size, num_envs)
# Training parameters
initial_lr = params["initial_lr"]
discount_factor = params["discount_factor"]
gae_lambda = params["gae_lambda"]
ppo_epsilon = params["ppo_epsilon"]
value_scale = params["value_scale"]
entropy_scale = params["entropy_scale"]
horizon = params["horizon"]
num_epochs = params["num_epochs"]
batch_size = params["batch_size"]
num_envs = params["num_envs"]
-
Set env constants (frame_stack_size, input_shape, num_actions, action_min, action_max)
# Environment constants
frame_stack_size = 4
input_shape = (84, 84, frame_stack_size)
num_actions = test_env.action_space.shape[0]
action_min = test_env.action_space.low
action_max = test_env.action_space.high
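The network input is a stack of the last frame_stack_size preprocessed frames. A small helper along these lines can maintain that stack (a sketch with illustrative names, not necessarily the project's implementation):

from collections import deque
import numpy as np

class FrameStack:
    """Keeps the last `size` preprocessed frames as one (84, 84, size) network input."""
    def __init__(self, size=4):  # size = frame_stack_size
        self.frames = deque(maxlen=size)

    def reset(self, frame):
        frame = preprocess_frame(frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)          # fill the stack with the first frame
        return self.get()

    def add(self, frame):
        self.frames.append(preprocess_frame(frame))
        return self.get()

    def get(self):
        return np.stack(self.frames, axis=-1)  # shape (84, 84, frame_stack_size)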
-
Create the model
model = PPO(input_shape, num_actions, action_min, action_max,
            epsilon=ppo_epsilon, value_scale=value_scale,
            entropy_scale=entropy_scale, model_name=model_name)
-
Create policy gradient train function
- Create placeholders for returns and advantage.
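A minimal sketch of those placeholders, assuming TensorFlow 1.x as in the rest of the code (shapes and names are illustrative):

self.returns = tf.placeholder(shape=(None,), dtype=tf.float32, name="returns_placeholder")
self.advantage = tf.placeholder(shape=(None,), dtype=tf.float32, name="advantage_placeholder")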
-
Calculate the probability ratio
- r_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t)
- r_t(θ) = exp( log( π_θ(a_t | s_t) / π_θold(a_t | s_t) ) )
- r_t(θ) = exp( log π_θ(a_t | s_t) - log π_θold(a_t | s_t) )

self.prob_ratio = tf.exp(
    self.policy.action_log_prob - self.policy_old.action_log_prob)
-
Validate values
-
Policy loss
# Clipped surrogate objective: L_CLIP = E[ min(r_t(θ)·A_t, clip(r_t(θ), 1-ε, 1+ε)·A_t) ]
adv = tf.expand_dims(self.advantage, axis=-1)
self.policy_loss = tf.reduce_mean(tf.minimum(
    self.prob_ratio * adv,
    tf.clip_by_value(self.prob_ratio, 1.0 - epsilon, 1.0 + epsilon) * adv))
-
Value loss

# Squared error between the predicted state value and the computed returns, scaled by value_scale
self.value_loss = tf.reduce_mean(tf.squared_difference(
    tf.squeeze(self.policy.value), self.returns)) * value_scale
-
Entropy loss

# Entropy of the Gaussian action distribution, encouraging exploration
self.entropy_loss = tf.reduce_mean(tf.reduce_sum(
    self.policy.action_normal.entropy(), axis=-1)) * entropy_scale
-
Total loss

# Minimizing this maximizes the clipped policy objective and the entropy bonus while minimizing the value error
self.loss = -self.policy_loss + self.value_loss - self.entropy_loss
-
Policy parameters

# Collect the trainable variables of the current and the old policy networks
policy_params = tf.get_collection(
    tf.GraphKeys.TRAINABLE_VARIABLES, scope="policy/")
policy_old_params = tf.get_collection(
    tf.GraphKeys.TRAINABLE_VARIABLES, scope="policy_old/")
-
Minimize the loss

# Only the current policy's variables are optimized; the old policy is refreshed by copying (below)
self.learning_rate = tf.placeholder(
    shape=(), dtype=tf.float32, name="lr_placeholder")
self.optimizer = tf.train.AdamOptimizer(
    learning_rate=self.learning_rate)
self.train_step = self.optimizer.minimize(
    self.loss, var_list=policy_params)
-
Update network parameters

# Copy the current policy's weights into the old policy (θ_old <- θ)
self.update_op = tf.group(
    [dst.assign(src) for src, dst in zip(policy_params, policy_old_params)])
-
Log
- Check whether there is earlier training data (a saved checkpoint) to resume from
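A minimal sketch of resuming from a checkpoint, assuming TensorFlow 1.x, an existing session sess, and an illustrative checkpoint directory (the project's actual save/restore logic may differ):

saver = tf.train.Saver()
checkpoint = tf.train.latest_checkpoint("./models/" + model_name)  # hypothetical directory layout
if checkpoint is not None and not params["restart"]:
    saver.restore(sess, checkpoint)                # resume earlier training
else:
    sess.run(tf.global_variables_initializer())    # start from scratch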
-
Create the agents
- Let every agent play the game for a number of steps (horizon)
  - Predict the action and value for the given state
    - Get the state of each agent
    - Predict each action and perform it
    - Sample the action from a Gaussian distribution
    - Store the state, action, and reward
    - Get the new state
  - Calculate the last values (bootstrap values), used for the returns and GAE advantages (see the sketch after this list)
  - Flatten the arrays
  - Train for a number of epochs
  - Update the old policy
  - Evaluate the model
    - make_video, based on a flag
    - Fetch the current state
    - Predict the action
    - Compute the returns
    - Compute the value error
    - Log the values
  - Save the model
  - Sample mini-batches randomly
  - Optimize the network
  - Adjust the learning rate
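A sketch of how the returns and advantages can be computed with Generalized Advantage Estimation (GAE), consistent with the gae_lambda and discount_factor hyper-parameters above; array shapes and names are illustrative, not the project's exact code:

import numpy as np

def compute_gae(rewards, values, bootstrap_values, dones, gamma=0.99, lam=0.95):
    """rewards, values, dones: arrays of shape (horizon, num_envs);
    bootstrap_values: value estimates for the state after the last step, shape (num_envs,)."""
    horizon = rewards.shape[0]
    values = np.concatenate([values, bootstrap_values[None, :]], axis=0)
    advantages = np.zeros_like(rewards, dtype=np.float32)
    last_adv = np.zeros(rewards.shape[1], dtype=np.float32)
    for t in reversed(range(horizon)):
        non_terminal = 1.0 - dones[t]
        # TD error: δ_t = r_t + γ·V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * non_terminal - values[t]
        # GAE: A_t = δ_t + γ·λ·A_{t+1} (reset at episode boundaries)
        last_adv = delta + gamma * lam * non_terminal * last_adv
        advantages[t] = last_adv
    returns = advantages + values[:-1]  # targets for the value function
    return returns, advantages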
-
- After training for 280,000 episodes (equivalent to more than 36 hours) on GPU - NVIDIA GeForce GTX 950M -