seungeunrho / minimalRL

Implementations of basic RL algorithms with minimal lines of code! (PyTorch based)

The ratio in ppo.py should be detach()?

dedekinds opened this issue

hi, I think the ratio in ppo.py should be ratio.detach().

hi dedekinds, then where do you think the policy network should be updated?

```python
# GAE: deltas are accumulated backwards; delta has already been detached
# (converted to a numpy array) earlier in train_net()
for delta_t in delta[::-1]:
    advantage = gamma * lmbda * advantage + delta_t[0]
    advantage_lst.append([advantage])
advantage_lst.reverse()
advantage = torch.tensor(advantage_lst, dtype=torch.float)

pi = self.pi(s, softmax_dim=1)
pi_a = pi.gather(1, a)
ratio = torch.exp(torch.log(pi_a) - torch.log(prob_a))  # a/b == exp(log(a)-log(b))

surr1 = ratio * advantage
surr2 = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * advantage
```
hi, I think the gradient should come from advantage rather than from ratio * advantage; ratio should only be a correction term.
The difference mainly shows up during backpropagation.
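
For concreteness, a minimal sketch of the detached-ratio objective this comment seems to describe, assuming the same tensor names as the snippet above (pi_a, prob_a, advantage, with advantage carrying no gradient); this is an illustration of the proposal, not code from the repository:

```python
import torch

# Proposed variant: the ratio is detached, so it acts only as an importance-
# sampling correction, and the policy gradient flows through log(pi_a) * advantage.
def is_weighted_pg_loss(pi_a, prob_a, advantage):
    ratio = (pi_a / prob_a).detach()
    return -(ratio * torch.log(pi_a) * advantage).mean()
```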

Did you double-check PPO's objective function?
pi_theta (= pi_a) has to be updated at the same time. The ratio is what updates the policy network, while also blocking excessively large policy updates.

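For comparison, a minimal sketch of the clipped surrogate this reply refers to, again assuming the names from the snippet above (pi_a, prob_a, advantage, eps_clip); here the ratio is deliberately not detached, since its dependence on pi_a is exactly what carries the policy gradient:

```python
import torch

# PPO clipped surrogate: the ratio keeps its grad_fn so the policy network is
# updated, while clamp + min block excessively large policy updates.
def ppo_clipped_surrogate(pi_a, prob_a, advantage, eps_clip=0.1):
    ratio = torch.exp(torch.log(pi_a) - torch.log(prob_a))  # pi_theta / pi_theta_old
    surr1 = ratio * advantage
    surr2 = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * advantage
    return -torch.min(surr1, surr2).mean()
```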

Thank you for your reply.

I cannot understand why delta was detach()ed (i.e., Link).

Can I keep the gradient of delta?
For example, I changed

```python
delta = delta.detach().numpy()
advantage_lst = []
advantage = 0.0
for delta_t in delta[::-1]:
    advantage = gamma * lmbda * advantage + delta_t[0]
    advantage_lst.append([advantage])
advantage_lst.reverse()
advantage = torch.tensor(advantage_lst, dtype=torch.float)
```

to

```python
adv_list = []
adv = torch.zeros(1)
for index in range(len(delta) - 1, -1, -1):
    adv = gamma * lmbda * adv + delta[index]
    adv_list.append(adv)
adv_list.reverse()
advantage = torch.stack(adv_list)
```

But it does not work, even though the values of the two advantage tensors are the same. Why can the advantage not have a gradient?

I apologize for my carelessness. At first, I thought that your PPO was policy gradient + importance sampling. For that reason, the objective function would take the form ratio * \sum log(pi_a) * G, where ratio = pi_a / prob_a. In that situation, I think the ratio should be ratio.detach(), right? Why does it give a bad RL result when I keep the gradient of ratio in that situation?

Thank you again!
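
On the gradient question above, a small self-contained check (the names here are placeholders, not the repository's code) of why the first version cannot carry a gradient while the second can: once delta has gone through .detach().numpy(), the tensor rebuilt from those plain floats is a leaf with no grad_fn, whereas torch.stack over never-detached tensors keeps the graph, even though the values are identical.

```python
import torch

v_s = torch.randn(4, 1, requires_grad=True)  # placeholder for value-network outputs
delta = torch.randn(4, 1) - v_s              # depends on v_s, so it has a grad_fn

# First version: only the values survive, the graph is cut
rebuilt = torch.tensor(delta.detach().numpy(), dtype=torch.float)
print(rebuilt.grad_fn)   # None

# Second version: stacking non-detached tensors keeps the graph
kept = torch.stack([delta[i] for i in range(delta.shape[0])])
print(kept.grad_fn)      # a StackBackward node
```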

PPO's objective is for updating the policy network. delta's purpose is to give an unbiased estimate of the advantage for that policy update. There is no need to update the value network together with the policy network, and it would not be updated properly that way. The value network is updated by a smooth L1 loss, as sketched below.

And about the next question, I don't understand it properly, sorry. You made an agent that is updated by [ratio * \sum log(pi_a) * G] and detached the ratio, but it doesn't work well, right?
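
A minimal sketch of the split described above, assuming names in the spirit of ppo.py (surr1 / surr2 from the clipped surrogate, v_s for the value-network output, td_target for its bootstrap target): the policy is updated through the surrogate built on the detached advantage, and the value network is updated only by the smooth L1 term, with the TD target detached so it stays fixed during the update.

```python
import torch
import torch.nn.functional as F

def total_loss(surr1, surr2, v_s, td_target):
    policy_loss = -torch.min(surr1, surr2)                     # gradients reach the policy network
    value_loss = F.smooth_l1_loss(v_s, td_target.detach())     # gradients reach the value network
    return (policy_loss + value_loss).mean()
```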