seungeunrho / minimalRL

Implementations of basic RL algorithms with minimal lines of code! (PyTorch based)

Problem with `train_net()` in the REINFORCE algorithm.

fuyw opened this issue · comments

commented

Thanks for the high-quality implementations. But I have a question about `train_net()` in the REINFORCE algorithm:

    def train_net(self):
        R = 0
        for r, prob in self.data[::-1]:
            R = r + gamma * R
            loss = -torch.log(prob) * R
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()  # note: a parameter update happens at every step of the loop
        self.data = []

The policy is updated at every step of the episode, but it is supposed to be updated only after a complete episode (or a batch of episodes).

Once we perform an update, the policy changes, and the data collected under the old policy is, in theory, no longer usable.

That's right. The policy has to be updated after the for loop, and I have fixed it. You can find the changed code under this issue.
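
For reference, here is a minimal sketch of what that fix could look like, assuming the same class, `gamma`, and optimizer as in the snippet above (this is an illustration, not necessarily the exact committed code): gradients are accumulated over the whole episode and a single optimizer step is taken at the end.

    def train_net(self):
        R = 0
        self.optimizer.zero_grad()
        for r, prob in self.data[::-1]:
            R = r + gamma * R              # discounted return from this step onward
            loss = -torch.log(prob) * R
            loss.backward()                # gradients accumulate across the episode
        self.optimizer.step()              # a single parameter update per episode
        self.data = []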

commented

Thanks for your reply.

commented

I found an interesting problem: the policy converges more slowly after we fix the bug (comparing wall-clock time).

Is this problem so easy that the policy can converge faster even when it is updated at every step? I don't have a reasonable explanation for this (or maybe it is just a coincidence). Could you please give me some hints?

Thank you very much.

I hadn't thought about that. Good question. I fixed the slow-convergence problem.

However, the master branch correctly implements the REINFORCE pseudocode.
We were confused about why it does not need to account for the change in the state distribution:
`prob` and `r` are already stored in a list as static values, so updating at every step does not affect that collected data.
My new code simply does all the updates at once instead. I guess the existing code is easier to understand.
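
For completeness, a hedged sketch of that "all at once" form, again assuming the same class as above (my illustration, not the repository's code): the per-step losses are summed first, then a single backward pass and optimizer step are performed.

    def train_net(self):
        R = 0
        loss = 0
        for r, prob in self.data[::-1]:
            R = r + gamma * R
            loss = loss - torch.log(prob) * R   # accumulate the per-step losses
        self.optimizer.zero_grad()
        loss.backward()                          # one backward pass over the summed loss
        self.optimizer.step()                    # one parameter update per episode
        self.data = []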

Anyway, it was so interesting that I'm giving up on my exam next week for you, haha.
Thanks.

commented

Thanks for your help. I'm really sorry for disturbing your exam preparation.
Good luck, and I hope you do well in the exam.
Many thanks.