seungeunrho / minimalRL

Implementations of basic RL algorithms with minimal lines of code! (PyTorch based)

Problem with `train_net()` in the REINFORCE algorithm.

fuyw opened this issue · comments

commented

Thanks for the high-quality implementations. But I have a question about `train_net()` in the REINFORCE algorithm:

    def train_net(self):
        R = 0
        for r, prob in self.data[::-1]:
            R = r + gamma * R
            loss = -torch.log(prob) * R
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()  # note: a parameter update happens at every step of the loop
        self.data = []

The policy is updated at every step of the episode, but it is supposed to be updated only after a complete episode (or a batch of episodes).

Once we perform an update, the policy changes, and the data collected under the old policy is, in theory, no longer usable.

That's right. The policy has to be updated after the for loop, and I have fixed it. You can find the changed code under this issue.
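
For reference, here is a minimal sketch of what that fix could look like, assuming the same class, `gamma`, and optimizer as in the snippet above (this is an illustration, not necessarily the exact committed code): gradients are accumulated over the whole episode and a single optimizer step is taken at the end.

    def train_net(self):
        R = 0
        self.optimizer.zero_grad()
        for r, prob in self.data[::-1]:
            R = r + gamma * R              # discounted return from this step onward
            loss = -torch.log(prob) * R
            loss.backward()                # gradients accumulate across the episode
        self.optimizer.step()              # a single parameter update per episode
        self.data = []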

commented

Thanks for your reply.

commented

I found an interesting problem: the policy converges more slowly after we fix the bug (comparing wall-clock time).

Is this problem so easy that the policy can converge faster even when it is updated at every step? I don't have a reasonable explanation for this (or maybe it is just a coincidence). Could you please give me some hints?

Thank you very much.

I hadn't thought about that. Good question. I fixed the slow-convergence problem.

However, the master branch correctly implements the REINFORCE pseudocode.
We were confused about why it does not need to account for the change in the state distribution:
`prob` and `r` are already stored in a list as static values, so updating at every step does not affect that collected data.
My new code simply does all the updates at once instead. I guess the existing code is easier to understand.
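
For completeness, a hedged sketch of that "all at once" form, again assuming the same class as above (my illustration, not the repository's code): the per-step losses are summed first, then a single backward pass and optimizer step are performed.

    def train_net(self):
        R = 0
        loss = 0
        for r, prob in self.data[::-1]:
            R = r + gamma * R
            loss = loss - torch.log(prob) * R   # accumulate the per-step losses
        self.optimizer.zero_grad()
        loss.backward()                          # one backward pass over the summed loss
        self.optimizer.step()                    # one parameter update per episode
        self.data = []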

Anyway, it was so interesting that I'm giving up on my exam next week for you, haha.
Thanks.

commented

Thanks for your help. I'm really sorry for disturbing your exam preparation.
Good luck, and I hope you do well in the exam.
Many thanks.