yandexdataschool / Practical_RL

A course in reinforcement learning in the wild


Do something about the actor-critic Coursera assignment

dniku opened this issue · comments

Currently, week5_policy_based/practice_a3c.ipynb has numerous problems.

  • It does not implement A3C; it is a plain actor-critic.
  • We only have it in TensorFlow, since it has no corresponding assignment in master (it is a heavily modified version of master/week08/practice_pomdp, which was never originally intended to be an actor-critic assignment).

The difficulty in fixing this is that the videos leading up to this assignment talk about A3C a lot.

@dniku, have you done anything about this problem so far? I am also running into related issues in the week08 assignment. I am trying to fix both now: the policy loss and the reward are wrong in both notebooks (week08 and week06), although I have done everything I can to fix them. I want to know whether the problem lies in the atari_util.py file rather than in our code.

Some screenshots of both cases:

Week 06: [screenshot of plots]

Week 08: [screenshot of plots]

@AI-Ahmed

We haven't done anything about this assignment yet. Note, however, that this issue is about the Coursera assignment specifically, not the ones in the master branch, which, I assume, are what you are talking about.

Your screenshots of the plots seem to indicate that your agent isn't learning anything at all and is behaving randomly. My guess is that the cause is a bug in your code, e.g. a missing minus sign before the policy loss. You may want to refer to some open-source implementation of A2C (e.g. this one) to compare against yours and possibly spot the error.
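For illustration, here is a minimal sketch of a correctly signed advantage actor-critic loss in PyTorch (all names here are hypothetical; this is a generic sketch, not the assignment's actual code):

```python
import torch

# A minimal sketch, assuming discrete actions; all names are hypothetical.
def actor_critic_loss(logits, values, actions, returns):
    log_probs = torch.log_softmax(logits, dim=-1)
    # Log-probability of the actions that were actually taken.
    chosen_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    # Advantage = return - baseline; detach so the actor's loss
    # does not backpropagate into the critic.
    advantages = (returns - values).detach()
    # Note the leading minus: we maximize E[advantage * log pi(a|s)],
    # so the loss we *minimize* is its negation. Dropping this sign
    # makes the agent unlearn good actions.
    policy_loss = -(chosen_log_probs * advantages).mean()
    value_loss = ((returns - values) ** 2).mean()
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return policy_loss + 0.5 * value_loss - 0.01 * entropy
```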

Hello @dniku,
The problem is now solved in both notebooks. In both cases it was caused by not multiplying the bootstrapped next-state value by is_not_done when computing the value target. Without that mask, the target kept bootstrapping the imperfect value function across episode boundaries instead of reducing to the immediate reward at terminal states, so the agent behaved as if episodes never ended, and the value estimates for all states converged toward the same value.
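For reference, the fix amounts to masking the bootstrap term in the TD target; a minimal sketch (hypothetical function and argument names, mirroring the is_not_done variable mentioned above):

```python
import torch

def value_target(rewards, next_values, is_not_done, gamma=0.99):
    # Zero out the bootstrapped value at terminal transitions, so the target
    # there reduces to the immediate reward. Without the is_not_done mask the
    # agent treats every episode as endless and all state values blur together.
    return rewards + gamma * next_values.detach() * is_not_done
```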