yandexdataschool / Practical_RL

A course in reinforcement learning in the wild


Do something about the actor-critic Coursera assignment

dniku opened this issue · comments

Currently, week5_policy_based/practice_a3c.ipynb has numerous problems.

  • It does not implement A3C; it is a plain actor-critic.
  • We only have it in TensorFlow, since it has no corresponding assignment in master (it is a heavily modified version of master/week08/practice_pomdp, which was never originally intended to be an actor-critic assignment).

The difficulty in fixing this is that the videos leading up to this assignment talk about A3C a lot.

@dniku, have you done anything about this problem so far? I am also running into related issues in the week08 assignment. I am trying to fix both now: the policy loss and the reward are wrong in both notebooks (week08 and week06), although I have done everything I can to fix them. I want to know whether the problem lies in the atari_util.py file rather than in our code.

Some screenshots of both cases:

Week 06: [screenshot of plots]

Week 08: [screenshot of plots]

@AI-Ahmed

We haven't done anything about this assignment yet. Note, however, that this issue is about the Coursera assignment specifically, not the ones in the master branch, which, I assume, are what you are talking about.

Your screenshots of the plots seem to indicate that your agent isn't learning anything at all and is behaving randomly. My guess is that the cause is a bug in your code, e.g. a missing minus sign before the policy loss. You may want to refer to some open-source implementation of A2C (e.g. this one) to compare against yours and possibly spot the error.
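For illustration, here is a minimal sketch of a correctly signed advantage actor-critic loss in PyTorch (all names here are hypothetical; this is a generic sketch, not the assignment's actual code):

```python
import torch

# A minimal sketch, assuming discrete actions; all names are hypothetical.
def actor_critic_loss(logits, values, actions, returns):
    log_probs = torch.log_softmax(logits, dim=-1)
    # Log-probability of the actions that were actually taken.
    chosen_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    # Advantage = return - baseline; detach so the actor's loss
    # does not backpropagate into the critic.
    advantages = (returns - values).detach()
    # Note the leading minus: we maximize E[advantage * log pi(a|s)],
    # so the loss we *minimize* is its negation. Dropping this sign
    # makes the agent unlearn good actions.
    policy_loss = -(chosen_log_probs * advantages).mean()
    value_loss = ((returns - values) ** 2).mean()
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return policy_loss + 0.5 * value_loss - 0.01 * entropy
```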

Hello @dniku,
The problem is now solved in both notebooks. In both cases it was caused by not multiplying the bootstrapped next-state value by is_not_done when computing the value target. Without that mask, the target kept bootstrapping the imperfect value function across episode boundaries instead of reducing to the immediate reward at terminal states, so the agent behaved as if episodes never ended, and the value estimates for all states converged toward the same value.
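For reference, the fix amounts to masking the bootstrap term in the TD target; a minimal sketch (hypothetical function and argument names, mirroring the is_not_done variable mentioned above):

```python
import torch

def value_target(rewards, next_values, is_not_done, gamma=0.99):
    # Zero out the bootstrapped value at terminal transitions, so the target
    # there reduces to the immediate reward. Without the is_not_done mask the
    # agent treats every episode as endless and all state values blur together.
    return rewards + gamma * next_values.detach() * is_not_done
```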