awjuliani / DeepRL-Agents

A set of Deep Reinforcement Learning Agents implemented in Tensorflow.

Fails to learn

KeirSimmons opened this issue · comments

I ran the notebook without any changes on the vizdoom environment. After around an hour the reward became non-negative and peaked at around 0.7, but continuing to run the code resulted in the reward going back to -3.0 (I assume the most negative reward possible) and remaining stagnant for over 24 hours. A view of the produced gifs shows the agent walking to the left continuously without choosing any other action.

I also attempted to change the environment to OpenAI's Pong-v0 and have run this for over 24 hours without the average reward improving at all. If anyone knows which variables might be worth changing here, I'd be grateful. I'm using 80x80 Pong images and allowing all 6 actions to be chosen. The code is otherwise unchanged, apart from modifying the 'game' variable to work with the OpenAI environment (tested manually; it works).
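A minimal sketch of what such a wrapper around Pong-v0 might look like, assuming the notebook expects the VizDoom-style calls `new_episode`, `get_state().screen_buffer`, `make_action`, and `is_episode_finished`; the method names, cropping, and preprocessing here are illustrative assumptions, not the poster's actual code:

```python
import gym
import numpy as np


class FakeGameState(object):
    """Stand-in for VizDoom's GameState, exposing only screen_buffer."""
    def __init__(self, screen_buffer):
        self.screen_buffer = screen_buffer


class PongWrapper(object):
    """Adapts gym's Pong-v0 to a VizDoom-like interface (assumed names)."""
    def __init__(self):
        self.env = gym.make('Pong-v0')
        self.obs = None
        self.done = False

    def _preprocess(self, frame):
        # Crop the playing field, downsample to 80x80, convert to grayscale.
        frame = frame[34:194]              # 160x160 playing area
        frame = frame[::2, ::2]            # downsample to 80x80
        return np.mean(frame, axis=2).astype(np.uint8)

    def new_episode(self):
        self.obs = self._preprocess(self.env.reset())
        self.done = False

    def get_state(self):
        return FakeGameState(self.obs)

    def make_action(self, action_index):
        # Here action_index is one of the 6 discrete Pong actions.
        obs, reward, done, _ = self.env.step(action_index)
        self.obs = self._preprocess(obs)
        self.done = done
        return reward

    def is_episode_finished(self):
        return self.done
```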

I have left it running on Pong for 4 days now on an 8-core CPU, roughly 10,000 episodes per worker. The average reward (over the last 100 episodes) has not increased even slightly; it sits around -20.5 to -21.0 (the minimum reward).

The same happens with the default VizDoom setup: only the minimum reward is achieved. Has this code-base been tested by anyone else?

I had the same problem for both the Doom environment and for Pong. For the Doom environment, I only trained the agent using the parameters defined in the current version, but for the Pong environment, I tested several different network architectures with different learning rates and optimizers.

But, in the end, I couldn't get it to work even after training for a day. In fact, it ended up converging to a policy where it always moves up or always moves down (not both). I then tried recreating this code from scratch, but ended up with the same problem (my code is in my GitHub account).

I would really appreciate any hints.

I have also fought a lot with using different environments; the network always collapses to a single action and accomplishes nothing. It works well enough for the provided Doom example, but tuning the network for any other task seems oddly difficult. @IbrahimSobh really seemed to fight with this in the Doom health level. His use of skip frames helped save time, but the inability to learn such tasks had me wondering if there was a more fundamental problem going on.
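For context, frame skipping just repeats the chosen action for a few frames and sums the rewards, so the agent only has to decide (and learn) on every k-th frame. A minimal sketch, assuming a gym-style `step` interface rather than the actual code used in that experiment:

```python
def step_with_skip(env, action, skip=4):
    """Repeat `action` for `skip` frames, accumulating the reward."""
    total_reward, done = 0.0, False
    for _ in range(skip):
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return obs, total_reward, done
```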

I managed to get the network to learn a good policy for the given Doom environment. My mistake was that, in order to fix the NaN problem the original version has (it happens when the policy outputs a zero value for an action, resulting in a NaN after taking the log of zero), I added a small value (1e-8) to the policy. But then I realized that this wasn't a small enough value and it was interfering with the results. After changing it to 1e-13, the network converged for the Doom environment in about 6k episodes (across all threads).
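To illustrate where that constant sits, here is a rough TF 1.x sketch of the A3C policy and entropy loss terms with the epsilon added inside the log; the placeholder shapes are assumptions, and the only point is that the constant has to be tiny (1e-8 noticeably biased the gradients, 1e-13 did not):

```python
import tensorflow as tf

eps = 1e-13  # added inside the log purely to avoid log(0) -> NaN

policy = tf.placeholder(tf.float32, [None, 3])          # softmax output, e.g. 3 Doom actions
actions_onehot = tf.placeholder(tf.float32, [None, 3])  # actions actually taken
advantages = tf.placeholder(tf.float32, [None])         # estimated advantages

responsible_outputs = tf.reduce_sum(policy * actions_onehot, [1])
policy_loss = -tf.reduce_sum(tf.log(responsible_outputs + eps) * advantages)
entropy = -tf.reduce_sum(policy * tf.log(policy + eps))
```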

But the problem still persists for the Pong environment. I still can't get it to learn; it keeps converging to a bad policy where the agent executes only one action. I'm actually using a different network setup, the same one used by the OpenAI gym A3C implementation: 4 convolutional layers with 32 filters, 3x3 kernels, and strides of 2, followed by an LSTM layer with an output of 256 (there is no hidden layer between the convolutional and LSTM layers). It also uses the Adam optimizer with a learning rate of 1e-4. But it still doesn't work.

I really don't understand why it still doesn't work.
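A sketch of that architecture in TF 1.x; only the conv/LSTM sizes and the Adam learning rate come from the description above, while the 80x80 single-channel input, ELU activations, and 6 Pong actions are assumptions:

```python
import tensorflow as tf

inputs = tf.placeholder(tf.float32, [None, 80, 80, 1])

# Four conv layers: 32 filters, 3x3 kernels, stride 2 (80 -> 40 -> 20 -> 10 -> 5).
x = inputs
for _ in range(4):
    x = tf.layers.conv2d(x, filters=32, kernel_size=3, strides=2,
                         padding='same', activation=tf.nn.elu)

# No fully-connected layer in between: flatten and feed straight into the LSTM,
# treating the batch as one time sequence (the usual A3C rollout layout).
x = tf.reshape(x, [-1, 5 * 5 * 32])
rnn_in = tf.expand_dims(x, 0)                      # [1, time, features]

lstm_cell = tf.contrib.rnn.BasicLSTMCell(256)
lstm_out, _ = tf.nn.dynamic_rnn(lstm_cell, rnn_in, dtype=tf.float32)
lstm_out = tf.reshape(lstm_out, [-1, 256])

policy = tf.layers.dense(lstm_out, 6, activation=tf.nn.softmax)  # 6 Pong actions
value = tf.layers.dense(lstm_out, 1)

optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)
```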

Hello,
I'm implementing A3C without LSTM based on this repository and others, and one thing that I'm pretty sure is broken is the shared optimizer.
I'm trying the SpaceInvaders-v0 env. When the optimizer object is passed naïvely to the threads, the game score stays pretty much at the level of random play. With one optimizer per thread, the current performance is about 280 points on average (random play is about 140).
My code is still running, but the speed of improvement is a bit disappointing, since I'm using 16 threads.

By the way, in Denny Britz's repository, it's separate optimizers per thread.
https://github.com/dennybritz/reinforcement-learning/blob/master/PolicyGradient/a3c/estimators.py
This A3C repo, written in PyTorch, also uses normal optimizers, one per thread, but the author has also written shared versions of Adam and RMSProp.
https://github.com/dgriff777/rl_a3c_pytorch/blob/master/shared_optim.py
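A minimal sketch of the per-worker-optimizer pattern those repositories use: each worker builds its own optimizer but applies the gradients of its local loss to the shared global variables, so only the optimizer's internal state (e.g. Adam's moment estimates) stays private to the thread. The scope names, learning rate, and clipping value are assumptions:

```python
import tensorflow as tf

def build_train_op(local_loss, worker_scope, global_scope='global'):
    """Per-worker optimizer that updates the shared global network."""
    local_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, worker_scope)
    global_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, global_scope)

    # Gradients are computed w.r.t. the worker's local copy of the network...
    gradients = tf.gradients(local_loss, local_vars)
    gradients, _ = tf.clip_by_global_norm(gradients, 40.0)

    # ...but applied to the global variables. The optimizer (and its slot
    # variables) is created per worker instead of being shared across threads.
    optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)
    return optimizer.apply_gradients(zip(gradients, global_vars))
```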

@zhaolewen Which TF version are you using?

I also implemented my own repo based on this. With newer TF versions (>1.0) it seemed to work. However, I had to revert to 0.8 due to some hardware issues (I'm trying to run my network on a TK1 board). Now it does not seem to learn anything... Of course I had to make tons of changes to get my code compatible with the old Python API, but I'm just wondering if this shared optimizer issue could be related to earlier TF versions.

Hi @mkisantal
I'm using TF 1.2, I think. Well, if it's caused by differences between TF versions, that would be quite tricky...

Now I'm running a test with separate optimizers to see if it solves the issue. But yeah, there might be tons of other reasons for the problems I'm experiencing, as I reverted from 1.4 back to 0.8, and TF has been under heavy development between those versions.