pemami4911 / deep-rl

Collection of Deep Reinforcement Learning algorithms

convergence issue

fangthu opened this issue · comments

Hi, recently I used your template to learn some simple maneuvers.

But I find that the output always converges to -1 or +1 if the number of episodes is large enough and the output boundary is [-1, 1].

Have you ever run into this situation, or do you know how to solve it?

Best

I'm running into a similar issue. I tried adding a penalty for a certain number of repeated actions, but after a very large number of episodes it still converges onto an end member. I'd be interested to hear if anyone else has a workaround for this issue.

I haven't had a chance to look into this issue, but an initial suggestion is to add a learning rate schedule with tf.train.exponential_decay for both networks. Also, set a target loss or average reward and stop training the networks once you hit it, rather than continuing to update the weights. If the networks have learned well enough before you have trained them for X number of episodes, stopping early is recommended to prevent the weights from sliding into a worse local minimum.
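
For reference, a minimal sketch of what that schedule could look like with the TF 1.x API; the INITIAL_LR, DECAY_STEPS, and DECAY_RATE values and the stand-in loss are illustrative, not taken from this repo:

import tensorflow as tf

# Illustrative hyperparameters -- tune per network/environment
INITIAL_LR = 1e-3
DECAY_STEPS = 10000   # apply a decay step every 10k training updates
DECAY_RATE = 0.96     # multiply the learning rate by 0.96 at each decay step

global_step = tf.Variable(0, trainable=False, name="global_step")
learning_rate = tf.train.exponential_decay(
    INITIAL_LR, global_step, DECAY_STEPS, DECAY_RATE, staircase=True)

# Stand-in loss so the snippet is self-contained; in DDPG this would be the
# critic's TD error or the actor's policy-gradient objective.
w = tf.Variable(1.0)
loss = tf.square(w - 2.0)

optimizer = tf.train.AdamOptimizer(learning_rate)
train_op = optimizer.minimize(loss, global_step=global_step)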

I think that with the Adam optimiser you don't need learning rate decay. I did email David Silver about this, and he said it's usually possible to solve Pendulum with bang-bang control, so if it's stabilising and achieving the desired reward, maybe converging to -1 or +1 is okay.

Ah, right. Yeah, this implementation is pretty simple, so it works for a task like Pendulum. More tricks and tuning would definitely be needed for a more complex problem.

@Anjum48

Hi, may I ask you a question: how can I plot a chart like the one you posted with TensorBoard?

Hi @GoingMyWay, it's a bit tricky, but you can create a histogram for TensorBoard from a NumPy array. I used a custom function that does this (a bit hacky, but I haven't found a better way yet): https://github.com/Anjum48/rl-examples/blob/master/dppg/ddpg.py#L204
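
For anyone else looking, here is a sketch of that general trick (a HistogramProto filled from a NumPy histogram, written to a TF 1.x FileWriter); it is based on the commonly used workaround and may differ in detail from the function linked above:

import numpy as np
import tensorflow as tf

def log_histogram(writer, tag, values, step, bins=1000):
    # Bin the raw values with NumPy
    values = np.asarray(values)
    counts, bin_edges = np.histogram(values, bins=bins)

    # Fill in the summary statistics TensorBoard expects
    hist = tf.HistogramProto()
    hist.min = float(np.min(values))
    hist.max = float(np.max(values))
    hist.num = int(values.size)
    hist.sum = float(np.sum(values))
    hist.sum_squares = float(np.sum(values ** 2))

    # TensorBoard uses the right edge of each bin, so drop the first edge
    for edge in bin_edges[1:]:
        hist.bucket_limit.append(float(edge))
    for count in counts:
        hist.bucket.append(float(count))

    summary = tf.Summary(value=[tf.Summary.Value(tag=tag, histo=hist)])
    writer.add_summary(summary, step)
    writer.flush()

# Usage: writer = tf.summary.FileWriter("./logs")
#        log_histogram(writer, "rewards", episode_rewards, episode_number)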

Hope this helps!

@Anjum48

Thank you. BTW, how many episodes does it take to train Pendulum-v0? I trained it for 10k episodes, but it still hasn't converged.

I found that my implementation of DDPG (which is pretty similar to how @pemami4911 did it) converges after 100-200 episodes (FYI, I can't get it to learn this fast with other algorithms, e.g. A3C or PPO).

image

In my experience, DDPG is very sensitive to how the OU noise is added to the actions, so I added an exponential decay like this:

# epsilon decays exponentially with the episode index i; TAU2 sets the decay rate
epsilon = np.exp(-i/TAU2)
# scale the OU noise by epsilon and normalise it by the action bound before adding
a += epsilon * exploration_noise.noise() / env.action_space.high

with TAU2 = 25 (this should depend on the environment). An interesting area of research that I still need to try is adding noise to the network parameters rather than the actions (see https://github.com/openai/baselines/tree/master/baselines/ddpg).
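
For completeness, a minimal sketch of the OU noise process that the decay above is applied to; the theta/sigma values are common defaults, not necessarily the ones this repo uses:

import numpy as np

class OUNoise:
    # Ornstein-Uhlenbeck process, a common choice of exploration noise for DDPG
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(action_dim)
        self.theta = theta
        self.sigma = sigma
        self.state = self.mu.copy()

    def reset(self):
        # Reset the process at the start of each episode
        self.state = self.mu.copy()

    def noise(self):
        # Mean-reverting step: dx = theta * (mu - x) + sigma * N(0, I)
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(len(self.state))
        self.state = self.state + dx
        return self.state

# Per-episode decay as above (i is the episode index, exploration_noise = OUNoise(action_dim)):
#   epsilon = np.exp(-i / TAU2)
#   a = actor.predict(s) + epsilon * exploration_noise.noise() / env.action_space.high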

@Anjum48

Thank you, I will run your code. The results from pemami4911's code are:

image

From the curves of average max Q and reward, I can't tell whether it has converged or not.

@GoingMyWay I suspect that it is converging, but because the noise term is still being added to the actions (i.e. it hasn't decayed to zero after learning), the actions are too noisy to get a smooth-looking reward curve. For example, the pendulum might be nicely balanced in the upright position, but the random noise added to the actions will knock it off balance, hence the poor scores.
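
One way to check this is to run a few evaluation episodes with the exploration noise switched off and look at the undisturbed returns. A rough sketch, where `env` and `actor` are placeholders for the Gym environment and a trained actor network with a `predict` method:

import numpy as np

def evaluate(env, actor, episodes=10):
    # Roll out the greedy policy with no OU noise added to the actions
    returns = []
    for _ in range(episodes):
        s = env.reset()
        done, total = False, 0.0
        while not done:
            a = actor.predict(s)            # deterministic action, no exploration noise
            s, r, done, _ = env.step(a)
            total += r
        returns.append(total)
    return np.mean(returns)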