miyosuda / async_deep_reinforce

Asynchronous Methods for Deep Reinforcement Learning


Problem while using the code

originholic opened this issue · comments

Hello @miyosuda,

Thanks for sharing the code, and please ignore the title. I tried out your code on the cart-pole balance control problem instead of an Atari game, and it works well. But I have a few questions to ask.

I am curious: in the asynchronous paper they also use another model with one linear layer, one LSTM layer, and a softmax output. I am thinking of using this model to see whether it improves the result. Can you suggest how the LSTM could be implemented in TensorFlow for playing Atari games?

I am also wondering about the accumulated states and rewards being reversed. Do you need to reverse the actions and values as well? It did not make any difference when I tried it, I am just wondering why.

states.reverse()
rewards.reverse()

Last, do you really need to accumulate the gradients and then apply the update, since TensorFlow can handle the "batch" for the update?

I'm really glad to hear that you've succeeded in reproducing the continuous model.

About the reverse(), as you say, I forgot to add

actions.reverse()
values.reverse()

in addition to

states.reverse()
rewards.reverse()

and I've pushed the fix just now. Let me explain why there are reverse() calls on these lists.
In the pseudocode of the A3C algorithm in the DeepMind paper, there is

for i in {t-1, ...., tstart} do

This means that "i" will decrease like

t-1, t-2, t-3 ... tstart

This is why I put these reverse() calls on the lists collected in the loop.
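To illustrate, here is a rough sketch of that backward pass (not the exact code in the repo; R is assumed to start from the bootstrapped value of the last state, and GAMMA is the discount factor):

states.reverse()
actions.reverse()
rewards.reverse()
values.reverse()

batch_si, batch_a, batch_td, batch_R = [], [], [], []
for (si, ai, ri, Vi) in zip(states, actions, rewards, values):
    R = ri + GAMMA * R      # n-step return built from t-1 down to t_start
    td = R - Vi             # advantage estimate used by the policy loss
    batch_si.append(si)
    batch_a.append(ai)
    batch_td.append(td)
    batch_R.append(R)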

As far as I have tried with the Atari model, I still couldn't reproduce a good learning result. I was planning to implement LSTM after reproducing a good Atari result, but since you say you've succeeded with the continuous model, I should try LSTM now.
Please wait.

About the batch, let me think about whether it is possible to replace the accumulated-gradient update with a batch update.
Just a moment please.

Thank you for suggestions!!

Many thanks for your reply, and glad to hear that you also plan to work on the LSTM model.

I just uploaded the test code (based on this repo) for the "batch" update that I mentioned.
https://github.com/originholic/a3c_vrep.git

I only tested it with the cart-pole balance domain, but somehow I found it actually takes longer to reach the desired score than your implementation. I will try to investigate this later; for now I will continue to work with your implementation to study the LSTM model, which I am not familiar with.

Also, instead of a constant learning rate:

math.exp( log_lo * ( 1-rate ) + log_hi * rate)

I don't know whether the random initialization of the learning rate mentioned in the paper could help to improve the results:

math.exp( random.uniform( log_lo, log_hi ) )
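In other words, something like the following two helpers (just a sketch, the function names are mine):

import math
import random

def interpolated_log_rate(log_lo, log_hi, rate):
    # Deterministic interpolation between exp(log_lo) and exp(log_hi) on a log scale.
    return math.exp(log_lo * (1.0 - rate) + log_hi * rate)

def sample_log_uniform_rate(log_lo, log_hi):
    # Random sample, log-uniformly distributed between exp(log_lo) and exp(log_hi).
    return math.exp(random.uniform(log_lo, log_hi))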

I just uploaded the test code (based on this repo) for the "batch" update that I mentioned.
https://github.com/originholic/a3c_vrep.git

Thanks! I'll try.

About the LSTM, I'm also new to LSTM and only started studying it recently, so please don't expect too much! However, I'm really interested in the 3D labyrinth model with LSTM, so I would like to try it.

About randomizing the learning rate with log_uniform, I also used to randomize the initial learning rate for each thread with log_uniform. However, when I looked at the figure on page 23, I found that the learning rate varies from 10^-4 to 10^-2, uniformly distributed with log-scale sampling.

So my understanding of the log_uniform function is that they use log-uniform sampling to find the best hyperparameter when they run their grid search.

(In the graphs on page 14 and page 22, they also use a log scale for the grid-searched parameters.)

However, I'm not sure my understanding is correct.

Thanks for pointing out.

After rethinking the random initialization, I think you are right about it: the initial learning rates sampled from the log-uniform range were used to demonstrate the sensitivity of their methods. And it makes sense that a constant (or best-choice) learning rate is applied for RMSProp and decayed to zero over time.

Sorry, my bad, I just got confused by this phrase in the paper:

"each using a different random initialization and initial learning rate"

No problem. Any suggestions and discussions are always welcome. Thanks!

Hi, tuning in again.
May I ask: in the continuous action domain of the asynchronous paper, they use two policy outputs, a linear layer for the mean and a Softplus activation on top of a linear layer for the variance. I am wondering how the policy loss can be calculated with two outputs?

self.policy_loss = -( tf.reduce_sum( tf.mul( tf.log(self.pi), self.a ) ) * self.td + entropy * entropy_beta )

I am thinking of calculating the loss separately, by making two of the above policy loss functions, one per output. Does this make sense to you?
Sorry, this might be outside the scope of your interest, since the 3D labyrinth doesn't require continuous actions, but any suggestion is highly appreciated. Many thanks!
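To be concrete, I imagine the two output heads looking roughly like this (purely a sketch; h and the weight variables are made-up names, not from your repo):

self.mu = tf.matmul(h, W_mu) + b_mu                            # mean: plain linear output
self.sigma2 = tf.nn.softplus(tf.matmul(h, W_sigma) + b_sigma)  # variance: Softplus keeps it positive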

I have never tried a continuous model, so today I looked into another simple cart-pole actor-critic sample without a NN to learn about it.
How to define the policy loss for continuous actions is still difficult for me, so I'll try the continuous model in a separate branch.
(I'm also interested in the continuous model.)
Maybe it would be natural to make two different loss functions for the mean and the variance, but I'm not sure yet.
I'll try to figure it out.

By the way, even with the discrete action model I'm implementing now, the policy loss function is the most difficult part for me, and I'm still not certain it is 100% correct.

However, when I tried a simple 2D grid maze model (which I implemented in the debug_maze branch), the program succeeded in finding the shortest path with this policy loss function. So the loss function for discrete actions seems fine.

Anyway, I'll report here if I find any results with the continuous model.

Thanks for the reply.
As far as I can tell from your code, the policy loss function for the discrete domain is calculated using the negative log-likelihood of the softmax output.

After doing some searching, maybe I can apply the same kind of loss function, i.e. the negative log-likelihood, but with a Gaussian (normal) distribution instead of the softmax, since the outputs are a mean and a variance. So I think the loss function looks like the following, where sigma2 is the variance and mu is the mean:

D = tf.to_float(tf.size(self.a))        # dimensionality of the action vector
x_prec = tf.exp(-tf.log(self.sigma2))   # precision, i.e. 1 / sigma^2
x_diff = tf.sub(self.a, self.mu)
x_power = tf.square(x_diff) * x_prec * -0.5
gaussian_nll = (tf.reduce_sum(tf.log(self.sigma2)) + D * tf.log(2 * np.pi)) / 2 - tf.reduce_sum(x_power)  # np is numpy
self.policy_loss = gaussian_nll * self.td + entropy_beta * entropy

Sorry for the messy typing, I will try this out to see whether it works for the continuous cartpole domain, and let you know how this goes.
Thanks

Is this the explanation of this loss function?

http://docs.chainer.org/en/stable/reference/functions.html#chainer.functions.gaussian_nll

I really want to know the result. There is a lot to learn from this thread for me. Super thanks!!

Yes, that's right, the negative log-likelihood of the normal distribution is from the Chainer site, but I also found another formulation called the maximum log-likelihood; I think they are the same thing judging from the formula alone. Same here, there are lots of methods out there waiting to be learned (and to get confused by).

I tried the loss function on top of your code; it works moderately well with the cart-pole balance task in the continuous action domain, at least it is able to converge (i.e. reach the desired score). But I probably need some more example tasks in order to conclude that the loss function actually works for continuous actions. So I'll keep working on it!! Thanks.

However, when I went back to try it with the "batch" method, it reached a score of around 2000 (the desired score was 3000) and then the network somehow diverged immediately (I am not entirely sure whether it diverged or exploded; the network just gave "NaN" for its output all the time).

Thank you for reporting.
I was trying batching with my discrete action code in the "batch" and "debug_maze_batch" branches.
I'm checking whether the gradient accumulation works correctly when batched.

@miyosuda:
Hey, I had been trying to implement the same thing in Theano. I implemented an A2C version (single thread), which never converges despite training on a GPU for even a week or so... Then I came across your git source. Could you please let me know what exactly the issues are that you are facing right now that make your learning not as good as required? Is it NaNs and stability, or no convergence of the network? We can try to catch up on this, as I am also in urgent need of an actor-critic learner on Pong.

@aravindsrinivas
Thank you for joining the discussion.
Let me explain what I tried, what I succeeded and what I have not succeeded yet.

I have been trying Pong with A3C with 8 CPU threads.
The problem is that the score of the game does not increase even with one or two days of learning.
The AI can hit the ball back three or four times in one game, but the score does not increase like the DeepMind paper shows.

(As far as I can tell with Pong, the network does not diverge to NaN.)

To confirm whether my implementation has a problem or not, I tried an easier task.
I implemented a 10x10 2D grid maze and let this A3C algorithm find the shortest path.
After running two or three minutes, the AI converged to the optimal result. (It succeeded in finding the shortest path.)

I tried this in "debug_maze" branch.

After confirming that this algorithm can solve an easy RL task, I'm changing the hyperparameters little by little to check whether the game score will increase like the paper shows.
But the result is still the same.

I once heard that DQN is very sensitive to hyperparameters, and as far as I can see from the paper, the hyperparameters of this method also seem sensitive.

Along with tuning hyperparameters, I'm also planning to try another task, one that doesn't use a CNN.

By the way, the key concept of this method is to get stability of the network by running multiple threads at the same time, so that it does not diverge or oscillate.
So if you have problems with a single thread, how about trying multiple threads?

I have never tried Theano, but if you would like to run it with TensorFlow, I can help you.

@miyosuda
I mailed the authors (from DeepMind). These are some hyper parameters that they explicitly told me in the mail:

The decay parameter (called alpha in the paper) for RMSProp was 0.99 and the regularization constant (called epsilon in the paper) was 0.1. The maximum allowed gradient norm was 40. The best learning rates were around 7*10^-4. Backups of length 20 were used which corresponds to setting the t_max parameter to 20.
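Collected as constants, that would be something like this (the names are mine, only the values come from their mail):

RMSP_ALPHA = 0.99             # RMSProp decay parameter ("alpha" in the paper)
RMSP_EPSILON = 0.1            # RMSProp regularization constant ("epsilon" in the paper)
GRAD_NORM_CLIP = 40.0         # maximum allowed gradient norm
INITIAL_LEARNING_RATE = 7e-4  # best learning rates were around 7*10^-4
LOCAL_T_MAX = 20              # backup length t_max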

Also, I am not sure if you used the frame skip in your implementation. From what I saw in game_state.py, you just have reward = ale.act(action)? Shouldn't it be in a for loop like
reward = 0
for _ in range(frame_skip):
    reward += ale.act(action)

Also, are you clipping the reward to lie between -1 and 1? In DQN, rewards were clipped between -1 and 1. I am not sure what the rewards are for Pong from the ALE src.
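For the clipping, a one-liner on the summed reward should be enough (just a sketch):

reward = max(-1.0, min(1.0, reward))  # clip the per-step reward to [-1, 1] as in DQN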

Wowwowow! They are the parameters that I really wanted!!! Super thanks!!

I was always using t_max = 5 and didn't use gradient norm clipping.
(In the paper there was only one line referring to gradient norm clipping, so I hadn't tried it.)

also, I am not sure if you used the frame skip in your implementation

I've set frame skipping to every 4 frames in the "ale.cfg" file, but as you say it might be better to use a loop as you suggested.

I'm not clipping the score, but ALE Pong gives rewards of 1 or -1, so it should be OK.

Anyway, super thanks for giving me such valuable information!!! I'll try these parameters.

@miyosuda
Another question: How exactly are you synchronizing the RMSProp parameters?

I'm accumulating gradients t_max times in each thread, and after that I'm applying these accumulated gradients with shared RMSProp. When applying the accumulated gradients, the "rms" parameter is shared among threads. (The "rms" parameter in TensorFlow corresponds to "g" in the paper.)
The "momentum" parameter in RMSProp could be shared as well, but I'm not using momentum in RMSProp because there was no mention of momentum in RMSProp in the paper.
(I'm using 0.0 as the momentum constant in RMSProp.)

When applying accumulated gradients with shared RMSProp, I'm not using any synchronization like mutual exclusion among threads.

(Is this what you are asking?)

As far as I can see from the TensorFlow source code, it seems OK to apply gradients without a lock when running on the CPU.
(To run it on the GPU, I need to research more to check whether we can implement shared RMSProp on the GPU or not, because memory handling on the GPU might be different from the CPU.)

Shouldn't we lock another thread from updating the parameters of the global network, when one particular thread is already updating it with its accumulated gradient from t_max steps?

My question was related to RMSProp previous gradient values. We do a moving average of the RMS of the gradients right? And the RMS is used to determine our update of the parameters. My question is: Would the gradient values of different threads all be used together to update the moving average of the RMS? Or do we have separate moving averages for each thread, which is used when that corresponding thread is updating the parameters using its accumulated gradient?

In the paper, they consider both approaches, but say that having separate RMSProp parameters (mainly the moving average) is less robust than sharing the moving average. But they don't reveal how exactly they synchronize the moving average across threads.

Could you explain what you are doing?

@aravindsrinivas
Sorry, my mistake: while checking my code, I found that the moving average of RMSProp is not shared. So my current implementation is not shared RMSProp.

I've created the RMSPropApplier class in rmsprop_applier.py.
In this class, the slot named "rms" corresponds to the parameter "g" in the paper.

(The "rms" slot parameter will be passed to native code around here)
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/training_ops.cc#L143

I created this class to share this "rms" parameter among threads, but I found that an RMSPropApplier instance is created in each thread in a3c_training_thread.py.

So the moving average is calculated differently in each thread. I need to fix this.
Sorry about that.
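Conceptually the fix is to create one applier and pass the same instance to every training thread, so they all update the same "rms" slots (the constructor arguments below are placeholders, not the exact signature):

shared_applier = RMSPropApplier(learning_rate=learning_rate_input,
                                decay=0.99,
                                epsilon=RMSP_EPSILON)
training_threads = []
for i in range(PARALLEL_SIZE):
    # every thread applies its accumulated gradients through the same applier
    training_threads.append(A3CTrainingThread(i, global_network, shared_applier))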

@miyosuda

Hi, I also confirmed that

  1. the critic learning rate must be half the actor's..
  2. the LR must be linearly annealed to 0 over the course of training.
  3. the parameters 'g' and 'theta' (the moving average of the RMS of the gradients and, of course, the parameters) are shared across the threads (unlike your earlier version with separate RMS moving averages). Also, there is no need for locking when updating.
  4. t_max = 20 means 20 perceived frames (80 with frame skip, as per the game)... not 20 states, i.e. not 20 84x84x4 tensors, but rather 20 84x84 frames...

A question: Could you tell me at what speed (steps per second) [where a step refers to a decision taken by the network during gameplay] the code runs for the 8-thread version? DeepMind says they get 1000 steps/sec from 16 threads, and thus for a single thread it should be 70. But I was never able to run at 70 for my single-thread code; it used to run at 30.

@aravindsrinivas

Thank you for providing such a valuable information again.

  1. the critic learning rate must be half the actor's..

I got it. I'll set the LR for the actor starting from 7*10^-4, and 3.5*10^-4 for the critic.

  2. the LR must be linearly annealed to 0 over the course of training.

I got it. I've already implemented LR annealing.

  3. the parameters 'g' and 'theta' are shared across the threads.

I see. I'm now testing sharing 'g' in the "shared_rmsprop" branch. Later I'll merge it into the master branch after confirming.
In my implementation, 'theta' corresponds to the variables in the global_network instance.

  4. t_max = 20 means 20 perceived frames

I wanted to ask about this too.
I used to implement frame skipping via the "ale.cfg" file with the "frame_skip=4" option.
When using this option, every time we call ale.act(chosen_action), the game advances 4 frames.

So I was storing frames for each state during one backup sequence (sequence of 5 states) like this.

(pattern A)
state[0] = { 0  4  8 12}     <- frames 0, 4, 8, 12
state[1] = { 4  8 12 16}
state[2] = { 8 12 16 20}
state[3] = {12 16 20 24}
state[4] = {16 20 24 28}

With this pattern, adjacent states share three perceived frames.

Another way to store frames with 4 frame skipping is

(pattern B)
state[0] = { 0  4  8 12}
state[1] = {16 20 24 28}
state[2] = {32 36 40 44}
state[3] = {48 52 56 60}
state[4] = {64 68 72 76}

If we choose pattern B, one chosen action will be repeated over 16 frames.
How should we implement frame skipping with t_max=20?
If you have any idea about this, please let me know.

About the running speed in steps, I'll check it on my environment; please wait a minute!

@aravindsrinivas
I've checked the running speed.
I'm outside now, so I checked it with my MacBookPro (Intel Core i7 2.5GHz).

It was 106 steps per second with 8 threads, so about 13 steps per second per thread.
I have another Core i7-6700 desktop machine, and I remember it was 1.5x (or 2x?) faster than my MacBook Pro.
(I'll check with Core i7 machine later)

Anyway, speed on my environment is much slower than DeepMind's.

@miyosuda

That's quite slow I guess... Maybe I got 30 steps per second for a single thread because of the GPU. I can't understand how DeepMind got it working at 70 steps/sec for a single thread. That's actually almost as fast as running DQN on a GPU.
So your code is about 5 times slower than DeepMind's, I guess... But we can still reproduce results with 1-2 days of running.....

@miyosuda
When I implemented it, I had it the same way as pattern A: (0,4,8,12), (4,8,12,16), (8,12,16,20), .... Even in DQN, that's the way they do it.

What we should do is - say we are '0', we take an action, repeat it 4 times. We would execute 0->1, 1->2, 2->3 and 3->4 using the same action that was decided at 0. We now again decide on action at '4', execute 4->5, 5->6, 6->7, 7->8 (4 repetitions) and decide on an action at '8', .. and so on.

Our states would be (0,4,8,12); (4,8,12,16); (8,12,16,20) .... Since they say t_max is equivalent to 20 perceived frames, we must stop at (64,68,72,76), i.e. you stop once you decide on an action at the 76th frame, and repeat it 4 times to get to the 80th frame. The 80th frame (with the past 3 perceived frames 68, 72, 76) would be our s_{t_max}, which is used to calculate our target through V(s_{t_max}). We would have 17 tuples (0,4,8,12), (4,8,12,16), ..., (64,68,72,76) for s_t, for t = 0 to t_max - 1. s_{t_max} would be (68,72,76,80).
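In code, that bookkeeping is roughly the following (a sketch; preprocess_frame() is a hypothetical helper that crops and resizes the ALE screen to 84x84):

import collections
import numpy as np

frames = collections.deque(maxlen=4)   # last 4 perceived frames, assumed pre-filled at episode start
reward = 0
for _ in range(4):                     # repeat the action chosen at the current state 4 times
    reward += ale.act(action)
frames.append(preprocess_frame(ale))   # keep only every 4th (perceived) frame
state = np.stack(frames, axis=-1)      # 84x84x4 tensor; adjacent states share 3 perceived frames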

I will actually try to implement a Theano version of this now that so many details are clear.. Please keep updating on whether you are able to implement it.

Hi, I also found this repo after trying to implement A3C from the DeepMind article; it's nice to see progress! However, when running the implementation the agents seem to perform only three actions from the legal action set given by the ALE interface, and these actions correspond to idle, fire and right. Could this be a result of the provided Pong binary being problematic, or of ACTION_SIZE being set to 3? The reason I'm asking is that when displaying the results after training for a few hours, the paddle is stuck at the edge of the Pong playing field.

@joabim
I think it is because the ACTION_SIZE is set to 3. He is using only the legal actions allowed for the Pong game, and Pong has only 3 actions (moving up/down/staying idle).

@aravindsrinivas
You're right! But for some reason, instead of up/down/idle my runtime printouts seem to suggest that the agents perform the actions noop/idle (0), fire (1) and right (3) (which corresponds to up when testing pong.bin in the Stella emulator) according to the Arcade Learning Environment documentation, but maybe I'm misinterpreting the minimal action set. Do you get a moving paddle?

@aravindsrinivas
Now I understand what you mean. I'll try that way too. Thanks!

@joabim
Thank you for joining the discussion.
It seems strange to get [0, 1, 3] from the Pong game ROM.

I tried this code,

from ale_python_interface import ALEInterface
ale = ALEInterface()
ale.loadROM("pong.bin")
real_actions = ale.getMinimalActionSet()
print "minimal actions=", real_actions

and I got the result

minimal actions= [0 3 4]

[0, 3, 4] means [idle, right, left]
Could you try the code above?

@joabim
Ah, there is another function named getLegalActionSet() in ALE, and I also tried it.

from ale_python_interface import ALEInterface
ale = ALEInterface()
ale.loadROM("pong.bin")
legal_actions = ale.getLegalActionSet()
print "legal actions=", legal_actions

and the result was

 legal actions= [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17]

So I think getLegalActionSet() is just returning all default actions. Which function are you using, getMinimalActionSet() or getLegalActionSet()?

@miyosuda
Exactly, when I invoke getMinimalActionSet() I get

[ 0 1 3 4 11 12]

so I'm wondering what has happened. I have tried rebuilding ALE but it doesn't change. When just running the code, the score remains at -21. By setting (forcing)

self.real_actions = [0, 3, 4]

I actually get some results from the training as to be expected. It's really weird that I can't get the real action set from ALE!

@joabim
Are you using the same ROM as I'm using?
And the file name of the ROM should be "pong.bin".
(Because ALE seems to detect the game ROM type from the file name.)

@miyosuda
Indeed I am! However, I'm using Python 3.5 and Anaconda; there could be some problem with loadROM(...) and a bytes literal as input (I had to prefix "pong.bin" with b to get it working). The ALEInterface displays the correct information in the terminal though... I could set up a Python 2.7 environment and try again.

@originholic
Hi, I adjusted the epsilon in your implementation to 0.1 and it converged at last (T=1901383 in my test). Thanks to the hyperparameters from @aravindsrinivas.

Also, I don't know why you @originholic would use a lock for the "env" in the training thread. Each thread has its own "env" object, so from what I understand they don't need the lock...

I think the entropy beta must also be set to 0.01. It is 0.1 in the constants.py file. In the paper, it is mentioned as 0.01.

@zhuchiheng
Thanks for trying out the code. Yes, the lock on the env is not actually required; it is there because I copied directly from my other project and didn't manage to clean up the code. I think it can still reach the desired score as long as the epsilon is larger than 0.001 if you run the cart-pole environment. It also needs a fixed initial learning rate instead of the random initialization, which will probably reach the desired score faster.

@miyosuda, @aravindsrinivas
Wow, I think I will have a lot to catch up on since I've been away for a while, and many thanks @aravindsrinivas for the hyperparameters and helpful suggestions. Regarding the speed, I agree with what was mentioned about the GIL of Python... it is not able to utilize the CPU efficiently for multi-threading; instead of using the threading module, the multiprocessing library can probably help to speed things up.

@aravindsrinivas
I see. As you say, the entropy regularization constant is given as 0.01 on page 11.

By the way, you wrote in previous post that

The decay parameter (called alpha in the paper) for RMSProp was 0.99 and the regularization constant (called epsilon in the paper) was 0.1.

What does "epsilon" in this comment mean?
Epsilon term in RMSProp calculation in equation (9) in page 9?
(I thought that epsilon in RMSProp is a small constant like 1e-10 to avoid zero division, but if your comment means this term, I'll try it!)

And did you hear anything from the DeepMind authors about the discount factor gamma?

@originholic
Thank you for introducing the multiprocessing library in Python. I didn't know about it. Let me check it.

@joabim
I found why your minimal action set size differs from mine.

ALE seems to have changed the minimal action set for Pong one month ago.

Farama-Foundation/Arcade-Learning-Environment@e1c811a

I'm now using a forked version of ALE to which I added some modifications in order to run in a multithreaded environment.

https://github.com/miyosuda/Arcade-Learning-Environment

This version still uses 3 actions.
I don't know why they changed the action size from 3 to 6.

@miyosuda
Thank you so much! That explains it!

@aravindsrinivas
Awesome work on the hyperparameters! I wonder if the source code for this project will be released some time, like they did with DQN for the "Human-level control through deep reinforcement learning" article.

@zhuchiheng
Hi, I have been running this for a million (T = 109200)..
But it still scores not more than 18 (mostly 20 and 21)....
Can you tell me how was the trajectory of the scores for you over T?

@miyosuda
Does the code work for you now?

@aravindsrinivas
Simple grid maze task converged to optimal result easily with this code, but the pong's result is same as yours (not more than 17 and mostly 20 and 21 after one day learning)

Did you mean -17 and -20/21?
Also, what's the value of T you get after a day? 1.2 million is too slow :-(

With the correct ALE I also wind up at -17 at most after a day's worth of training on Pong. In Breakout, the score maxes out at 2-3 for each episode.

@aravindsrinivas

Did you mean -17 and -20/21?

Yes, sorry, I meant -17 and -20/21.

Also, what's the value of T you get after a day?

After 19 hours 20 minutes, the global T was 10870045 (10.8 million) on my Core i7-6700 machine.
(So about 19.5 steps per second per thread.)

By the way, I found that ALE 0.5 has a default setting of "repeat_action_probability=0.25".
This parameter was introduced in ALE 0.5 and it causes poor performance.

https://groups.google.com/forum/#!topic/deep-q-learning/p4FAIaabwlo

So I disabled this with repeat_action_probability=0.0 in "ale.cfg".
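(If I remember right, the same thing can also be set programmatically before loading the ROM; I am writing this from memory, so treat it as an assumption rather than the code in this repo:)

ale = ALEInterface()
ale.setFloat('repeat_action_probability', 0.0)  # disable the sticky actions introduced in ALE 0.5
ale.loadROM("pong.bin")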

@miyosuda

Isn't that equivalent to 43.2 million frames? That should be between 10 and 11 epochs.... Their graph shows that by 10 epochs they are able to reach scores around +10. So this is definitely not working..

@aravindsrinivas

Isn't that equivalent to 43.2 million frames?

Yes, it is 43.2 million frames including skipped ones.

By the way, I'm still not sure I understand what the "epsilon" parameter you mentioned in the earlier comment means.

The decay parameter (called alpha in the paper) for RMSProp was 0.99 and the regularization constant (called epsilon in the paper) was 0.1.

Is this the epsilon of RMSProp in equation (9), or the "beta" parameter in equation (7)?
Is it possible to share the original message that the DeepMind author sent you?

I'll try an easier task than Pong to confirm the correctness of my implementation.

Hi @aravindsrinivas, try these hyper parameters:
T = 0 # Global shared counter
TMAX = 5000000 # Max iteration of global shared counter
THREADS = 8 # Number of running thread
N_STEP = 5 # Number of steps before update
WISHED_SCORE = 3000 # Stopper of iterative learning
GAMMA = 0.99 # Decay rate of past observations deep q-learning
ACTIONS = 1 # Number of valid actions
STATES = 4 # Number of state
ENTROPY_BETA = 0.001 # Entropy regulation term: beta, default: 0.001

INIT_LEARNING_RATE = 0.0001 # default: 1e-3

OPT_DECAY = 0.99 # Discouting factor for the gradient, default: 0.99
OPT_MOMENTUM = 0.0 # A scalar tensor, default: 0.0
OPT_EPSILON = 0.1 # 0.005 # value to avoid zero denominator, default: 0.01

@zhuchiheng
Is this for a continuous world problem? Asking because you have only 1 action? I was actually talking with respect to Pong.

@aravindsrinivas
There is another project trying A3C, and its results seem much better than mine.
https://github.com/muupan/async-rl

Please try his implementation and setting.

There has been progress with the learning.

Today I've changed some parameters.

t_max = 5
RMSprop epsilon = 0.1

and I've changed the loss function a bit so that the gradient of the critic is now half of what it was before, following muupan's setting.
(The learning rate of the critic was already half of the actor's, but now it is effectively halved again on top of that.)

40c65d4

tf.nn.l2_loss() means

output = sum(t ** 2) / 2

whereas before I used a loss function like this:

output = sum(t ** 2)
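So the change to the critic loss amounts to something like this (a sketch using the tensor names as I understand them, where self.r holds the n-step return R and self.v is the value output):

# before: sum((R - V)^2)
value_loss = tf.reduce_sum(tf.square(self.r - self.v))
# after: tf.nn.l2_loss halves it, i.e. 0.5 * sum((R - V)^2)
value_loss = tf.nn.l2_loss(self.r - self.v)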

I'm not confident now, but the learning rate ratio of actor/critic might be important.

I've been running it for only 13 hours, but the score began increasing with this setting.

[screenshot: training score graph]

I'll keep running this setting to see the result.

@miyosuda
I have started the same experiment (from the current master) on a single Intel Xeon E5-2680 (however with a PARALLEL_SIZE of 8, as in your implementation). I have two Xeons I can utilize, is there any setup you would like to see me go for?

@joabim
It's great to hear that you have a Xeon machine.
The Xeon E5-2680 has 8 cores and can run 16 threads, correct?
So could you try a PARALLEL_SIZE of 16?

I have two Xeons I can utilize,

Does this mean you have two PCs and each PC has single Xeon chip?

Anyway, I would like you to try 16 parallel threads to check whether increasing the parallel size will speed up learning!

As far as my test with the current master settings goes, the result after 38 hours (18 million steps) looks like this:

[screenshot: score graph after 38 hours]

And the video capture after 24 hours of learning looked like this:

https://www.youtube.com/watch?v=cFWL_y9BVaQ

@miyosuda
It's even the third revision (12 cores + 12 virt), and I am lucky to be able to borrow this machine! Hence, 16 threads should not be a problem. I have just restarted the test with a parallel size of 16; let's see how it performs!

Yes, the second Xeon is in another PC so I can't promise anything, but it might be possible to run a parallel test if we use a parallel size >12, perhaps two runs of 20 threads or so.

The results are promising; we are starting to converge! I think we can reach article-grade results (~20 points after 10 hours of training) soon.

@joabim

It's even the third revision (12 cores + 12 virt),

Wow so great!

I'm curious about what happens if we run it with 24 threads.
When you run with 16 or 24 threads, is it possible to see the CPU usage of each core?

When I run this program with 8 threads, the CPU usage of each core is around 80%.

[screenshot: system monitor showing per-core CPU usage]

As @originholic suggests, multi-threading in Python seems to be inefficient due to the Global Interpreter Lock (GIL).

Most of the learning computation happens inside TensorFlow's C++ code, so I'm not sure how much the GIL affects performance, but if the CPU usage is low when we increase the thread count, I need to replace the threading module I'm using now with the multiprocessing module.

So I'd be happy if I could see the CPU usage in your environment. Thanks!

This is the current CPU usage (about 85% per core)

[screenshot: CPU usage per core]

Regarding the score recording for tensorboard, couldn't we record the average of the 16 threads? Sometimes the current high score is broken by a thread that isn't index 0, which results in the summary writers not recording the current "overall" performance of the agents.

@joabim
Thanks! As far as I can see from your result, the CPU usage seems OK. I'll stick with multi-threading for the moment.

Regarding the score recording for tensorboard, couldn't we record the average of the 16 threads?

I was thinking the same thing.
I've added a modification to record scores from all threads and pushed it to the "all_scores" branch.
I'll test this branch, and if there is no problem, I'll merge it into master.

(I'm not averaging the score, but does this help?)

I think we should follow the epoch convention for testing, and only with respect to the global network parameters, not the thread parameters.

That is, we must train using all threads and update the global parameters, and periodically test only the global network. That is what has been done in the paper. Every 4 million frames (1 million steps, the value of T), a testing epoch must be conducted that lasts 500,000 frames (125,000 steps). I think this will make it better... What do you think?
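Something like the following loop is what I have in mind (purely a sketch; the helper functions are made up):

EVAL_INTERVAL_STEPS = 1000000  # test every 1 million steps (4 million frames)
EVAL_LENGTH_STEPS = 125000     # each testing epoch lasts 125,000 steps (500,000 frames)

next_eval_at = EVAL_INTERVAL_STEPS
while global_t < TMAX:
    global_t = train_one_iteration_on_all_threads()   # threads update the global parameters
    if global_t >= next_eval_at:
        score = evaluate(global_network, steps=EVAL_LENGTH_STEPS)  # no learning during testing
        print("epoch score:", score)
        next_eval_at += EVAL_INTERVAL_STEPS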

We also don't need to output the thread score (even if it is only thread 0) during the training phase.

This just follows the same convention as DQN in DeepMind's code or Nathan Sprague's Lasagne implementation.

@miyosuda
Very glad to hear that you have learning going on in the ALE environment!!
Also, I agree with the idea of @aravindsrinivas. How about we add one more thread dedicated to testing the global net, i.e. 16 training threads + 1 validation thread, so we just need to monitor the validation thread for the score as a function of the global T?

If we use the multiprocessing module, I think we can have the validation thread handle all the I/O scheduling to speed up training. I am currently working on the multiprocessing module since I want to improve the training speed of the continuous domain as well.

@aravindsrinivas @originholic
Thank you for the suggestions. Let me think about how to do validation with the global network efficiently.

I am currently working on the multiprocessing module since I want to improve the training speed of the continuous domain as well.

Great! If the performance increases, let me know! I think the current 80% CPU usage has room for improvement.

Alright, now I'm back and I understand that we need to change the implementation. The results from my run over the weekend follow (even though they don't matter anymore):

[screenshot: score graph from the weekend run]

@joabim
Thank you for testing.

In my environment, the result on my machine over these 4 days is:

[screenshot: score graph over 4 days]

and the step count was around 53 million. (The learning rate becomes zero after 60 million steps.)
Hmm, what's the difference between your environment and mine...

Let me try one more run to check.

I hadn't noticed, but I found that, in muupan's project,

https://github.com/muupan/async-rl

the learning rates of the actor/critic are the opposite of mine.
The learning rate of the actor is half the critic's, and his result is better than mine.

I'll try his setting in this branch.

https://github.com/miyosuda/async_deep_reinforce/tree/muupan_lr_setting

After 59 million steps of learning, I visualized the weights of the first convolution layer.

$ python a3c_visualize.py

The result was like this.

[image: visualization of the first convolution layer weights]

I think the second column represents the upward movement of the paddle. (One column represents the 4 frames of the input.)

I realized I forgot to install ALE into my new Anaconda environment after building it (the ALE fork that you made with correct Pong support and multithreading), which resulted in my previous run using the incorrect version of ALE... I am redoing the test for 16 threads using the muupan_lr_setting branch now! I'll let you know how it goes.

@joabim
I see.
BTW, I'm going to ask muupan about his settings in his issues thread. He said that he asked the DeepMind authors about tuning, and I hope I can apply his feedback to mine later.

@miyosuda For everyone's information, I summarized their settings here: https://github.com/muupan/async-rl/wiki

@muupan Thank you!

@miyosuda Hi. I've got an LSTM working with your code. I've only tested it on a toy problem (a 4-state MDP) rather than an Atari game, but it seems to be working properly, and as well as the feedforward net does. The code is at https://github.com/cjratcliff/async_deep_reinforce. I've made quite a few changes for my own version, many of them outside the LSTM parts, so I'm happy to answer any questions. For using it on Atari, in addition to increasing the RNN size, I'd recommend changing the cell type from BasicRNNCell to BasicLSTMCell and removing the activation function argument to that function.

@cjratcliff Thank you for sharing your LSTM version!!! Let me try it!!!

@cjratcliff I've pushed my LSTM version.
To make pong work with LSTM, I added

  1. Unrolling the LSTM cell up to 5 time steps (LOCAL_T_MAX time steps); see the sketch below.
    Now the back-prop calculation is batched over the unrolled LSTM.
  2. Calling actions.reverse(), states.reverse() etc. again to restore the normal input order.
    When calculating "R", I call reverse() to make the calculation easier (because, starting from the last state, R can be calculated recursively as written in the original paper), so I call reverse() a second time to restore the order.

With the LSTM, the Pong score hit the maximum easily. Thanks.
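For reference, the unrolling in point 1 looks roughly like this (a sketch using the TF 0.x-era RNN API; the feature tensor and the softmax weights here are assumptions, not the exact code):

cell = tf.nn.rnn_cell.BasicLSTMCell(256)
state = cell.zero_state(1, tf.float32)
pi_values = []
with tf.variable_scope("lstm") as scope:
    for t in range(LOCAL_T_MAX):
        if t > 0:
            scope.reuse_variables()     # every time step shares the same LSTM weights
        h, state = cell(features[t:t+1, :], state)
        pi_values.append(tf.nn.softmax(tf.matmul(h, W_pi) + b_pi))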

@miyosuda Great to see it working so well, thanks.

@miyosuda
Your work is great!
I tried your program on the game "Breakout".
In one day of training, I got 833 points as the maximum score.

BTW, during training I encountered some trouble:
"pi" sometimes becomes NaN, and the saved data was not usable for demo play.

The reason is that "pi" can become 0.0 and your code does not handle that correctly.
I think you'd better change the following code in the file "game_ac_network.py".
I changed the code as follows and have had no problems so far.

  • Current code:
entropy = -tf.reduce_sum(self.pi * tf.log(self.pi), reduction_indices=1)
policy_loss = - tf.reduce_sum( tf.reduce_sum( tf.mul( tf.log(self.pi), self.a ), reduction_indices=1 ) * self.td + entropy * entropy_beta )
  • My proposal:
entropy = -tf.reduce_sum(self.pi * tf.log(tf.clip_by_value(self.pi, 1e-20, 1.0)), reduction_indices=1)
policy_loss = - tf.reduce_sum( tf.reduce_sum( tf.mul( tf.log(tf.clip_by_value(self.pi, 1e-20, 1.0)), self.a ), reduction_indices=1 ) * self.td + entropy * entropy_beta )

@Itsukara Sorry for the late reply (I didn't notice your post until now), and thank you for the suggestion!
As you suggest, my code can't handle a zero pi value. I'll test it and apply your fix to my repo later.
Thanks!

The code performs really well on some games, but on others it doesn't quite reach the scores reported in the paper. I wonder why that is. For example, in Space Invaders the reported score is 23846.0. The model I trained comes nowhere near that. :( Did anyone else manage to get better than around 1500 for Space Invaders?


I just saw some discussion on using multiprocessing in this thread; I wonder what the current status is?

I opened a dedicated ticket on this:

#27

Hi @miyosuda, thanks for sharing the code. I have a question about the A3C LSTM implementation.

In the GameACLSTMNetwork class, at line 217, why share the LSTM weights among threads? Maybe it makes sense to create "no reuse" LSTM weights for every worker and the global_network, and synchronize all the variables from the global_network's.

Thanks!