haarnoja / sac

Soft Actor-Critic

Hyperparameter Advice

Random-Word opened this issue · comments

Hi Tuomas. I'm trying out your SAC implementation on some of the continuous gym environments and I'm curious whether you have any recommendations for tuning the hyperparameters. Using the defaults and a temperature of 1, for instance, leads to wildly oscillating policy performance on LunarLanderContinuous or InvertedPendulum: the policy may produce very good returns, then the next entry in progress.csv shows terrible returns, and it keeps oscillating up and down without stabilizing. Does that suggest the temperature parameter needs to be tuned, or are some of the other default hyperparameters not well suited to these sorts of tasks?

An example of the episode return for lunar lander against samples:
[Plot: LunarLanderContinuous episode return vs. environment samples]

Thanks!

Learning tasks that require high precision can be hard with maximum entropy RL algorithms. There can be a lot of variation in returns since the optimal policy is stochastic. You can try evaluating the policy using the mean action when using a Gaussian representation, which can give more consistent results. Using the mean instead of sampling from the policy distribution is just a heuristic and is not guaranteed to work, but at least on simulated locomotion tasks it seems to help.
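A minimal sketch of that deterministic-evaluation heuristic, assuming a hypothetical PyTorch-style policy that outputs the mean and log standard deviation of a tanh-squashed Gaussian (the actual repo is TensorFlow-based, and `policy_net`/`env` here are illustrative names):

```python
import torch

def evaluate(env, policy_net, num_episodes=10, deterministic=True):
    """Average return over a few episodes, using the mean action when deterministic=True."""
    returns = []
    for _ in range(num_episodes):
        obs, done, ep_ret = env.reset(), False, 0.0
        while not done:
            mean, log_std = policy_net(torch.as_tensor(obs, dtype=torch.float32))
            if deterministic:
                action = torch.tanh(mean)  # use the mean of the Gaussian, squashed to the action bounds
            else:
                eps = torch.randn_like(mean)
                action = torch.tanh(mean + log_std.exp() * eps)  # sample, as during training
            obs, reward, done, _ = env.step(action.detach().numpy())
            ep_ret += reward
        returns.append(ep_ret)
    return sum(returns) / len(returns)
```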

You can also take a look at the entropy to tune the reward scale. A good rule of thumb is to have average entropy roughly equal to the negative of the number of action dimensions.
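As a rough illustration of that rule of thumb (the function name, thresholds, and adjustment factors here are hypothetical, not from the repo), you could monitor the average policy entropy from the training diagnostics and nudge the reward scale until it sits near -|A|:

```python
def suggest_reward_scale(avg_entropy, action_dim, current_scale, tol=0.5):
    """Heuristic sketch: keep average policy entropy near -|A| by adjusting the reward scale."""
    target = -float(action_dim)
    if avg_entropy > target + tol:
        return current_scale * 2.0  # policy too stochastic: weight rewards more heavily
    if avg_entropy < target - tol:
        return current_scale * 0.5  # policy too deterministic: weight rewards less
    return current_scale
```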

It was using a deterministic evaluation policy, but the GMM policy without the reparameterization trick was unstable over a wide range of temperatures for this task. Your update nine days ago to match the latest paper made the results much more reliable. Thanks!
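For reference, a minimal sketch of the reparameterization trick for a tanh-squashed Gaussian policy, written in PyTorch purely as an illustration (the repo itself uses TensorFlow, so this is not its exact implementation):

```python
import torch
from torch.distributions import Normal

def rsample_action(mean, log_std):
    """Sample an action so that gradients flow through mean and std (reparameterization trick)."""
    std = log_std.exp()
    eps = torch.randn_like(mean)     # noise drawn independently of the policy parameters
    pre_tanh = mean + std * eps      # differentiable with respect to mean and std
    action = torch.tanh(pre_tanh)    # squash into the action bounds
    # log-probability with the tanh change-of-variables correction
    log_prob = Normal(mean, std).log_prob(pre_tanh) - torch.log(1 - action.pow(2) + 1e-6)
    return action, log_prob.sum(-1)
```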
