haarnoja / sac

Soft Actor-Critic

Hyperparameter Advice

Random-Word opened this issue · comments

Hi Tuomas. I'm trying out your SAC implementation on some of the continuous gym environments and I'm curious whether you have any recommendations for tuning the hyperparameters. Using the defaults and a temperature of 1, for instance, leads to wildly oscillating policy performance on LunarLanderContinuous or InvertedPendulum: the policy may produce very good returns, then the next entry in progress.csv shows terrible returns, and it keeps oscillating up and down without stabilizing. Does that suggest the temperature parameter needs to be tuned, or are some of the other default hyperparameters not well suited to these sorts of tasks?

An example of the episode return for lunar lander against samples:
[Plot: LunarLanderContinuous episode return vs. environment samples]

Thanks!

Learning tasks that require high precision can be hard with maximum entropy RL algorithms. There can be a lot of variation in returns since the optimal policy is stochastic. You can try evaluating the policy using the mean action when using a Gaussian representation, which can give more consistent results. Using the mean instead of sampling from the policy distribution is just a heuristic and is not guaranteed to work, but at least on simulated locomotion tasks it seems to help.
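A minimal sketch of that deterministic-evaluation heuristic, assuming a hypothetical PyTorch-style policy that outputs the mean and log standard deviation of a tanh-squashed Gaussian (the actual repo is TensorFlow-based, and `policy_net`/`env` here are illustrative names):

```python
import torch

def evaluate(env, policy_net, num_episodes=10, deterministic=True):
    """Average return over a few episodes, using the mean action when deterministic=True."""
    returns = []
    for _ in range(num_episodes):
        obs, done, ep_ret = env.reset(), False, 0.0
        while not done:
            mean, log_std = policy_net(torch.as_tensor(obs, dtype=torch.float32))
            if deterministic:
                action = torch.tanh(mean)  # use the mean of the Gaussian, squashed to the action bounds
            else:
                eps = torch.randn_like(mean)
                action = torch.tanh(mean + log_std.exp() * eps)  # sample, as during training
            obs, reward, done, _ = env.step(action.detach().numpy())
            ep_ret += reward
        returns.append(ep_ret)
    return sum(returns) / len(returns)
```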

You can also take a look at the entropy to tune the reward scale. A good rule of thumb is to have average entropy roughly equal to the negative of the number of action dimensions.
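As a rough illustration of that rule of thumb (the function name, thresholds, and adjustment factors here are hypothetical, not from the repo), you could monitor the average policy entropy from the training diagnostics and nudge the reward scale until it sits near -|A|:

```python
def suggest_reward_scale(avg_entropy, action_dim, current_scale, tol=0.5):
    """Heuristic sketch: keep average policy entropy near -|A| by adjusting the reward scale."""
    target = -float(action_dim)
    if avg_entropy > target + tol:
        return current_scale * 2.0  # policy too stochastic: weight rewards more heavily
    if avg_entropy < target - tol:
        return current_scale * 0.5  # policy too deterministic: weight rewards less
    return current_scale
```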

It was using a deterministic evaluation policy, but the GMM policy without the reparameterization trick was unstable over a wide range of temperatures for this task. Your update nine days ago to match the latest paper made the results much more reliable. Thanks!
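For reference, a minimal sketch of the reparameterization trick for a tanh-squashed Gaussian policy, written in PyTorch purely as an illustration (the repo itself uses TensorFlow, so this is not its exact implementation):

```python
import torch
from torch.distributions import Normal

def rsample_action(mean, log_std):
    """Sample an action so that gradients flow through mean and std (reparameterization trick)."""
    std = log_std.exp()
    eps = torch.randn_like(mean)     # noise drawn independently of the policy parameters
    pre_tanh = mean + std * eps      # differentiable with respect to mean and std
    action = torch.tanh(pre_tanh)    # squash into the action bounds
    # log-probability with the tanh change-of-variables correction
    log_prob = Normal(mean, std).log_prob(pre_tanh) - torch.log(1 - action.pow(2) + 1e-6)
    return action, log_prob.sum(-1)
```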
